I'd also add three very interesting recent papers:

On Characterizing the Capacity of Neural Networks using Algebraic Topology - https://arxiv.org/abs/1802.04443

Backprop as Functor - https://arxiv.org/abs/1711.10455

The simple essence of automatic differentiation - https://arxiv.org/abs/1804.00746