The best-known achievement of deep learning is the use of feedforward convolutional neural networks (ConvNets) to solve computer vision problems, but comparatively little public attention has been devoted to using recurrent neural networks to model temporal relationships. According to the deep learning literature, LSTM networks have proved more effective than traditional RNNs. This article was written by Zachary Chase Lipton, a doctoral student at the University of California, San Diego (UCSD) who studies the theory and application of machine learning. It explains the basics of recurrent networks in plain language and introduces the long short-term memory (LSTM) model.

Given the wide applicability of deep learning to real-world tasks, it has attracted the attention of many technical experts, investors and non-specialists. Although the best-known achievement of deep learning is the use of feedforward convolutional neural networks (ConvNets) to solve computer vision problems, comparatively little public attention has been devoted to using recurrent neural networks to model temporal relationships.

(Note: to help you start experimenting with LSTM recurrent networks, I have attached a simple micro instance with numpy, theano and a git clone of Jonathan Raiman’s LSTM example preinstalled.)

In a recent article, “Learning to Read with Recurrent Neural Networks,” I explained why, despite their incredible success, feedforward networks are limited by their inability to model temporal relationships explicitly and by the assumption that every data point is a fixed-length vector. In the conclusion of that article, I promised a follow-up explaining the basics of recurrent networks and introducing the long short-term memory (LSTM) model.

First, the basics of neural networks. A neural network can be represented as a graph of artificial neurons, or nodes, connected by directed edges that model synapses. Each neuron is a processing unit that takes as input the outputs of the nodes connected to it. Before emitting its output, each neuron applies a nonlinear activation function. It is this activation function that gives neural networks the ability to model nonlinear relationships.
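As a minimal sketch of this idea, the snippet below computes the output of one layer of artificial neurons in plain Python. The function names (`sigmoid`, `neuron_output`, `layer_output`), the choice of a sigmoid activation and the example weights are illustrative assumptions, not something taken from the article.

```python
import math

def sigmoid(z):
    """Logistic (S-shaped) activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """One artificial neuron: weighted sum of its inputs, then a nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def layer_output(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical example: two inputs feeding a layer of two neurons.
hidden = layer_output([1.0, -2.0], [[0.5, 0.25], [-1.0, 0.0]], [0.0, 1.0])
```

Without the call to `sigmoid`, stacking such layers would collapse into a single linear map, which is why the nonlinearity matters.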

Now, consider the recent well-known paper “Playing Atari with Deep Reinforcement Learning,” which combines ConvNets and reinforcement learning to train computers to play video games. The system surpasses human performance in some games, such as Breakout, where the appropriate strategy at any moment can be inferred by looking at the screen. However, in games such as Space Invaders, where the optimal strategy requires planning over a long time span, the system performs far worse than humans.

This motivates the recurrent neural network (RNN), which gives a neural network the ability to model time explicitly by adding a self-connected hidden layer that spans time steps. In other words, the hidden layer feeds not only into the output but also into the hidden layer at the next time step. In this article, I’ll use some diagrams of recurrent networks excerpted from my forthcoming review of the literature on the subject.

Now, we can unfold the network across two time steps to visualize its connections in an acyclic form. Note that the weights (from input to hidden and from hidden to output) are the same at each time step. Recurrent networks are sometimes described as deep networks whose depth occurs not only between input and output but also across time steps, where each time step can be thought of as a layer.
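The unfolding can be sketched as an ordinary loop in which the same weights are reused at every time step. The version below is a deliberately tiny scalar RNN; the names `rnn_forward`, `w_xh`, `w_hh` and `w_hy`, the `tanh` activation and the example values are assumptions made for illustration.

```python
import math

def rnn_forward(xs, w_xh, w_hh, w_hy, h0=0.0):
    """Run a scalar RNN over the sequence xs; return (hidden states, outputs)."""
    h = h0
    hiddens, outputs = [], []
    for x in xs:
        # The hidden state mixes the current input with the previous hidden state.
        # The same three weights are used at every time step.
        h = math.tanh(w_xh * x + w_hh * h)
        hiddens.append(h)
        outputs.append(w_hy * h)
    return hiddens, outputs

# Hypothetical example: a single pulse followed by silence.
hiddens, outputs = rnn_forward([1.0, 0.0, 0.0], w_xh=1.0, w_hh=0.5, w_hy=2.0)
```

Although the second and third inputs are zero, the corresponding outputs are not, because the hidden state carries a trace of the first input forward in time. Setting `w_hh` to zero removes that memory and reduces the model to a feedforward network applied independently at each step.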

Once unfolded, these networks can be trained end to end using backpropagation. This extension of backpropagation across time steps is called backpropagation through time.
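To make backpropagation through time concrete, here is a small sketch for a scalar RNN whose loss depends only on the final hidden state. The functions `forward`, `loss` and `bptt`, the squared-error loss and the `tanh` activation are all assumptions chosen to keep the example short; real implementations work with vectors and matrices.

```python
import math

def forward(xs, w_x, w_h):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def loss(xs, target, w_x, w_h):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(xs, w_x, w_h)[-1] - target) ** 2

def bptt(xs, target, w_x, w_h):
    """Return (dL/dw_x, dL/dw_h) by walking the unfolded network backwards."""
    hs = forward(xs, w_x, w_h)
    grad_w_x = grad_w_h = 0.0
    dh = hs[-1] - target              # dL/dh_T
    for t in range(len(xs), 0, -1):
        dz = dh * (1.0 - hs[t] ** 2)  # back through tanh at step t
        grad_w_x += dz * xs[t - 1]    # shared weights: contributions add up
        grad_w_h += dz * hs[t - 1]
        dh = dz * w_h                 # pass the error to the previous time step
    return grad_w_x, grad_w_h
```

Because the weights are shared across time steps, each step contributes a term to the same gradient, and the error passed to earlier steps is repeatedly multiplied by `w_h` and the `tanh` derivative.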

However, there is a problem, described in Yoshua Bengio’s frequently cited paper “Learning Long-Term Dependencies with Gradient Descent Is Difficult”: the vanishing gradient. In other words, the error signal from later time steps often cannot travel far enough back in time to influence the network at much earlier time steps. This makes it difficult to learn long-range effects, such as a pawn you let go coming back to hurt you 12 moves later.
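The effect is easy to observe numerically. The sketch below, again for a hypothetical scalar `tanh` RNN, tracks how strongly the final hidden state depends on each earlier hidden state; the function name `gradient_through_time` and the chosen weights are illustrative assumptions.

```python
import math

def gradient_through_time(xs, w_x, w_h):
    """Return |d h_T / d h_t| for t = T, T-1, ..., 0 in a scalar tanh RNN."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    grad = 1.0
    magnitudes = [abs(grad)]
    for t in range(len(xs), 0, -1):
        # Each step back multiplies by w_h * tanh'(z_t); tanh' is at most 1.
        grad *= w_h * (1.0 - hs[t] ** 2)
        magnitudes.append(abs(grad))
    return magnitudes

# Twelve time steps, echoing the 12-move example in the text.
mags = gradient_through_time([1.0] * 12, w_x=1.0, w_h=0.5)
```

With a recurrent weight of 0.5, every step back shrinks the signal by at least half, so after 12 steps it is at most 0.5 to the power 12, which is less than one part in four thousand.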

The remedy for this problem is the long short-term memory (LSTM) model, first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997. In this model, the conventional neuron, a unit that applies a sigmoidal activation to a linear combination of its inputs, is replaced by a memory cell. Each memory cell is associated with an input gate, an output gate and an internal state that feeds into itself undisturbed across time steps.

In this model, three sets of weights are trained for each memory cell, each taking as input the current input together with the complete hidden state from the previous time step. One feeds into the input node, shown at the bottom of the image above. One feeds into the input gate, shown at the bottom of the rightmost side of the cell. Another feeds into the output gate, shown at the top of the rightmost side. Each blue node is associated with an activation function, typically a sigmoid, and the Π nodes represent multiplication. The centermost node in the cell is called the internal state; it feeds back into itself across time steps with a weight of 1. The self-connected edge of the internal state is called the constant error carousel, or CEC.
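A single time step of such a memory cell can be sketched in a few lines. This is a scalar illustration under stated assumptions: the names `lstm_step` and `params`, the `tanh` applied to the input node and to the internal state before the output gate, and the example weights are choices made for this sketch, and the cell has no forget gate, matching the description above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, s_prev, params):
    """One time step of a scalar LSTM memory cell without a forget gate.

    params maps "node", "in_gate" and "out_gate" to (w_x, w_h, bias) triples:
    one weight set per component, each reading x and the previous hidden state.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    g = math.tanh(pre("node"))    # input node: candidate value to store
    i = sigmoid(pre("in_gate"))   # input gate: how much of g gets in
    o = sigmoid(pre("out_gate"))  # output gate: how much of the state gets out
    s = s_prev + g * i            # internal state: self-connection with weight 1
    h = math.tanh(s) * o          # cell output seen by the rest of the network
    return h, s

# Hypothetical weights with the input gate held shut by a large negative bias.
params = {
    "node": (1.0, 0.5, 0.0),
    "in_gate": (0.0, 0.0, -50.0),
    "out_gate": (0.0, 0.0, 0.0),
}
h, s = 0.0, 0.7
for _ in range(100):
    h, s = lstm_step(1.0, h, s, params)
```

With the input gate closed, the internal state keeps its initial value of 0.7 across all 100 steps, because the self-connection has weight 1 and nothing is added to it. The same weight of 1 is what lets error flow backwards through the state without shrinking.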

In terms of the forward pass, the input gate learns when to let activation into the memory cell, while the output gate learns when to let activation out of it. Correspondingly, in terms of the backward pass, the output gate learns when to let error flow into the memory cell, while the input gate learns when to let it flow out of the memory cell and on to the rest of the network. These models have proved very successful on a variety of handwriting recognition and image captioning tasks. Perhaps with a little more love, they can even win at Space Invaders.