In a feedforward neural network, information flows in one direction only. The network can be regarded as a complex function whose output depends only on the current input, so it cannot process sequential (time-series) data.
A recurrent neural network (RNN), by contrast, is a neural network with short-term memory. In an RNN, a neuron receives information not only from other neurons but also from itself, forming a network structure with loops. The parameters of an RNN can be learned with the backpropagation-through-time (BPTT) algorithm, which propagates error information step by step in reverse time order. However, when the input sequence is long, gradient explosion or gradient vanishing may occur. Gradient explosion can be handled by gradient clipping, but gradient vanishing requires introducing long short-term memory (LSTM) networks, i.e. a gating mechanism.
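Gradient clipping, mentioned above as the remedy for exploding gradients, can be sketched in a few lines. This is a minimal illustration (the function name and the `max_norm` threshold are assumptions, not from the original text):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed max_norm -- the usual fix for exploding gradients."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# a deliberately huge gradient: global norm 20, clipped back to 5
clipped = clip_by_norm([np.full((2, 2), 10.0)], max_norm=5.0)
```

Note that clipping rescales the whole gradient vector, preserving its direction; it does not fix vanishing gradients, which is why the gating mechanism of LSTM is still needed.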
There are three ways to add memory capacity to a network:
* A time-delay neural network: time delays are added to the non-output layers of a feedforward network to record the outputs of neurons at the most recent time steps. Because the delayed network shares weights along the time dimension, the number of parameters is reduced.
* A nonlinear autoregressive model with exogenous inputs (NARX): at every time step t an external input produces an output, and delay units record the most recent external inputs and outputs.
* A recurrent neural network: it uses neurons with self-feedback and can process time-series data of arbitrary length. Given an input sequence, the RNN updates the activity value of the hidden layer with the feedback edge, using both the activity value at the previous time step and the input at the current time step.
The hidden-layer state of a recurrent neural network therefore depends not only on the input at the current time step but also on the hidden-layer state at the previous time step, with a sigmoid activation function applied. If we regard the state at every time step as one layer of a feedforward network, the RNN can be viewed as a neural network that shares weights along the time dimension; that is, the three sets of parameters (input-to-hidden weights, hidden-to-hidden weights, and the bias) are shared across all time steps.
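The update rule just described can be written as h_t = sigmoid(U h_{t-1} + W x_t + b). A minimal forward-pass sketch follows; the names U, W, b and all the shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, U, W, b):
    """Run a simple RNN over a sequence:
    h_t = sigmoid(U @ h_{t-1} + W @ x_t + b).
    The same U, W, b are reused at every time step (weight sharing)."""
    h = np.zeros(U.shape[0])          # h_0 = 0
    states = []
    for x in xs:                      # one iteration per time step t
        h = sigmoid(U @ h + W @ x + b)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
hidden, inp, T = 4, 3, 5
U = rng.normal(size=(hidden, hidden))
W = rng.normal(size=(hidden, inp))
b = np.zeros(hidden)
H = rnn_forward(rng.normal(size=(T, inp)), U, W, b)
# H holds one hidden state per time step: shape (T, hidden) = (5, 4)
```

Each row of H depends on the entire input prefix up to that step, which is exactly the short-term memory the text describes.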
Because of this short-term memory, which acts as a storage device, the computational power of an RNN is very strong: a feedforward neural network can approximate any continuous function, while a recurrent neural network can simulate any program.
According to the universal approximation theorem, a two-layer feedforward neural network can approximate any continuous function on any bounded closed set. Therefore, the two functions of a dynamical system (the state-transition function and the output function) can each be approximated by a two-layer feedforward network.
All Turing machines can be simulated by a fully connected recurrent network composed of neurons with a sigmoid activation function; a fully connected recurrent neural network can therefore approximately solve all computable problems.
Recurrent neural networks are applied in three modes: sequence-to-category mode, synchronous sequence-to-sequence mode, and asynchronous sequence-to-sequence mode.
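As an example of the first of these modes, sequence-to-category classification typically summarizes the whole sequence by its final hidden state and feeds that into a softmax classifier. This is a sketch only, with assumed names (`V`, `c`) and random stand-in states rather than a trained model:

```python
import numpy as np

def seq_to_category(states, V, c):
    """Sequence-to-category mode: use the hidden state at the final
    time step as a summary of the whole sequence, then classify it
    with a softmax layer (V, c are the classifier weights and bias)."""
    logits = V @ states[-1] + c
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
T, hidden, classes = 6, 4, 3
states = rng.random(size=(T, hidden))     # stand-in for RNN hidden states
probs = seq_to_category(states,
                        rng.normal(size=(classes, hidden)),
                        np.zeros(classes))
# probs is a probability distribution over the 3 categories
```

In the synchronous sequence-to-sequence mode the same classifier would instead be applied to the hidden state at every time step, producing one output per input.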
The parameters of a recurrent neural network can be learned by gradient descent. There are two main ways to compute the gradient: the backpropagation-through-time (BPTT) algorithm and the real-time recurrent learning (RTRL) algorithm.
The main idea of BPTT is to compute the gradient with an error backpropagation algorithm similar to that of a feedforward network: the RNN is treated as an unrolled multilayer feedforward network in which "each layer" corresponds to "each time step". The gradient can then be computed by ordinary backpropagation, and because the parameters are shared, the gradient of each shared parameter is the sum of its gradients over all layers (time steps).
Real-Time Recurrent Learning (RTRL), by contrast, computes the gradient by forward propagation.
Both the RTRL and BPTT algorithms are based on gradient descent and apply the chain rule to compute the gradient, in forward mode and reverse mode respectively. In a recurrent neural network, the output dimension is usually much lower than the input dimension, so BPTT requires less computation; however, BPTT must store the intermediate gradients of all time steps, giving it high space complexity. The RTRL algorithm does not need to propagate gradients backwards, which makes it very suitable for online learning and for tasks with infinite (unbounded) sequences.
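The forward-mode idea behind RTRL can be shown on the same scalar RNN: instead of a backward pass, the sensitivities dh_t/du and dh_t/dw are carried forward alongside the hidden state. Again a sketch under the same assumed setup (tanh activation, loss L = 0.5 * h_T**2):

```python
import numpy as np

def rtrl_scalar(xs, u, w):
    """RTRL on the scalar RNN h_t = tanh(u*h_{t-1} + w*x_t):
    the sensitivities dh_t/du and dh_t/dw are updated *forward in
    time* together with h_t, so no stored history and no backward
    pass are needed -- hence its fit for online learning."""
    h, s_u, s_w = 0.0, 0.0, 0.0
    for x in xs:
        a = u * h + w * x
        h_new = np.tanh(a)
        g = 1.0 - h_new ** 2          # tanh'(a)
        s_u = g * (h + u * s_u)       # forward-propagated dh/du
        s_w = g * (x + u * s_w)       # forward-propagated dh/dw
        h = h_new
    loss = 0.5 * h ** 2
    return loss, h * s_u, h * s_w     # chain rule: dL/du, dL/dw
```

Memory use here is constant in the sequence length, whereas BPTT's grows linearly with it; the price is that RTRL must update one sensitivity per parameter at every step, which is expensive for large networks.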