An RNN (recurrent neural network) is a kind of neural network used to process sequence data.
Sequence data: time series data is data collected at different points in time. It reflects how the state or degree of some thing or phenomenon changes over time.
A basic neural network only has weighted connections between layers. The biggest difference in an RNN is that it also has weighted connections between the neurons within a layer.
The picture above is a standard RNN structure diagram. Each arrow represents a transformation; in other words, every arrow connection carries a weight. The left side shows the folded form and the right side shows the unfolded form. The arrow next to h on the left represents the "loop" in this structure, which sits in the hidden layer.
An RNN also has the following features:
1 Weight sharing: every W in the picture is the same, and the same holds for U and V.
2 Each input value is connected only to its own path; it does not connect to other neurons.
RNN variants
Forward output process of a standard RNN
x is the input, h is the hidden layer unit, o is the output, L is the loss function, and y is the label from the training set. The superscript t on these elements denotes their state at time t. Note that the value of unit h at time t is determined not only by the input at that time but also by the times before t. V, W, and U are the weights, and connections of the same type share the same weight. With this understanding, forward propagation is actually very simple: at time t, the hidden state h is computed from the current input x and the hidden state at time t-1, and the output o is then computed from h.
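The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the original author's code. It assumes the common convention that U maps input to hidden, W maps hidden to hidden, and V maps hidden to output. It also assumes a tanh hidden activation, a softmax on the output, and bias vectors b and c; none of these choices is stated in the text.

```python
import math

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, U, W, V, b, c, h0):
    """Run a simple RNN over the sequence xs.

    Assumed roles (not given in the text): U is input-to-hidden,
    W is hidden-to-hidden, V is hidden-to-output, b and c are biases.
    The same U, W, V are reused at every time step (weight sharing).
    """
    h = h0
    hiddens, outputs = [], []
    for x_t in xs:
        # h_t depends on the current input and on the previous hidden state.
        pre = [ux + wh + bias
               for ux, wh, bias in zip(matvec(U, x_t), matvec(W, h), b)]
        h = [math.tanh(p) for p in pre]
        # o_t is computed from h_t, then normalised into a prediction.
        o = [vh + bias for vh, bias in zip(matvec(V, h), c)]
        hiddens.append(h)
        outputs.append(softmax(o))
    return hiddens, outputs
```

Because h is carried from one loop iteration to the next, changing an early input changes every later hidden state, which is the "memory" the text describes.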
RNN training method: BPTT
The BPTT (back-propagation through time) algorithm is a common method for training RNNs. In essence it is still the BP algorithm, and its central idea is the same: keep searching for better points along the negative gradient direction of the parameters being optimized until convergence. In summary, BPTT is essentially gradient descent, so the core of the algorithm is computing the gradient of each parameter.
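To make the gradient computation concrete, here is a small sketch of BPTT for a scalar RNN. The setup is an assumption chosen for brevity and is not taken from the text: the recurrence is h_t = tanh(u * x_t + w * h_{t-1}), the output is h_t itself, and the loss is the sum of squared errors over all time steps. The function names are hypothetical.

```python
import math

def forward(xs, u, w, h0=0.0):
    """Run h_t = tanh(u * x_t + w * h_{t-1}) and keep every state."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(u * x_t + w * hs[-1]))
    return hs  # hs[0] is h0; hs[t] is the state after step t

def loss(xs, ys, u, w):
    """Sum over time of 0.5 * (h_t - y_t)^2 (an assumed loss)."""
    hs = forward(xs, u, w)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(xs, ys, u, w):
    """Gradients of the loss with respect to the shared weights u and w."""
    hs = forward(xs, u, w)
    grad_u = grad_w = 0.0
    dh_next = 0.0  # gradient flowing back from step t+1 into h_t
    for t in range(len(xs), 0, -1):
        # Total gradient on h_t: local loss term plus what arrives from the future.
        dh = (hs[t] - ys[t - 1]) + dh_next
        dpre = dh * (1.0 - hs[t] ** 2)  # back through tanh
        # The weights are shared, so contributions from every step are summed.
        grad_u += dpre * xs[t - 1]
        grad_w += dpre * hs[t - 1]
        dh_next = dpre * w
    return grad_u, grad_w

def gradient_step(xs, ys, u, w, lr=0.01):
    """One gradient-descent update along the negative gradient."""
    grad_u, grad_w = bptt(xs, ys, u, w)
    return u - lr * grad_u, w - lr * grad_w
```

The backward loop walks the sequence in reverse, which is the "through time" part; the update in `gradient_step` is ordinary gradient descent.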
You may object that an RNN is different from a deep neural network: its parameters are shared, and the gradient at a given time step is the sum over that step and the earlier ones, so even if the gradient does not reach the deepest layers, the shallow layers still receive gradient. This is of course true, but updating parameters shared by many layers using the gradient from only a limited number of layers causes problems.
Vanishing or exploding gradients
Long-term dependencies
As the gap between relevant pieces of information grows, an RNN loses the ability to learn to connect information that far apart. This is the long-term dependency problem.
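A small numeric sketch shows why long gaps are hard. It assumes a scalar recurrence h_t = tanh(w * h_{t-1} + x_t), which is a simplification and not a model from the text. The gradient of the last state with respect to the first is a product of one factor per time step, so it shrinks or grows geometrically with the length of the gap.

```python
import math

def state_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for h_t = tanh(w * h_{t-1} + x_t).

    By the chain rule this is the product over all steps of
    w * (1 - h_t^2), one factor per time step.
    """
    h = h0
    grad = 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)
    return grad

# With zero inputs and h0 = 0 the state stays at 0, so each factor is exactly w.
vanishing = state_gradient(0.5, [0.0] * 50)  # 50 factors of 0.5
exploding = state_gradient(1.5, [0.0] * 50)  # 50 factors of 1.5
```

With |w| below 1 the product collapses toward zero (vanishing gradient); with |w| above 1 and unsaturated units it blows up (exploding gradient). Either way, the influence of an input 50 steps back is very hard to learn.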
All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, for example a single tanh layer.
LSTM has the same chain structure, but the repeating module is built differently. Instead of a single neural network layer, and in addition to h flowing over time, the cell state c also flows over time. The cell state c represents long-term memory.
The core idea of LSTM
The key to LSTM is the cell state, the horizontal line running through the top of the diagram.
The cell state is like a conveyor belt. It runs straight along the whole chain with only a few linear interactions, so it is easy for information to flow along it unchanged.
An LSTM can remove information from the cell state or add information to it through carefully designed structures called "gates". A gate is a way to let information through selectively. It consists of a sigmoid neural network layer and a pointwise multiplication operation.
An LSTM has three gates (the forget gate, the input gate, and the output gate) to protect and control the cell state.
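The gate mechanism (a sigmoid followed by a pointwise multiplication) can be illustrated in a few lines. This sketch takes the pre-sigmoid scores directly as an argument; in an LSTM they would come from a learned layer. The function name `apply_gate` is hypothetical.

```python
import math

def sigmoid(z):
    """Squash a score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_scores, values):
    """Scale each value by a sigmoid gate between 0 (block) and 1 (pass)."""
    gate = [sigmoid(s) for s in gate_scores]
    return [g * v for g, v in zip(gate, values)]

# A large positive score lets the value through, a large negative score
# blocks it, and a score of 0 lets half of it through.
gated = apply_gate([50.0, -50.0, 0.0], [2.0, 2.0, 2.0])
```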
The three gates of LSTM in detail
1 Forget gate
The first step in an LSTM is to decide what information to discard from the cell state. This is done by the forget gate. The gate reads h_{t-1} and x_t and outputs a value between 0 and 1 for each number in the cell state C_{t-1}. 1 means "keep completely" and 0 means "discard completely".
Once it is decided what to remember and what to forget, the new input must also have its effect.
2 Input gate
This step decides what new information is stored in the cell state.
Now the old cell state is updated: C_{t-1} becomes C_t.
3 Output gate
Finally, we need to decide what value to output. This output is based on the cell state, but it is a filtered version of it.
The three gates differ in function, but the operation each one performs is the same: a sigmoid layer followed by a pointwise multiplication.
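The three steps above can be put together into a single LSTM time step. The sketch below follows the commonly used LSTM formulation, which is an assumption here since the text's own equations are not shown. The parameter names (`W_f`, `b_f`, and so on) and the helper `init_lstm_params` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_lstm_params(input_size, hidden_size, seed=0):
    """Small random weights for the four blocks: f, i, c (candidate), o."""
    rng = random.Random(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = h_prev + x_t  # every gate reads [h_{t-1}, x_t]
    # 1 Forget gate: how much of each entry of C_{t-1} to keep.
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]
    # 2 Input gate and candidate values: what new information to store.
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]
    c_tilde = [math.tanh(z)
               for z in affine(params["W_c"], params["b_c"], concat)]
    # Update the cell state: C_t = f * C_{t-1} + i * candidate.
    c_t = [fk * ck + ik * gk
           for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # 3 Output gate: a filtered version of the cell state becomes h_t.
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t
```

Note that the cell state update uses only pointwise multiplication and addition, which matches the "conveyor belt" picture: when the forget gate is near 1 and the input gate is near 0, C_{t-1} passes through almost unchanged.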
A variant of LSTM: GRU
GRU combines the forget gate and the input gate into a single update gate, and it also merges the cell state and the hidden state. The resulting model is simpler than the standard LSTM model and is a very popular variant.
GRU is an effective variant of the LSTM network. Its structure is simpler than that of LSTM, it works very well, and it can also address the long-term dependency problem in RNNs.
The GRU model has only two gates: the update gate and the reset gate. z_t and r_t denote the update gate and the reset gate respectively.
The update gate controls how much of the previous state's information is carried into the current state; the larger the value of the update gate, the more of the previous state is carried in. The reset gate controls how much of the previous state's information is written into the current candidate set h~_t; the smaller the reset gate, the less of the previous state is written.
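A single GRU step can be sketched as follows. Two conventions for the update gate exist in the literature. This sketch uses the one that matches the description above, where a larger z_t keeps more of the previous state: h_t = z_t * h_{t-1} + (1 - z_t) * h~_t. The parameter names and the helper `init_gru_params` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_gru_params(input_size, hidden_size, seed=0):
    """Small random weights for the three blocks: z, r, h (candidate)."""
    rng = random.Random(seed)
    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params

def gru_step(x_t, h_prev, params):
    """One GRU time step; returns the new hidden state."""
    concat = h_prev + x_t
    z = [sigmoid(v) for v in affine(params["W_z"], params["b_z"], concat)]
    r = [sigmoid(v) for v in affine(params["W_r"], params["b_r"], concat)]
    # The reset gate scales h_{t-1} before it enters the candidate state.
    reset_concat = [rk * hk for rk, hk in zip(r, h_prev)] + x_t
    h_tilde = [math.tanh(v)
               for v in affine(params["W_h"], params["b_h"], reset_concat)]
    # Larger z keeps more of the previous state (convention assumed above).
    return [zk * hk + (1.0 - zk) * gk
            for zk, hk, gk in zip(z, h_prev, h_tilde)]
```

There is no separate cell state here: the hidden state h plays both roles, which is what "merging the cell state and the hidden state" means in practice.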
In a nutshell, LSTM and GRU both preserve important features through various gate functions, which ensures that these features are not lost during long-term propagation. In addition, GRU has one gate function fewer than LSTM, so it has fewer parameters, and on the whole a GRU trains faster than an LSTM. Which of the two networks is better, however, depends on the specific application scenario.
In terms of training time and number of epochs, GRU comes out ahead.
GRU is simpler than LSTM: it has three sets of gate weights instead of four, so an LSTM has about 1/3 more parameters than a GRU of the same size, and the GRU is less prone to overfitting.
1 On many tasks, GRU and LSTM perform about equally well.
2 GRU has fewer parameters, so it converges more easily and is less prone to overfitting, but when the dataset is large, LSTM performs better.
3 Structurally, GRU has only two gates (update and reset), while LSTM has three (forget, input, output). GRU passes the hidden state directly to the next unit, whereas LSTM wraps the hidden state in a memory cell. GRU merges the cell state with the hidden state.
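The difference in parameter count can be checked with a quick calculation. The sketch assumes the textbook formulation in which every gate or candidate block has one weight matrix over [h_{t-1}, x_t] plus one bias vector. Library implementations may count biases differently, so exact numbers can vary. The function names are hypothetical.

```python
def block_params(input_size, hidden_size):
    """Parameters in one gate or candidate block: weight matrix plus bias."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_param_count(input_size, hidden_size):
    """Four blocks: forget gate, input gate, candidate, output gate."""
    return 4 * block_params(input_size, hidden_size)

def gru_param_count(input_size, hidden_size):
    """Three blocks: update gate, reset gate, candidate."""
    return 3 * block_params(input_size, hidden_size)

lstm_total = lstm_param_count(10, 20)
gru_total = gru_param_count(10, 20)
ratio = gru_total / lstm_total  # GRU size as a fraction of LSTM size
```

Under these assumptions the ratio is always 3/4 regardless of layer sizes, which is the structural reason a GRU is cheaper to train.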