An RNN (recurrent neural network) is a kind of neural network designed to process sequence data.

Sequence data: time-series data is data collected at different points in time. It reflects the state of something, or the degree to which a phenomenon changes, over time.

A basic feed-forward neural network only establishes weighted connections between layers. The biggest difference in an RNN is that it also establishes weighted connections between the neurons within a layer, across time steps.

The picture above is a standard RNN structure diagram. Each arrow represents a transformation; in other words, each arrow connection carries a weight. The left side shows the folded form and the right side the unfolded form. The arrow looping back to h on the left represents the "recurrence", which takes place in the hidden layer.

An RNN also has the following features:

1. Weight sharing: in the picture, all the W matrices are the same, and likewise all the U and all the V.

2. Each input value is only connected to its own path; it is not connected to the other neurons.

Types of RNN


Forward propagation in a standard RNN

x is the input, h is the hidden-layer unit, o is the output, L is the loss function, and y is the training-set label. A superscript t on these symbols denotes the state at time t. Note that the hidden unit h at time t is determined not only by the input at that moment but also by the moments before t. V, W, U are the weights; connections of the same type share the same weight matrix. With this understanding, the forward propagation algorithm is actually very simple. For time t:

h(t) = tanh(U x(t) + W h(t-1) + b)
o(t) = V h(t) + c
ŷ(t) = softmax(o(t))
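As a concrete illustration, the forward pass of a standard RNN can be sketched in NumPy as follows. The dimensions, the tanh nonlinearity, and the softmax output are assumptions for this sketch, not something fixed by the figure:

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Vanilla RNN forward pass: the same U, W, V are reused at every step."""
    h = np.zeros(W.shape[0])            # initial hidden state h_0
    outputs = []
    for x in xs:
        # h_t depends on the current input x_t AND the previous state h_{t-1}
        h = np.tanh(U @ x + W @ h + b)
        o = V @ h + c                   # raw output (logits) at time t
        e = np.exp(o - o.max())
        outputs.append(e / e.sum())     # softmax over the output
    return outputs

# toy example: a sequence of 5 inputs, input dim 3, hidden dim 4, output dim 2
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
U = rng.normal(size=(4, 3)); W = rng.normal(size=(4, 4))
V = rng.normal(size=(2, 4))
ys = rnn_forward(xs, U, W, V, np.zeros(4), np.zeros(2))
```

The loop makes the weight sharing explicit: U, W, and V never change from step to step; only the hidden state h carries information forward.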

RNN training method: BPTT

BPTT (back-propagation through time) is a common algorithm for training RNNs; in essence it is still the BP algorithm. The central idea of BPTT is the same as that of BP: keep searching for better points along the negative gradient direction of the parameters to be optimized, until convergence. In summary, BPTT is essentially gradient descent, so the core of the algorithm is computing the gradient of each parameter.

You may object that an RNN is different from a deep feed-forward network: the RNN's parameters are shared across time steps, and the gradient at a given moment is the sum of the gradients from this moment and all earlier moments, so even if the gradient does not reach the earliest steps, the more recent steps still contribute a gradient. This is true, but if we update the shared parameters using gradients from only a limited number of steps, problems arise.

Vanishing or exploding gradients

Long-term dependencies

As the interval grows, the RNN loses the ability to learn to connect information that far back. This is the problem of long-term dependencies.


All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeating module has a very simple structure, for example a single tanh layer.

An LSTM has the same chain structure, but its repeating module is different. Instead of a single neural-network layer, in addition to the hidden state h flowing through time, a cell state c also flows through time; the cell state c represents long-term memory.

The core idea of LSTM

The key to LSTM is the cell state, the horizontal line running across the top of the diagram.

The cell state is like a conveyor belt. It runs straight down the entire chain, with only a few minor linear interactions, so it is easy for information to flow along it unchanged.

The LSTM can remove or add information to the cell state through carefully designed structures called "gates". A gate is a way to let information through selectively. Each gate consists of a sigmoid neural-network layer and a pointwise multiplication operation.

An LSTM has three gates that protect and control the cell state: the forget gate, the input gate, and the output gate.

The three gates of LSTM in detail

1. Forget gate

The first step in an LSTM is to decide what information to discard from the cell state, which is done by the forget gate. The gate reads h(t−1) and x(t) and outputs a number between 0 and 1 for each entry of the cell state C(t−1): 1 means "keep completely", 0 means "discard completely".

Once we have decided what to remember and what to forget, the new input must also be allowed to have its effect.

2. Input gate

This step decides what new information to store in the cell state.

Then the old cell state is updated: C(t−1) becomes C(t), combining what the forget gate keeps with what the input gate writes.

3. Output gate

Finally, we need to decide what value to output. This output is based on the cell state, but it is a filtered version of it.

These three gates differ in function, but they are identical in the operation they perform: a sigmoid layer followed by a pointwise multiplication.
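A minimal sketch of one LSTM step, assuming the common formulation in which every gate reads the concatenation [h(t−1), x(t)]; the weight shapes and toy dimensions are choices made for the sketch:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM step: all three gates perform the same operation
    (sigmoid layer + pointwise multiply); only their roles differ."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)           # forget gate: what to erase from c
    i = sigmoid(Wi @ z + bi)           # input gate: what to write to c
    c_tilde = np.tanh(Wc @ z + bc)     # candidate new values
    c_t = f * c_prev + i * c_tilde     # updated cell state (long-term memory)
    o = sigmoid(Wo @ z + bo)           # output gate: what to expose
    h_t = o * np.tanh(c_t)             # hidden state: a filtered cell state
    return h_t, c_t

# toy usage: input dim 3, hidden dim 4, zero initial states
rng = np.random.default_rng(1)
d_in, d_h = 3, 4
params = [rng.normal(size=(d_h, d_h + d_in)) if k % 2 == 0 else np.zeros(d_h)
          for k in range(8)]
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), *params)
```

The line `c_t = f * c_prev + i * c_tilde` is the conveyor belt described above: the only interactions with the cell state are one elementwise multiply and one elementwise add.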

A variant of LSTM: GRU

The GRU combines the forget and input gates into a single update gate, and it also merges the cell state and the hidden state. The resulting model is simpler than the standard LSTM and is a very popular variant.

GRU is a refined variant of the LSTM network: its structure is simpler than LSTM's, it works very well, and it too can address the long-term dependency problem of RNNs.

The GRU model has only two gates: an update gate and a reset gate, denoted z(t) and r(t) respectively.

The update gate controls how much of the previous state's information is carried into the current state: the larger the update gate's value, the more information from the previous time step is brought in. The reset gate controls how much of the previous state is written into the current candidate h̃(t): the smaller the reset gate, the less information from the previous state is written.
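The two gates can be sketched in the same style as the LSTM step. This follows the convention, also used by PyTorch's GRU, in which a larger z(t) keeps more of the previous state; the shapes and dimensions are assumptions for the sketch:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, bz, Wr, br, Wh, bh):
    """One GRU step: two gates instead of LSTM's three,
    and no separate cell state."""
    za = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ za + bz)                 # update gate z_t
    r = sigmoid(Wr @ za + br)                 # reset gate r_t
    # the reset gate scales how much of h_{t-1} enters the candidate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    # larger z_t -> more of the previous state is carried over
    return z * h_prev + (1 - z) * h_tilde

# toy usage: input dim 3, hidden dim 4, zero initial state
rng = np.random.default_rng(2)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(3))
h = gru_step(rng.normal(size=d_in), np.zeros(d_h),
             Wz, np.zeros(d_h), Wr, np.zeros(d_h), Wh, np.zeros(d_h))
```

Because the new state is an elementwise blend of h_prev and h_tilde weighted by z, the hidden state itself plays the role that the cell state plays in an LSTM.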


In a nutshell, LSTM and GRU both preserve important features through their gate functions, which ensures that those features are not lost during long-term propagation. In addition, a GRU has one fewer gate function than an LSTM and therefore fewer parameters, so on the whole a GRU trains faster than an LSTM. Which of the two networks works better, however, depends on the specific application scenario.

In terms of training time and epochs, GRU often comes out ahead.

GRU is simpler than LSTM, with about one third fewer parameters, and is less prone to overfitting.

Conclusion:

1. Neither GRU nor LSTM dominates: on many tasks, each outperforms the other in some respects.

2. GRU has fewer parameters, converges more easily, and is less prone to overfitting; but when the dataset is large, LSTM performs better.

3. Structurally, GRU has only two gates (update and reset), while LSTM has three (forget, input, output). GRU passes the hidden state directly to the next unit, whereas LSTM wraps the hidden state in a memory cell; GRU merges the cell state and the hidden state.
