An RNN (recurrent neural network) is a kind of neural network used to process sequence data.

Sequence data: time-series data is data collected at different points in time. This kind of data reflects the state of something, or its degree of change, over time.

A basic neural network only establishes weighted connections between layers. The biggest difference in an RNN is that it also establishes weighted connections between the neurons within a layer, i.e. between the hidden units at successive time steps.

The picture above is a standard RNN structure diagram. Each arrow represents a transformation; in other words, each connection carries a weight. The left side shows the folded (rolled-up) form and the right side the unrolled form; the arrow next to h in the middle of the left side represents the "loop", which lives in the hidden layer.

An RNN also has the following features:

1  Weight sharing: in the picture, all the W are the same, and likewise all the U and all the V.

2  Each input value is only connected along its own path; it does not connect to the neurons of other time steps.

Varieties of RNN


Forward propagation in a standard RNN

x is the input, h is the hidden-layer unit, o is the output, L is the loss function, and y is the training-set label. The superscript t on these symbols denotes the state at time t. Note that the state of the hidden unit h at time t is determined not only by the input at that moment but also by the moments before t. V, W, and U are the weights; connections of the same type share the same weight. With this understanding, forward propagation is actually very simple: for time t, the hidden state is computed from the current input and the previous hidden state, and the output is computed from the hidden state.
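In the standard formulation, that forward pass is h_t = tanh(U x_t + W h_{t-1} + b), o_t = V h_t + c, ŷ_t = softmax(o_t). A minimal NumPy sketch (all sizes, weights, and inputs below are illustrative placeholders, not values from the text):

```python
import numpy as np

# Minimal vanilla-RNN forward pass with placeholder dimensions and random
# weights; U, W, V play the roles described in the text.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5

U = rng.normal(size=(n_hid, n_in))    # input  -> hidden
W = rng.normal(size=(n_hid, n_hid))   # hidden -> hidden (the recurrent "loop")
V = rng.normal(size=(n_out, n_hid))   # hidden -> output
b, c = np.zeros(n_hid), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, n_in))       # an input sequence x_1 .. x_T
h = np.zeros(n_hid)                   # initial hidden state h_0
outputs = []
for x_t in xs:
    # h_t depends on the current input x_t AND the previous state h_{t-1}
    h = np.tanh(U @ x_t + W @ h + b)
    o_t = V @ h + c                   # raw output o_t
    outputs.append(softmax(o_t))      # predicted distribution y_hat_t

print(len(outputs), outputs[0].shape)  # 5 (2,)
```

Because the same U, W, V are reused at every step, the loop body is identical for each t; only the state h carries information forward.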

Training RNNs: BPTT

BPTT (back-propagation through time) is a common method for training RNNs; in essence it is still the BP algorithm. The central idea of BPTT is the same as BP: keep searching for better points along the negative gradient direction of the parameters to be optimized until convergence. In summary, BPTT is essentially gradient descent, so the core of the algorithm is computing the gradient of each parameter.
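Written out for the shared weight W (a standard derivation, using the L, o, h notation above; not reproduced from the original text), the BPTT gradient sums a contribution from every time step:

```latex
\frac{\partial L}{\partial W}
  = \sum_{t} \sum_{k=1}^{t}
    \frac{\partial L^{(t)}}{\partial o^{(t)}}\,
    \frac{\partial o^{(t)}}{\partial h^{(t)}}
    \left( \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right)
    \frac{\partial h^{(k)}}{\partial W}
```

The product of Jacobians over the span from k to t is the term to watch: it is what makes the gradient contributions of distant time steps shrink or blow up.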

You may object that an RNN is different from a deep feed-forward network: the RNN's weights are shared, and the gradient at a given moment is the sum over this moment and the moments before it, so even if the gradient cannot reach the deepest (earliest) time steps, the shallower ones still contribute a gradient. That is true, but if we update the shared parameters using only the gradient from a limited number of time steps, problems arise.

Vanishing and exploding gradients

Long-term dependencies

As the gap between time steps grows, an RNN loses the ability to learn to connect information that far apart; this is the long-term dependency problem.
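A toy numerical illustration of why this happens (all numbers hypothetical): backpropagating through t steps multiplies t per-step Jacobians together, and when their norms sit below 1 the product decays geometrically, so the signal from distant steps effectively vanishes.

```python
import numpy as np

# For h_t = tanh(W h_{t-1}), the one-step Jacobian is diag(1 - h_t^2) @ W.
# Backpropagation multiplies these Jacobians; with a recurrent matrix of
# small spectral norm, the gradient norm decays fast as the gap grows.
rng = np.random.default_rng(1)
n = 8
W = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)  # spectral norm well below 1

h = rng.normal(size=n)
grad = np.ones(n)      # stand-in for the gradient arriving at the last step
norms = []
for _ in range(50):
    h = np.tanh(W @ h)
    J = np.diag(1.0 - h**2) @ W     # one-step Jacobian dh_t / dh_{t-1}
    grad = J.T @ grad               # push the gradient one step further back
    norms.append(np.linalg.norm(grad))

print(norms[0] > norms[-1])  # True: the gradient shrinks with the time gap
```

The mirror-image case, a recurrent matrix with spectral norm above 1, makes the same product grow without bound: the exploding-gradient problem.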


All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeating module has a very simple structure, for example a single tanh layer.

An LSTM has the same chain structure, but the repeating module is different. Rather than being a single neural-network layer, in addition to the hidden state h flowing through time, a cell state c also flows through time; the cell state c represents long-term memory.

The core idea of LSTM

The key to LSTM is the cell state: the horizontal line running across the top of the diagram.

The cell state is like a conveyor belt. It runs straight down the whole chain with only a few minor linear interactions, so it is easy for information to flow along it unchanged.

LSTM has carefully designed structures called "gates" that can remove information from, or add information to, the cell state. A gate is a way of letting information through selectively: it consists of a sigmoid neural-network layer and a pointwise multiplication operation.

An LSTM has three gates to protect and control the cell state: the forget gate, the input gate, and the output gate.
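As a tiny illustration of the gate mechanism (toy numbers, not from the text): the sigmoid layer produces a mask of values in (0, 1), which is multiplied pointwise into whatever signal the gate guards.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate = sigmoid layer + pointwise multiply: 0 blocks a component
# entirely, 1 passes it through unchanged, values in between attenuate it.
gate = sigmoid(np.array([-10.0, 0.0, 10.0]))  # ~[0.0, 0.5, 1.0]
signal = np.array([3.0, 3.0, 3.0])
filtered = gate * signal
print(np.round(filtered, 2))  # approximately [0.  1.5  3.]
```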

The three gates of LSTM in detail

1  Forget gate

The first step in an LSTM is to decide what information to discard from the cell state, which is done by the forget gate. The gate reads h_{t-1} and x_t and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}: 1 means "keep completely" and 0 means "discard completely".

Having decided what to remember and what to forget, the new input is bound to have an influence as well.

2  Input gate

This step decides what new information will be stored in the cell state.

Next, the old cell state is updated: C_{t-1} is updated to C_t.

3  Output gate

Finally, we need to decide what value to output. The output is based on our cell state, but it is a filtered version of it.

The three gates differ in function, but the operation each performs to carry out its task is the same.
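Putting the three gates together, one LSTM step can be sketched in NumPy as follows (weight names and sizes are illustrative placeholders). Note that all three gates really do execute the same operation, a sigmoid layer over [h_{t-1}, x_t] followed by a pointwise multiply; they differ only in what the resulting mask is applied to.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; the weight names in p are illustrative placeholders."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ z + p["bf"])        # forget gate: what to drop from c
    i = sigmoid(p["Wi"] @ z + p["bi"])        # input gate: what new info to admit
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])  # candidate cell-state values
    c = f * c_prev + i * c_tilde              # update the long-term memory
    o = sigmoid(p["Wo"] @ z + p["bo"])        # output gate: filter the cell state
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
p = {"W" + k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
p.update({"b" + k: np.zeros(n_hid) for k in "fico"})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)  # (4,) (4,)
```

The update line `c = f * c_prev + i * c_tilde` is the conveyor belt from earlier: the only things that touch the cell state are the two pointwise gate interactions.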

A variant of LSTM: GRU

It combines the forget and input gates into a single update gate, and it also merges the cell state and the hidden state. The resulting model is simpler than the standard LSTM and is a very popular variant.

GRU is a well-regarded variant of the LSTM network: its structure is simpler than LSTM's, it works very well, and it can also address the long-term dependency problem of RNNs.

There are only two gates in the GRU model: the update gate and the reset gate, denoted z_t and r_t respectively.

The update gate controls how much of the previous state's information is carried into the current state: the larger the update gate's value, the more previous-state information is brought in. The reset gate controls how much of the previous state is written into the current candidate set h̃_t: the smaller the reset gate's value, the less previous-state information is written. (The exact sign conventions for these gates vary between write-ups.)
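The two gates above can be sketched in NumPy as follows (weight names and sizes are illustrative placeholders; this uses the convention h_t = (1 - z_t) h_{t-1} + z_t h̃_t from Cho et al.'s original formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; the weight names in p are illustrative placeholders."""
    zx = np.concatenate([h_prev, x])
    z = sigmoid(p["Wz"] @ zx + p["bz"])  # update gate z_t
    r = sigmoid(p["Wr"] @ zx + p["br"])  # reset gate r_t
    # candidate h~_t: the reset gate scales how much of h_prev is written in
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([r * h_prev, x]) + p["bh"])
    # the update gate interpolates between the old state and the candidate
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4
p = {"W" + k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "zrh"}
p.update({"b" + k: np.zeros(n_hid) for k in "zrh"})

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, p)
print(h.shape)  # (4,)
```

Compared with the LSTM step, there is no separate cell state: the single vector h plays both roles, which is exactly the merging described above.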


In a nutshell, LSTM and GRU preserve important features through their various gate functions, which ensures that those features are not lost during long-term propagation. In addition, GRU has one gate function fewer than LSTM, so it has fewer parameters and, on the whole, trains faster than LSTM. Which of the two networks is better, however, depends on the specific application scenario.

In terms of training time and epochs needed, GRU often comes out ahead.

GRU is simpler than LSTM and has fewer parameters (one gate fewer, roughly a quarter fewer parameters), so it is less prone to overfitting.

Conclusions:

1 GRU and LSTM each outperform the other on some tasks; across many tasks neither is uniformly better.

2 GRU has fewer parameters and converges more easily, and it is less prone to overfitting; but when the dataset is large, LSTM tends to perform better.

3 Structurally, GRU has only two gates (update and reset) while LSTM has three (forget, input, output). GRU passes its hidden state directly to the next unit, whereas LSTM wraps the hidden state inside a memory cell; GRU merges the cell state with the hidden state.
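Point 2's parameter claim can be sanity-checked with a back-of-the-envelope count: an LSTM cell has four weight blocks (three gates plus the candidate) while a GRU cell has three, so a GRU carries roughly three quarters of the LSTM's parameters at the same layer sizes (the sizes below are hypothetical).

```python
# Per-cell parameter count: each block is a weight matrix over the
# concatenated [h_prev, x] plus a bias vector; output projections ignored.
def gated_cell_params(n_in, n_hid, n_blocks):
    return n_blocks * (n_hid * (n_in + n_hid) + n_hid)

n_in, n_hid = 128, 256
lstm = gated_cell_params(n_in, n_hid, 4)  # forget, input, output, candidate
gru = gated_cell_params(n_in, n_hid, 3)   # update, reset, candidate
print(gru / lstm)  # 0.75
```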
