LSTM in detail is easy
How to understand LSTM properly in order to create efficient models?
Introduction
Hello everyone! In this article I want to share my knowledge about LSTM/RNN technologies. If you have struggled to understand each part of recurrent neural networks, this article is for you.
I went a long way collecting data from many sources and articles by many authors to assemble my own understanding, and now I want to put it here in a structured way.
Do not worry about the math and other details: there will be a lot of them, but I will try to explain everything very simply.
Preparation
We know that getting into recurrent neural networks is like admission to a university: before you do it, you need to graduate from school first, with good marks of course.
Here are some roadmap points that you need to understand clearly:
- A simple fully connected neural network layer (forward pass + backpropagation): article, article, article.
- Revise some math (sigmoid, hyperbolic tangent): article, article (a quick refresher sketch follows this list).
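If you only need a quick refresher, here is a minimal sketch of the two activation functions and of a single fully connected forward pass; the toy numbers are mine, purely for illustration:
import numpy as np

def sigmoid(value):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-value))

x = np.array([0.5, -0.5])
w = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([0.1, 0.1])

print(sigmoid(x))                 # values in (0, 1)
print(np.tanh(np.dot(w, x) + b))  # one dense forward pass, values in (-1, 1)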
Beginning: RNN
I am pretty sure that you have seen plenty of such diagrams if you have ever tried to look up information about LSTM or RNN.
This good article gives a perfect explanation of RNN. We will take only the structure and the basic equations from it.
What do we basically have in an RNN?
For example, we have a sequence of numbers: -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8.
We take the frame [1, 2, 3, 4] as the current input, we know the past frame [-3, -2, -1, 0], and we want to approximate the future frame [5, 6, 7, 8].
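As a minimal sketch (the variable names here are mine, just for illustration), these three frames can be sliced out of the sequence like this:
import numpy as np

seq = np.arange(-3, 9)    # the whole sequence: -3, -2, ..., 8
past = seq[:4]            # [-3, -2, -1, 0]  -> previous output frame
current = seq[4:8]        # [ 1,  2,  3,  4] -> current input frame
future = seq[8:12]        # [ 5,  6,  7,  8] -> target frame to approximate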
Inside the RNN there is just a usual neural network. Remember how the forward pass works in a simple perceptron?
We need to multiply the input (the number sequence) by the synapse weights (a matrix of weights), add the bias, and finally apply the activation function (hyperbolic tangent in our case).
If we put it all into a formula, we get:
ht = tanh(fW · [ht-1, xt] + b)
Where:
- ht — current output in the moment of time “t”
- ht-1 — previous output in the moment of time “t-1”
- xt — current input in the moment of time “t”
- fW — matrix of weights
- b — bias
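Here is a minimal sketch of this formula in code; the function name rnn_step is mine and used only for illustration (the example with concrete numbers follows below):
import numpy as np

def rnn_step(h_prev, x_t, w, b):
    # concatenate the previous output and the current input,
    # multiply by the weight matrix, add the bias, squash with tanh
    concat = np.concatenate([h_prev, x_t])
    return np.tanh(np.dot(w, concat) + b)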
Example with our sequence:
We have two input vectors: the previous output with shape (4,) and the current input with shape (4,).
When the previous output and the current input come into the neural network, they are concatenated into one vector, with shape (8,) in our case.
The matrix of weights will have shape (output frame length, concatenated vector length) => (4, 8).
The bias will have shape (4,).
I will use python 3.6 to make calculations:
import numpy as np

w = np.array([[1, 1, 1, 1, 1, 1, 1, 1],
              [2, 2, 2, 2, 2, 2, 2, 2],
              [3, 3, 3, 3, 3, 3, 3, 3],
              [3, 3, 3, 3, 3, 3, 3, 3]])
x = np.array([-3, -2, -1, 0, 1, 2, 3, 4])  # previous output concatenated with current input
b = np.array([1, 1, 1, 1])

h = np.dot(w, x) + b
h = np.tanh(h)
Output = [0.99, 0.99, 1., 1.]
But our target is [5, 6, 7, 8]. Well, here is the first point to note:
Scale your input data to the output range of the last activation function.
Our input was [1, 2, 3, 4] and the previous output was [-3, -2, -1, 0]. With scaling to the tangent range (-1, 1):
def scale(old_value, old_min, old_max, new_max=1, new_min=-1):
    # linearly map a value from [old_min, old_max] to [new_min, new_max]
    new_value = (((old_value - old_min) * (new_max - new_min)) / (old_max - old_min)) + new_min
    return new_value

target = np.array([5, 6, 7, 8])    # the future frame we want to approximate

x = scale(x, -3, 8)                # scale the input to (-1, 1)
target = scale(target, -3, 8)      # scale the target the same way
h = np.tanh(np.dot(w, x) + b)      # prediction
inverse = scale(h, -1, 1, 8, -3)   # map the prediction back to the original range
Scaled input: [-1., -0.82, -0.64, -0.46, -0.27, -0.09, 0.09, 0.27]
Scaled target: [0.46, 0.64, 0.82, 1.]
Prediction: [-0.96, -0.99, -0.99, -0.99]
Prediction after inverse scaling: [-2.76, -2.99, -2.99, -2.99]
Of course this is a bad result, but here is the main idea of scaling: first scale your data to the (-1, 1) range, then make the prediction, and finally apply the inverse transformation.
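To keep this pattern handy, here is a minimal sketch that wraps the three steps into one helper, reusing the scale function, w and b defined above; the names predict_frame and model_fn are mine, used only for illustration:
def predict_frame(model_fn, x_raw, data_min, data_max):
    # 1. scale the raw input into the (-1, 1) range of tanh
    x_scaled = scale(x_raw, data_min, data_max)
    # 2. run the model on the scaled input
    prediction_scaled = model_fn(x_scaled)
    # 3. map the prediction back to the original data range
    return scale(prediction_scaled, -1, 1, data_max, data_min)

x_raw = np.array([-3, -2, -1, 0, 1, 2, 3, 4])   # previous output + current input, unscaled
prediction = predict_frame(lambda v: np.tanh(np.dot(w, v) + b), x_raw, -3, 8)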
Let’s deal with LSTM now.
LSTM
Now that you know more about RNN, it's time to learn more about LSTM.
The LSTM architecture consists of four main parts:
- Cell state (Ct) — a 1D vector of fixed shape, initialized with random values.
- Forget gate (Ft) — a special neural network that changes the cell state by erasing unimportant values.
- Remember gate (It) — a special neural network that changes the cell state by adding new values to it.
- Output gate (Ot) — a special neural network that filters the LSTM input under the influence of the already updated cell state.
Step by step for LSTM:
We will create an LSTM with a cell state of 8 units, randomly initialized as [0.1, -0.3, 1, -1, 0.7, 0.1, 0.02, 0.15].
1. We concatenate our two vectors (the previous output [1, 2, 3] and the current input [4, 5, 6]) → [1, 2, 3, 4, 5, 6]
import numpy as np

def scale(old_value, old_min, old_max, new_max=1, new_min=-1):
    new_value = (((old_value - old_min) * (new_max - new_min)) / (old_max - old_min)) + new_min
    return new_value

x = np.array([1, 2, 3, 4, 5, 6])  # concatenated vector of previous output and current input
x = scale(x, -2, 9)
target = np.array([7, 8, 9])
2. The forget gate neural network:
The weights matrix has shape (cell_state_length, input_vector_length).
Ft = sigmoid(Wf · [1, 2, 3, 4, 5, 6] + bf), where Wf has shape (8, 6) and bf has shape (8,)
def sigmoid(value):
    return 1 / (1 + np.exp(-value))

lstm_cell_state = np.array([0.1, -0.3, 1, -1, 0.7, 0.1, 0.02, 0.15])
Wf = np.random.rand(lstm_cell_state.shape[0], x.shape[0])   # shape (8, 6)
bias_f = np.random.rand(lstm_cell_state.shape[0])           # shape (8,)
Ft = sigmoid(np.dot(Wf, x) + bias_f)
Ft = [0.22, 0.49, 0.66, 0.54, 0.22, 0.25, 0.46, 0.45] (your exact values will differ, since the weights are random)
The output of the forget gate means: if a value is close to 1, keep the corresponding part of the cell state; if it is close to 0, erase it. So we need to multiply our cell state by Ft element-wise.
lstm_cell_state *= Ft
New cell state after forgetting: [0.063, -0.02, 0.31, -0.31, -0.19, 0.01, 0.009, 0.056].
3. Remember gate neural network
This part consists of three actions and two neural networks: remember (sigmoid) and candidate (tanh).
First of all we decide what we want to remember:
It = sigmoid(Wi · [1, 2, 3, 4, 5, 6] + bi)
Wi = np.random.rand(lstm_cell_state.shape[0], x.shape[0])
bias_i = np.random.rand(lstm_cell_state.shape[0])
It = sigmoid(np.dot(Wi, x) + bias_i)
It = [-0.21, 0.42, 0.19, 0.43, 0.28, 0.63, -0.09, 0.7 ]
Next is the candidate:
Ct (candidate) = tanh(Wc · [1, 2, 3, 4, 5, 6] + bc)
Wc = np.random.rand(lstm_cell_state.shape[0], x.shape[0])
bias_c = np.random.rand(lstm_cell_state.shape[0])
Ct = np.tanh(np.dot(Wc, x) + bias_c)
Ct = [0.48, 0.77, 0.39, 0.35, 0.62, 0.48, 0.26, 0.34]
And finally we multiply It and Ct element-wise and add the result to the cell state to write the necessary information there.
Cell state = Cell state (after forget gate) + (It × Ct)
lstm_cell_state += It * Ct
New lstm cell state = [-0.08, 0.17, 0.49, -0.4, -0.18, 0.02, 0.29, 0.18]
4. Output gate
Finally, we need to produce the output of the model.
Ot = sigmoid(Wo · [1, 2, 3, 4, 5, 6] + bo)
Wo = np.random.rand(lstm_cell_state.shape[0], x.shape[0])
bias_o = np.random.rand(lstm_cell_state.shape[0])
Ot = sigmoid(np.dot(Wo, x) + bias_o)
Output = Ot * np.tanh(lstm_cell_state)
Output of the model: [0.16, 0.06, 0.28, -0.22, 0.14, 0.02, 0.17, -0.07]
For sure we can't interpret these results directly as an approximation of our initial target [7, 8, 9]. We need to create a dense layer to transform the LSTM output vector (n_units = 8) to the target length (3). The activation function will be the hyperbolic tangent, since its (-1, 1) range matches our scaled data.
W_dense = np.random.rand(3, Output.shape[0])
bias_dense = np.random.rand(3)
dense_output = np.tanh(np.dot(W_dense, Output) + bias_dense)
Final output is: [0.48, 0.78, 0.64]
After inverse scaling: [ 7.54, 8.32, 8.41]
Real target: [7, 8, 9]
With these random weights we got a rather good result. Why? Proper data preprocessing and an understanding of each part of the LSTM.
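Putting the whole walkthrough together, here is a minimal sketch of a single LSTM step as one function; it reuses the sigmoid defined above, and the function name lstm_step and the bundling of the weights into dictionaries are mine, used only for illustration:
def lstm_step(x, cell_state, weights, biases):
    # x is the concatenated (previous output + current input) vector
    # forget gate: decide which parts of the cell state to erase
    Ft = sigmoid(np.dot(weights["f"], x) + biases["f"])
    # remember gate + candidate: decide what new information to write
    It = sigmoid(np.dot(weights["i"], x) + biases["i"])
    Ct_candidate = np.tanh(np.dot(weights["c"], x) + biases["c"])
    # update the cell state: forget first, then add the new candidate values
    cell_state = cell_state * Ft + It * Ct_candidate
    # output gate: filter the updated cell state to produce the output
    Ot = sigmoid(np.dot(weights["o"], x) + biases["o"])
    output = Ot * np.tanh(cell_state)
    return output, cell_state

# example with the same shapes as in the walkthrough (8 cell units, input of length 6)
weights = {k: np.random.rand(8, 6) for k in ("f", "i", "c", "o")}
biases = {k: np.random.rand(8) for k in ("f", "i", "c", "o")}
output, new_cell_state = lstm_step(x, lstm_cell_state, weights, biases)
A dense layer like the one above can then map the 8-unit output back to the target length.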
Conclusion
Note that there is no explanation of recurrent neural network training here; it will be in my next article, because it's a rather huge topic and deserves a special article.
Here are some useful articles to read about LSTM:
I hope this article has helped you understand LSTM in detail, because it's very important to understand each step of your work in order to do it better.
Good luck!
Bondarenko K., machine learning engineer :)