LSTM for real-time recommendation systems

How to prepare your data for LSTM models.

Kirill Bondarenko
10 min read · Mar 26, 2019

Everything is a sequence

To begin, let's build an understanding of our subject. What does a sequence mean?

In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed. -Wikipedia

Examples of sequences are everywhere:

  1. Text — sequence of words and chars.
  2. Words — sequence of chars.
  3. Music — sequence of sounds.
  4. Annual currency exchange rates — sequence of numbers.
  5. Shopping list — sequence of goods to buy.
  6. Browser history — sequence of visited web pages.

This list could go on and on; you can continue it with your own examples. The key point is that every sequence is a time-distributed collection of data.

Let’s take a closer look at one of the examples listed above: text.

Text is a sequence of words. When you read a text, you read it word by word and understand it sequentially. The meaning builds up in your mind as you read through it.

Have you ever read half of a book and guessed how the story ends? I think so. What about guessing the next word? For example, the main character in a story loves to say phrases like “I love to play tennis, it is my hobby!” or “Tennis is the sense of my life”. When another character asks the main one about his or her hobby, you can guess that the next word written will be “tennis”. Or when the main character says “I love to …”, you can guess the ending will be “play tennis”.

What about machine learning? Can an LSTM make predictions like this? Yes.

For example, we have the text: “I love tennis, tennis is life”.

Split it by “,” into two sentences and split each sentence by spaces into words. We get two sequences of length 3.
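A minimal sketch of this split in plain Python (no libraries needed):

text = "I love tennis, tennis is life"
# Split into sentences by ",", then each sentence into words by spaces
sequences = [sentence.split() for sentence in text.split(",")]
print(sequences)
#Output => [['I', 'love', 'tennis'], ['tennis', 'is', 'life']]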

An LSTM’s result will be something like “I love tennis is life”.

The result is silly and poor from a grammar point of view, but the logic is still there.

I will not explain the LSTM structure and its internals here. There are a lot of great articles, such as Magic of LSTM, Text generation example, and Understanding of LSTM. If you have no experience with LSTMs, have a look at these articles first. There is also a great slideshow with the math explained at this link.

Why do we look at a text example when the topic is recommendation systems?

Shopping list example

As listed in the examples above, a shopping list is a sequence of goods to buy. Let’s look at an example:

We see the first four items: milk, eggs, cheese and butter.

Now imagine this photo was taken by your friend and sent to you so that you could do the shopping, but your friend switched off his or her smartphone, and you are trying to guess the fifth item. We can see that it is cream, but let’s imagine that you can’t. What would you do?

Fortunately, you have found the last three shopping list photos in your photo library.

You see that your current list ends with butter. In your last three lists, butter is followed by cream twice and by eggs once. You guess that your friend builds the lists by some rule like “put dairy products together”, so the next item is probably cream, and you guess it! Congrats!

A shopping list is a text made of words (the names of goods). The items are surely not written in the strict order in which they will be bought, but let’s imagine they are.

And so we smoothly arrive at recommendations.

Users history as a text

The chapter title states the main idea.

We can look at users the way we look at a text book with stories or sentences, only in our case a sentence is a sequence of items or item IDs. Look at the picture above. We have three users, and each one has a history of web activity for a certain period of time, say one hour. We want to predict what the next item will be. Let’s write a short text story about these users, as if they told it to us.

Text: “I use Amazon next I use Google+ and Instagram. I use Instagram and next I use Evernote and Amazon. I use Evernote and next I use Instagram and Google+.”

We transform this text into sentences, dropping all words that are not brands, and get the following:

“Amazon,Google+,Instagram” , “Instagram,Evernote,Amazon” , “Evernote,Instagram,Google+” .

To build the input for a model, we use a technique called tokenization.

We will give each element a unique ID.

So, the sequences will be like: “1,2,3” / “3,4,1” / “4,3,2”.

The next step is to transform our raw sequences into supervised observations. Jason Brownlee wrote a great post on this topic.

Supervised observations are a way of preparing data so that we have inputs and corresponding outputs (targets or labels). Let’s split each sequence into 2 and 1 elements, where the first two are the input and the last one is the target.

So here we could naively suppose that user 3 will act as 4,3 → 2, user 2 as 3,4 → 1, and so on. An LSTM will do it better.
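As a rough sketch in plain Python, using the ID sequences from above, the split into inputs and targets looks like this:

sequences = [[1, 2, 3], [3, 4, 1], [4, 3, 2]]
# The first two IDs are the input, the last one is the target
inputs = [seq[:-1] for seq in sequences]
targets = [seq[-1] for seq in sequences]
print(inputs)
print(targets)
#Output =>
#[[1, 2], [3, 4], [4, 3]]
#[3, 1, 2]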

The main idea is that you can transform your non-time-series data into time series.

I saw this idea in this article by the “Deep Systems” company.

Practical example

Before you start, you must analyze your data and clearly state your goal.

First of all, your data must be real and well formatted. Real data has consistent logic between its items and can be described simply in a few words, like “user ratings for our top 10 restaurants during the year”, not “ratings, gender, name, registration date, etc.”

Second, your data must be sufficient for your goal. What is enough?

For example, the Keras LSTM text generation tutorial available via this link begins with this statement:

#Example script to generate text from Nietzsche’s writings. At least 20 epochs are required before the generated text starts sounding coherent. It is recommended to run this script on GPU, as recurrent networks are quite computationally intensive. If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

If you open this example and test it (which is a good thing to do), you will see that the text has almost 70 unique characters and about 100k characters in total. If we split it into equal sequences of length 4, we get 25k samples for predicting only 70 possible classes. So this data is enough.

I spent a few weeks trying to build a practical example with the MovieLens 100k data set, but failed: validation accuracy never rose above 10%, and I ended up with an overfitted model. The problem was the data. There are almost 8,000(!) possible classes to predict (unique movies), and after splitting into sequences of length 5 I got only 25,000 samples.

For 70 classes that is fine; for 8,000 it is hopeless.
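A quick back-of-the-envelope comparison of samples per class for the two cases, using the rough numbers quoted above:

# Nietzsche corpus: ~100k characters split into sequences of length 4, 70 classes
nietzsche_samples = 100_000 // 4
print(nietzsche_samples, nietzsche_samples / 70)
#Output => 25000 357.14285714285717  (~357 samples per class)
# MovieLens attempt: ~25,000 sequences over ~8,000 unique movies
print(25_000 / 8_000)
#Output => 3.125  (~3 samples per class)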

I’ll show how to transform your data into a suitable form for a Keras model.

First of all, we need data.

Let it be our example with web resources from the previous chapter.

This data is poor, but it is enough to show each step in a simple way.

“Amazon,Google+,Instagram” , “Instagram,Evernote,Amazon” , “Evernote,Instagram,Google+” — raw data (this could be any of your data, such as items in an online shop, etc.)

Start writing in Python 3.6:

text = "Amazon,Google+,Instagram|Instagram,Evernote,Amazon,Yahoo|Evernote,Instagram,Google+"

I used the special symbol ‘|’ to separate each user’s profile from the others. I also added “Yahoo” to the second user so that the sequences have different lengths, which makes our toy data a bit more realistic (you will never find a user data set where all histories have equal lengths).

Initial imports:

from keras.utils import to_categorical
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
import numpy as np

The Keras Tokenizer will handle mapping a unique ID to each word (item) in the text.

tokenizer = Tokenizer()

tokenizer.fit_on_texts([text])
vocabulary_size = len(tokenizer.word_index) + 1
print('Unique items: %d' % vocabulary_size)
sequences = list()
for line in text.split('|'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    sequences.append(encoded)
print('Total Sequences: %d' % len(sequences))
print(sequences)
#Output =>
#Total Sequences: 3
#[[2, 3, 1], [1, 4, 2, 5], [4, 1, 3]]

New view of our data: [[2, 3, 1], [1, 4, 2, 5], [4, 1, 3]].

Next, our aim is to transform our sequences to the same length. There is a good tool for this, padding, also in the Keras API.

max_len = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')
print(sequences)
print('Max Sequence Length: %d' % max_len)
sequences = np.array(sequences)
#Output =>
#[[0 2 3 1]
# [1 4 2 5]
# [0 4 1 3]]

Now our data is like this: [[0 2 3 1] , [1 4 2 5],[0 4 1 3]].

Finally, we transform our data into the supervised learning form: samples and targets. We will predict the fourth item in a sequence from the first three items.

X, y = sequences[:, :-1], sequences[:, -1]
print(X)
print(y)
# Output
#[[0 2 3]
# [1 4 2]
# [0 4 1]] -- samples
# [1 5 3] -- targets

And one more thing to do: encode our targets via the one-hot encoding technique (target ‘1’ → [0, 1, 0, 0, 0, 0]).

y = to_categorical(y, num_classes=vocabulary_size)
print(y)
# Output
#[[0. 1. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 1.]
# [0. 0. 0. 1. 0. 0.]]

Why don’t we vectorize our input data the same way? We could, but we would lose the sense of the input sequence. One-hot encoding has a bad peculiarity: it erases the temporal meaning and the sense of the words by reducing each one to an equal 1. ‘Instagram’ and ‘Yahoo’ surely have different meanings and should not carry equal weight. How do we solve this?

Keras has an Embedding layer for this. It transforms an input sequence of numbers into vectors, learning weights for each element based on similarity during training. It is explained here.

And there is one more bonus. If we already know similarity weights for the input data (for example, GloVe embeddings for English words), we can load them into the layer and not train it at all.

If you have data like movies or other items, you can use their genres or unique tags to vectorize them and compare them pairwise inside the embedding matrix, as in the sketch below.
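As a minimal sketch of that idea, assume each item ID comes with a small set of genre tags (the tags and the item_genres mapping below are made up purely for illustration). We build a multi-hot genre vector per item and load the matrix into a non-trainable Embedding layer:

import numpy as np
from keras.layers import Embedding

# Hypothetical genre tags per item ID (ID 0 is reserved for padding)
item_genres = {1: {'social'}, 2: {'shop'}, 3: {'social'}, 4: {'notes'}, 5: {'search'}}
all_genres = sorted({g for tags in item_genres.values() for g in tags})

# One row per item ID, one column per genre: a multi-hot embedding matrix
embedding_matrix = np.zeros((len(item_genres) + 1, len(all_genres)))
for item_id, tags in item_genres.items():
    for g in tags:
        embedding_matrix[item_id, all_genres.index(g)] = 1.0

# Plug the precomputed matrix into the layer and freeze it
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=3,
                            trainable=False)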

Good. Let’s make a model.

from keras import Sequential
from keras.layers import Embedding, Dropout, LSTM, Dense
model = Sequential()
# Embedding layer: each item ID becomes a dense vector of length 5
model.add(Embedding(vocabulary_size, 5, input_length=max_len - 1))
model.add(Dropout(0.2))
# Small LSTM layer, because the data set is tiny
model.add(LSTM(3))
model.add(Dropout(0.2))
# Softmax over all possible items
model.add(Dense(vocabulary_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

LSTM models like to overfit. Use dropout and regularization in your layers, and do not make them complex at the beginning.
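As a rough sketch of what this could look like for the same architecture, here is a variant with dropout inside the LSTM and an L2 weight penalty (the values are arbitrary, not tuned, and the regularized variable name is just for illustration):

from keras import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.regularizers import l2

regularized = Sequential()
regularized.add(Embedding(vocabulary_size, 5, input_length=max_len - 1))
regularized.add(LSTM(3, dropout=0.2, recurrent_dropout=0.2,
                     kernel_regularizer=l2(0.01)))
regularized.add(Dense(vocabulary_size, activation='softmax'))
regularized.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])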

There is a lot more to learn before you can build a good model. The aim of this article is to show the trick of preparing data for LSTM models in order to create a recommendation system.

Let’s fit the model and have a look.

h = model.fit(X, y, validation_split=0.2, verbose=1, epochs=10)
import matplotlib.pyplot as plt

plt.plot(h.history['loss'], label='Train loss')
plt.plot(h.history['val_loss'], label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.plot(h.history['acc'], label='Train accuracy')
plt.plot(h.history['val_acc'], label='Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

For sure it’s a bad result, but we see a positive dynamic, and that is important! Our validation loss is decreasing along with the training loss, while our training accuracy is increasing and the validation accuracy does not change. A poor result, but an expected one with such data. Remember my words at the beginning about data.

Let’s make a prediction.

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    for _ in range(n_words):
        # Encode the current text and pad it to the model's input length
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # Predict the ID of the next item
        yhat = model.predict_classes(encoded, verbose=0)
        # Map the predicted ID back to the item name
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
# The seed must be item names, since the tokenizer was fit on them
print(generate_seq(model, tokenizer, 3, 'Instagram Google+ Yahoo', 1))
#Output
# instagram

So, our input ‘Instagram, Google+, Yahoo’ (item IDs 1, 3, 5) gives us the output ‘Instagram’.
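In a real recommendation setting you would usually want several candidates, not just the single most likely item. Here is a small sketch of taking the top-k items from the softmax output, using the model, tokenizer and pad_sequences defined above (recommend_top_k is a hypothetical helper, not part of Keras):

def recommend_top_k(model, tokenizer, history, seq_length=3, k=3):
    # Encode the user's recent history and pad it to the model's input length
    encoded = tokenizer.texts_to_sequences([history])[0]
    encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
    probs = model.predict(encoded, verbose=0)[0]
    # Map IDs back to item names, skipping the padding index 0
    id_to_item = {index: word for word, index in tokenizer.word_index.items()}
    top_ids = probs.argsort()[::-1]
    return [id_to_item[i] for i in top_ids if i in id_to_item][:k]

print(recommend_top_k(model, tokenizer, 'Instagram Google+ Yahoo'))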

It’s a poor and silly result, so I will not evaluate it seriously, and the model is overfitted. This outcome was obvious. Why? Lack of data.

But all these methods are sound and can be used for serious predictions with good data.

Try it !

Conclusion

LSTM models might be used in recommendation systems.

First of all, you need to extend your knowledge of their structure and understand the processes inside them.

Use the text trick to prepare your data. I hope I explained it clearly here.

Your data must be sufficient before you start training a model. With a small data set you will either fail due to overfitting or end up with a weak model.

If you train your model on good data of users’ previous actions, you will get a model that can make predictions in real time.

Thank you for reading!

Author : Bondarenko K. — machine learning engineer and data science enthusiast :)
