Long Short-Term Memory (LSTM) Networks for Time Series Forecasting
Forecasting is arguably the quantitative technique most often applied to time series. For example, one of the principal tasks of central banks today is to accurately predict the inflation rate; measures are then taken to keep it within acceptable levels, allowing the economy to run smoothly.
Most commonly, predicting a given time sequence involves fitting a model to historical data and then using it to forecast future values. This methodology relies on the assumption that upcoming tendencies remain similar to preceding ones. A variety of methods are available to forecast time series: Moving Average, Simple Exponential Smoothing or Auto-Regressive Integrated Moving Average (ARIMA), among others.
Recently, deep learning-based algorithms such as Recurrent Neural Networks (RNN) and their special variant, Long Short-Term Memory networks (LSTM), have gained much attention, with applications in various areas. While maintaining RNN's ability to learn the temporal dynamics of sequential data, LSTM can furthermore handle the vanishing and exploding gradients problem.
We focus on LSTM networks and their usage for time series prediction. For a simplified explanation of LSTM, a dataset of inflation in France is first described, along with a typical prediction task for time series. Two subsequent sections justify why LSTM are said to be an improved version of RNN. We then detail the implementation of LSTM for predicting the Dow Jones Industrial Average index, and compare the results to those given by an ARIMA model. The article ends with some conclusions and perspectives.
Dataset and Prediction Purpose
The data is collected from the OECD's website and contains the monthly annualized inflation in France over three years, from January 2017 to December 2019 (Table 1). Inflation is measured by the annual growth rate of the consumer price of a basket of goods and services. For example, in March 2017 the inflation rate is 1.15%, which simply signifies a 1.15% increase in the consumer price of goods and services compared to March 2016.
Table 1. Monthly inflation in France 
We name the inflation rates of the 36 available months x1, x2, …, x36. The first 24 months are used as training data to fit the LSTM model, while the remainder serves as the test set for validating the results. The purpose of our forecast is to use 12 consecutive time points to predict the next one (refer to Figure 1). Therefore, we divide the sequence into multiple input-output samples, where 12 observations form the input and the expected output is the inflation rate of the 13th month.
Figure 1. Input-output samples of the training data
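The sliding-window construction described above can be sketched in a few lines of NumPy. The inflation values here are hypothetical stand-ins for x1…x36, not the article's data:

```python
import numpy as np

# Hypothetical inflation values standing in for x1..x36 (illustration only)
series = np.round(np.linspace(0.8, 1.8, 36), 2)

def make_samples(data, n_lag=12):
    """Split a sequence into samples: n_lag consecutive inputs -> the next value."""
    X, y = [], []
    for i in range(n_lag, len(data)):
        X.append(data[i - n_lag:i])  # 12 consecutive observations as input
        y.append(data[i])            # the 13th observation as expected output
    return np.array(X), np.array(y)

train = series[:24]                     # first 24 months for training
X_train, y_train = make_samples(train)  # yields 12 input-output samples
```

With 24 training points and a lag of 12, the window slides 12 times, so the first sample maps (x1, …, x12) to x13 and the last maps (x12, …, x23) to x24.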
Recurrent Neural Networks
Recurrent Neural Networks are a family of artificial neural networks for processing sequential data. To illustrate the concept, let's examine how a simple Elman RNN (Elman 1990) forecasts the inflation rate of the 13th month from the previous 12 time points (Figure 2).
Each input sequence, for instance from x1 to x12, is processed in time order to calculate one output o12. Clearly, o12 is expected to be close to x13. First, the network updates its hidden state ht, using both the previous state ht−1 and the current input xt:

ht = tanh(at), with at = U xt + W ht−1.

Notice that this update is recurrent and is thus repeated at every time step t. Then, one can derive the output o12 from h12:

o12 = V h12.
RNN are able to capture the temporal behaviour of a time sequence because the previous state ht−1 is included in the computation of the current state ht. Indeed, ht can be seen as a memory of historical information about what happened before time t.
Figure 2. Input-output sample processed by RNN
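The recurrence above can be sketched as a forward pass in NumPy. The hidden size and the random initialization of U, W, V are illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 4  # illustrative hidden size (an assumption)
U = rng.normal(scale=0.5, size=(n_hidden,))           # input-to-hidden weights
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden-to-hidden weights
V = rng.normal(scale=0.5, size=(n_hidden,))           # hidden-to-output weights

def rnn_forward(x_seq):
    """Process x1..x12 in time order; return the single output o12."""
    h = np.zeros(n_hidden)
    for x in x_seq:
        h = np.tanh(U * x + W @ h)  # recurrent update: ht = tanh(U xt + W ht-1)
    return float(V @ h)             # output read from the last hidden state

o12 = rnn_forward(np.linspace(1.0, 1.5, 12))  # a forecast proxy for x13
```

The same hidden state h is overwritten at every step, which is exactly the recurrence that lets the last state summarize the whole input window.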
Unfortunately, RNN suffer from the problem of vanishing and exploding gradients. The parameters U, V, W of the network are tuned via Backpropagation Through Time (BPTT). Let's examine, for example, the adjustment of W by the error L, which is simply the difference between the output o12 and the target x13.
Thanks to the chain rule, the adjustment of W is based on

∂L/∂W = Σ (t = 1 … 12) (∂L/∂o12) (∂o12/∂h12) (∂h12/∂ht) (∂ht/∂W).

Notice that each derivative ∂h12/∂ht can be developed as

∂h12/∂ht = Π (k = t+1 … 12) ∂hk/∂hk−1 = Π (k = t+1 … 12) Dk W,

where Dk = 1 − tanh²(ak) results from the relation ht = tanh(at) written above. For t = 2, for instance, this product contains the factor W^10. When the base W is smaller than unity, the gradients shrink close to zero (vanishing gradients) due to W^10, meaning no real optimization of W is done. Inversely, with W greater than one, the gradients blow up to infinity (exploding gradients) and it becomes hard to adjust the network. Obviously, the gradients vanish or explode only when the exponent is large (in our case study, the exponent equals 10), i.e. when one attempts to forecast a time point based on a large number of preceding observations. In other words, long-range dependence cannot be incorporated in updating the weights because of the vanishing and exploding gradients issue.
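This behaviour is easy to demonstrate numerically with a scalar RNN. The sketch below multiplies the per-step derivatives Dk·w over 10 steps, with zero inputs so that Dk = 1 and the product reduces to w^10:

```python
import numpy as np

def grad_product(w, n_steps=10):
    """Scalar RNN ht = tanh(w * ht-1 + xt) with zero inputs.
    Returns the product of dh_k/dh_{k-1} = Dk * w over n_steps,
    the factor that multiplies the gradient in BPTT."""
    h, prod = 0.0, 1.0
    for _ in range(n_steps):
        a = w * h
        h = np.tanh(a)
        prod *= (1.0 - np.tanh(a) ** 2) * w  # Dk * w with Dk = 1 - tanh^2(ak)
    return prod

small = grad_product(0.5)  # w < 1: the product collapses toward zero (0.5^10)
large = grad_product(3.0)  # w > 1: the product blows up (3^10)
```

With w = 0.5 the gradient factor is about 0.001, effectively erasing the learning signal from early time steps; with w = 3 it exceeds 59,000, destabilizing the update.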
Long Short-Term Memory Networks
Long Short-Term Memory networks were proposed by Hochreiter & Schmidhuber (1997) to address the vanishing and exploding gradients problem. As can be seen in Figure 3, LSTM keep a structure similar to that of standard RNN but differ in cell composition. The processing of a time point inside an LSTM cell can be described in the four steps below.
First, the forget gate ft is obtained as the output of a sigmoid function σ with xt and ht−1 as inputs:

ft = σ(Uf xt + Wf ht−1).

Second, one calculates the input gate it and the output gate ot in a similar manner:

it = σ(Ui xt + Wi ht−1), ot = σ(Uo xt + Wo ht−1).

Third, besides the hidden state ht, LSTM introduce a memory cell state ct. To update ct, a candidate gt is first computed as the output of a tanh function:

gt = tanh(Ug xt + Wg ht−1).

Then the product it ⊙ gt is added to ft ⊙ ct−1 to update the memory cell state:

ct = ft ⊙ ct−1 + it ⊙ gt.

Remark that the symbol ⊙ represents an element-wise multiplication. In particular, since ft is a sigmoid output, its value lies between 0 and 1. When ft is close to zero, it entirely erases ct−1 via the element-wise multiplication. On the contrary, ct−1 is completely kept when ft is near one. Owing to this property, ft is named the forget gate.

Fourth, the hidden state ht, or the cell output, is given by

ht = ot ⊙ tanh(ct).
Figure 3. Input-output sample processed by LSTM
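The four steps above can be sketched as one LSTM time step in NumPy. The cell size and the weight layout (each gate acting on the concatenation of ht−1 and xt, biases omitted) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 4  # illustrative cell size (an assumption)
# One weight matrix per gate, acting on the concatenation [h_prev, x]
Wf, Wi, Wo, Wg = (rng.normal(scale=0.1, size=(n, n + 1)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    """One LSTM time step following the four stages described above."""
    z = np.concatenate([h_prev, [x]])
    f = sigmoid(Wf @ z)          # step 1: forget gate
    i = sigmoid(Wi @ z)          # step 2: input gate
    o = sigmoid(Wo @ z)          # step 2: output gate
    g = np.tanh(Wg @ z)          # step 3: candidate
    c = f * c_prev + i * g       # step 3: additive memory cell update
    h = o * np.tanh(c)           # step 4: hidden state / cell output
    return h, c

h, c = np.zeros(n), np.zeros(n)
for x in np.linspace(0.0, 1.0, 12):  # feed a 12-point input window
    h, c = lstm_step(x, h, c)
```

Note that c is updated by addition, not by repeated multiplication with a weight matrix; this is the structural difference exploited in the gradient discussion that follows.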
Let's consider the adjustment of W, as in standard RNN, by the difference L between the output of the network o12 and the real value x13. The adjustment of W, via the chain rule, is again based on

∂L/∂W = Σ (t = 1 … 12) (∂L/∂o12) (∂o12/∂h12) (∂h12/∂ht) (∂ht/∂W).

In LSTM, however, the problematic term flows through the memory cell state, and in a simplified development where the gate values are treated as constants, each factor reduces to

∂ck/∂ck−1 ≈ fk,

i.e. the forget gate, rather than a repeated multiplication Dk W as in standard RNN.
Notice that in LSTM there is no direct multiplicative relationship between the hidden states ht−1 and ht of the kind that causes the gradients to vanish or explode in RNN. Instead, only a part of the previous state is preserved in an LSTM cell, while in RNN, ht−1 is entirely reused at each time step. Furthermore, it is the update of ct that controls the time dependence and the information flow. Thus, these additive relationships, together with the forget gate, help alleviate the issue of vanishing and exploding gradients.
Forecast of the Dow Jones Industrial Average
The Dow Jones Industrial Average is one of the stock market indexes most closely followed by investors, financial professionals and the media. It measures the daily price movements of 30 large American companies listed on the Nasdaq and the New York Stock Exchange. The index is widely viewed as a proxy for general market conditions and even for the economy of the United States.
The Dow Jones Industrial Average dataset includes 2767 daily closing records, from January 4th, 2009 to December 31st, 2019. The daily values of 2019, representing about 10% of all observations, make up the test data. The other years are employed to build the LSTM model, with 70% for training and 20% for validation (Figure 4). Each sequence of 12 daily values is used to forecast the 13th point.
Figure 4. Dow Jones Industrial Average 
With code adapted from [7, 8, 9], the forecast is carried out in the following steps. The number of LSTM units is chosen equal to the number of inputs. The RMSprop optimizer and the mean-absolute-error loss function are suitable for our case. RMSprop was proposed by Hinton, Srivastava & Swersky; its main idea is to use a moving average of the squared gradients to normalize the gradient itself, which has a damping effect against the vanishing and exploding gradients problem.
Furthermore, it can be noticed from Figure 5 that 200 epochs are enough to reach convergence.
Figure 5. Loss function
Step 1 Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import RMSprop
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
Step 2 Load and prepare data
# Load data
series_DowJones = pd.read_csv("/Input/Dow_Jones_Industrial_Average.csv", header=0,
                              parse_dates=['Date'], index_col="Date", squeeze=True)
series_DowJones.index = pd.to_datetime(series_DowJones.index)
# Convert European-formatted price strings (e.g. "28.538,44") to floats, sort by date
series = pd.DataFrame(series_DowJones['Dernier'].str.replace('.', '', regex=False)
                      .str.replace(',', '.', regex=False).astype(float).sort_index())
# Divide data into train and test sets
train = series.loc[series.index < '2019-01-01']
test = series.loc[series.index >= '2019-01-01']
# Normalize training data
sc = MinMaxScaler(feature_range=(0, 1))
train_scaled = sc.fit_transform(train)
# Create supervised samples with 12 inputs and 1 output
n_lag = 12
X_train = []
y_train = []
for i in range(n_lag, len(train)):
    X_train.append(train_scaled[i - n_lag:i, 0])
    y_train.append(train_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
# Reshape train set to (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
Step 3 Build and train model
# Initiate model
regressor = Sequential()
# Add one LSTM layer
regressor.add(LSTM(units=n_lag, input_shape=(X_train.shape[1], 1)))
# Add an output layer
regressor.add(Dense(units=1))
# Compile the model
opt = RMSprop(lr=0.0001)
regressor.compile(optimizer=opt, loss='mae')
# Fit LSTM to the training set with a split for validation
history = regressor.fit(X_train, y_train, validation_split=0.2,
                        epochs=200, batch_size=32)
Step 4 Forecast future values
# Prepare test set
inputs = series[len(series) - len(test) - n_lag:].values
inputs = inputs.reshape(-1, 1)
inputs = sc.transform(inputs)
X_test = []
for i in range(n_lag, n_lag + len(test)):
    X_test.append(inputs[i - n_lag:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
# Forecast
predict_scaled = regressor.predict(X_test)
predict = sc.inverse_transform(predict_scaled)
predict = pd.DataFrame(predict)
predict.columns = ['Dernier']
predict.index = test.index
The forecast results for the Dow Jones Industrial Average in 2019 are shown in Figure 6. Comparing the outputs of the model against the test data, the root-mean-squared error (RMSE) obtained for LSTM is 931, approximately 3.53% of the data range. The prediction is thus close to the real values, and the fluctuating trends of the index are well captured by the model.
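The RMSE and its expression as a percentage of the data range can be computed with a short helper. The values below are toy numbers for illustration, not the article's data:

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error between two sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values for illustration only
actual = np.array([100.0, 102.0, 101.0])
predicted = np.array([101.0, 101.0, 102.0])

err = rmse(actual, predicted)                       # 1.0 on these toy values
pct_of_range = err / (actual.max() - actual.min())  # error relative to data range
```

Expressing the error relative to the range, as the article does, makes RMSE values comparable across series on very different scales.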
An ARIMA model is developed with the package statsmodels, and its parameters (p, d, q) = (3, 1, 3) are obtained by a grid search aiming to minimize the Akaike information criterion (AIC). It can be seen from Figure 6 that ARIMA reaches a slightly smaller RMSE (803, about 3.04% of the data range) and is thus more precise than LSTM. However, from a practical point of view, the parameter selection of ARIMA requires a time-consuming grid search, and each forecast needs another model fitting with updated data, while a single fitting suffices for LSTM, without any feature engineering.
An overview of the LSTM technique was presented above via a common use case of time series forecasting. We clarified mathematically, through a set of simplified equations, how LSTM tackle the problem of vanishing and exploding gradients that occurs in standard RNN. An application was then detailed, showing an implementation of an LSTM model for the Dow Jones Industrial Average. In this use case, both LSTM and ARIMA give good prediction results when examined against the test set. However, LSTM is more suitable for time series forecasting in practice, requiring a single fitting and no parameter optimization.
An alternative architecture to LSTM networks is Gated Recurrent Units (GRU) (Cho et al. 2014). It has been reported that, on the one hand, GRU have fewer parameters and thus may train faster or need less data to generalize, while on the other hand, better results can be obtained with enough data thanks to the greater expressive power of LSTM.
Rumelhart, Hinton & Williams (1986). Learning representations by back-propagating errors. Nature 323:533-536.
Elman (1990). Finding structure in time. Cognitive Science 14(2):179-211.
Hochreiter & Schmidhuber (1997). Long Short-Term Memory. Neural Computation 9(8):1735-1780.
Hinton, Srivastava & Swersky (?). Neural Networks for Machine Learning. Coursera course.
Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk & Bengio (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
 Photo Credits — ©max_776 — stock.adobe.com