# Long Short-Term Memory (LSTM) Networks for Time Series Forecasting

# Introduction

Forecasting is arguably the most widely applied quantitative technique for time series. For example, one of the principal tasks of central banks nowadays is to accurately predict the inflation rate. Necessary measures are then taken to keep it within acceptable levels, allowing the economy to run smoothly.

Most commonly, predicting a given time sequence involves fitting historical data to build a model and then using it to forecast future values. This methodology relies on the assumption that upcoming tendencies remain similar to preceding ones. A variety of methods are available to forecast time series: Smoothing Average, Simple Exponential Smoothing or Auto-Regressive Integrated Moving Average (ARIMA), among others.

Recently, deep learning-based algorithms such as Recurrent Neural Networks (RNN) and their special variant, Long Short-Term Memory Networks (LSTM), have gained much attention with applications in various areas. While maintaining RNN’s ability to learn the temporal dynamics of sequential data, LSTM can furthermore handle the vanishing and exploding gradients problem [1].

This article focuses on LSTM networks and their usage for time series prediction. For a simplified explanation of LSTM, a dataset of inflation in France is first described along with a usual prediction purpose for time series. Two subsequent sections justify why LSTM are said to be an improved version of RNN. We also detail the implementation of LSTM for predicting the Dow Jones Industrial Average Index. Results are compared to those given by an ARIMA model. The article ends with some conclusions and perspectives.

# Dataset and Prediction Purpose

The data are collected from the OECD’s website [2] and contain the monthly annualized inflation in France over three years, from January 2017 to December 2019 (Table 1). Inflation is measured by the annual growth rate of the consumer price of a basket of goods and services. For example, the inflation rate of 1.15 % in March 2017 means that the consumer price of goods and services rose by 1.15 % compared to March 2016.

| Month | Inflation [%] | x_{t} |
|---|---|---|
| January 2017 | 1.34 | x_{1} |
| February 2017 | 1.21 | x_{2} |
| March 2017 | 1.15 | x_{3} |
| … | … | … |
| October 2019 | 0.76 | x_{34} |
| November 2019 | 1.03 | x_{35} |
| December 2019 | 1.45 | x_{36} |

*Table 1. Monthly inflation in France [2]*

We denote the inflation rates of the 36 available months as x_{1}, x_{2}, …, x_{36}. The first 24 months are used as training data to fit the LSTM model, while the remaining 12 serve as the test set for validating the results. The purpose of our forecast is to use 12 consecutive time points to predict the next one (refer to Figure 1). Therefore, we divide the sequence into multiple input-output samples in which 12 observations are the input and the expected output is the inflation rate of the 13^{th} month.
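The division into input-output samples can be sketched as a sliding window. This is a minimal illustration: the function name `make_windows` is hypothetical, and the sequence below is a placeholder, not the OECD inflation figures.

```python
import numpy as np

def make_windows(values, n_lag=12):
    """Split a 1-D sequence into (inputs, target) samples.

    Each sample uses n_lag consecutive observations as input and the
    observation that immediately follows as the expected output.
    """
    values = np.asarray(values, dtype=float)
    X, y = [], []
    for i in range(n_lag, len(values)):
        X.append(values[i - n_lag:i])  # n_lag consecutive inputs
        y.append(values[i])            # the observation that follows them
    return np.array(X), np.array(y)

# Placeholder standing in for the 24 training months x_1 ... x_24;
# these are NOT the OECD inflation figures.
train = np.arange(1, 25, dtype=float)
X_train, y_train = make_windows(train, n_lag=12)
print(X_train.shape, y_train.shape)  # (12, 12) (12,)
```

With 24 training points and 12 lags, the window can slide 12 times, so the training set contains 12 samples.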

*Figure 1. Input-output samples of the training data*

# Recurrent Neural Networks

Recurrent Neural Networks [3] are a family of artificial neural networks for processing sequential data. To illustrate the concept, let’s examine how a simple Elman RNN [4] forecasts the inflation rate of the 13^{th} month from the previous 12 time points (Figure 2).

Each input sequence, for instance, from x_{1} to x_{12}, is processed in time order to calculate one output o_{12}. Clearly, o_{12} is expected to be close to x_{13}. First, the network updates its hidden state h_{t}, using both the latest one h_{t−1} and the current input x_{t}. Notice that this update is recurrent and is thus repeated at every time step t.

Then, one can derive the output o_{12} from h_{12}.
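The recurrence described above can be sketched in a few lines of Python. The sketch assumes the usual scalar Elman form, a_{t} = U x_{t} + W h_{t-1}, h_{t} = tanh(a_{t}) and o_{12} = V h_{12}, with biases omitted. Assigning U to the input and V to the output is an assumption, and so are the numerical values of the weights and inputs.

```python
import math

def elman_forward(xs, U, W, V, h0=0.0):
    """Scalar Elman RNN: return the final output and all hidden states."""
    h = h0
    hidden_states = []
    for x_t in xs:
        a_t = U * x_t + W * h   # pre-activation mixes input and previous state
        h = math.tanh(a_t)      # new hidden state h_t
        hidden_states.append(h)
    o = V * h                   # output read from the last hidden state
    return o, hidden_states

# Illustrative inputs and weights (hypothetical, not fitted values).
xs = [0.1 * k for k in range(1, 13)]   # stands in for x_1 ... x_12
o_12, hs = elman_forward(xs, U=0.5, W=0.8, V=1.2)
print(len(hs), o_12)
```

The same three weights are reused at every time step, which is what makes the update recurrent.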

RNN can capture the temporal behaviour of a time sequence because the previous state h_{t-1} is included in the computation of the current state h_{t}. Indeed, h_{t} can be seen as a memory of what happened before t.

*Figure 2. Input-output sample processed by RNN*

Unfortunately, RNN suffer from the problem of vanishing and exploding gradients. In fact, the parameters U, V, W of the network are tuned via Back Propagation Through Time (BPTT). Let’s examine, for example, the adjustment of W by the error L which is simply the difference between the output o_{12} and the target x_{13}.

Thanks to the chain rule, the adjustment of W is

Notice that each derivative, coloured in red, can be developed as

Notice that D_{t} = 1 - tanh^{2}(a_{t}) results from the relation h_{t} = tanh(a_{t}) written above. Thus,

When the base W is smaller than unity, the gradients shrink close to zero (vanishing gradients) due to W^{10}, highlighted in blue, meaning that no real optimization of W is done. Conversely, with W greater than one, the gradients blow up towards infinity (exploding gradients) and it becomes hard to adjust the network. The gradients vanish or explode only when the exponent is large (in our case study it equals 10), that is, when one attempts to forecast a time point from a large number of preceding observations. In other words, long-range dependence cannot be incorporated in updating the weights because of vanishing and exploding gradients.
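A small numerical check makes the effect of the exponent tangible. The sketch assumes the scalar form in which each factor ∂h_{t}/∂h_{t-1} equals W·D_{t}, with D_{t} = 1 - tanh^{2}(a_{t}). The function name and the pre-activation values are illustrative choices, not values from a trained network.

```python
import math

def recurrent_gradient_factor(W, pre_activations):
    """Product of dh_t/dh_{t-1} = W * (1 - tanh(a_t)**2) over the given steps."""
    factor = 1.0
    for a_t in pre_activations:
        D_t = 1.0 - math.tanh(a_t) ** 2
        factor *= W * D_t
    return factor

# Ten recurrent steps, matching the exponent 10 discussed above.
# Pre-activations are set to 0 so that D_t = 1 and only W**10 remains.
a = [0.0] * 10
for W in (0.5, 1.0, 2.0):
    print(W, recurrent_gradient_factor(W, a))
# 0.5 gives about 0.00098 (vanishing), 2.0 gives 1024.0 (exploding)
```

Since D_{t} never exceeds one, non-zero pre-activations only shrink the product further, which makes vanishing gradients the more common of the two failure modes.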

# Long Short-Term Memory Networks

Long Short-Term Memory Networks were proposed by [5] to address the vanishing and exploding gradients problem. As can be seen in Figure 3, LSTM keep a structure similar to that of standard RNN but differ in cell composition. The processing of a time point inside an LSTM cell can be described in the four steps below.

First, the forget state f_{t} is obtained as the output of a sigmoid function σ with x_{t} and h_{t-1} as inputs.

Second, one may calculate the input state i_{t} and the output state o_{t} in a similar manner.

Third, besides a hidden state h_{t}, LSTM introduce a memory cell state c_{t}. To update c_{t}, a candidate g_{t} is first computed as the output of a tanh function.

Then the product i_{t} ⊙ g_{t} is added to f_{t} ⊙ c_{t-1}, the previous cell state scaled by the forget state, to update the memory cell state: c_{t} = f_{t} ⊙ c_{t-1} + i_{t} ⊙ g_{t}.

Note that the symbol ⊙ represents an element-wise multiplication. In particular, since f_{t} is a sigmoid output, its value lies between 0 and 1. When it is close to zero, f_{t} almost entirely erases c_{t-1} via the element-wise multiplication. On the contrary, c_{t-1} is almost completely kept when f_{t} is near one. Owing to this property, f_{t} is named the forget state.

Fourth, the hidden state h_{t} or the cell output is given by

*Figure 3. Input-output sample processed by LSTM*
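The four steps can be put together in a short scalar sketch. It assumes the standard LSTM formulation with a forget gate, in which each gate is a function of w_x·x_{t} + w_h·h_{t-1} + b and the cell output is h_{t} = o_{t}·tanh(c_{t}). The parameter layout, the helper names and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps each gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("f"))          # step 1: forget state
    i_t = sigmoid(pre("i"))          # step 2: input state
    o_t = sigmoid(pre("o"))          #         output state
    g_t = math.tanh(pre("g"))        # step 3: candidate
    c_t = f_t * c_prev + i_t * g_t   #         additive cell-state update
    h_t = o_t * math.tanh(c_t)       # step 4: hidden state (cell output)
    return h_t, c_t

# Hypothetical weights, identical for all gates purely for brevity.
params = {name: (0.5, 0.5, 0.0) for name in ("f", "i", "o", "g")}
h, c = 0.0, 0.0
for x_t in [0.1 * k for k in range(1, 13)]:   # stands in for x_1 ... x_12
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Setting the forget bias to a large negative value drives f_{t} towards zero and wipes c_{t-1}, while a large positive value keeps it almost intact, as described in the third step.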

Let’s consider the adjustment of W, as in standard RNN, by the difference L between the output of the network o_{12} and the real value x_{13}.

The adjustment of W, via the chain rule, is based on

Each derivative of the problematic term, coloured in red, can be calculated as

Hence,

Notice that LSTM lack the direct relationship between the hidden states h_{t-1} and h_{t} that causes the gradients to vanish or explode in RNN. Instead, only a part of the previous state h_{t-1} is preserved in an LSTM cell, while in RNN h_{t−1} is entirely used at each time step. Furthermore, it is the update of c_{t} that controls the time dependence and the information flow. Thus, these additive relationships together with the forget gate help to alleviate the issue of vanishing and exploding gradients.
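To make the contrast with the RNN case concrete, consider only the direct path through the memory cell and treat the gates as constants with respect to c_{t-1}. This is a simplifying assumption, since the full derivative also contains terms flowing through h_{t-1}. With the standard update c_{t} = f_{t} ⊙ c_{t-1} + i_{t} ⊙ g_{t}, one obtains approximately:

```latex
\frac{\partial c_t}{\partial c_{t-1}} \approx f_t
\quad\Longrightarrow\quad
\frac{\partial c_{12}}{\partial c_{2}} \approx \prod_{t=3}^{12} f_t
```

Unlike the fixed factor W^{10} of the RNN, each f_{t} is recomputed from the data at step t, so the network can learn to keep it close to one when long-range information has to be preserved.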

# Forecast of the Dow Jones Industrial Average

The Dow Jones Industrial Average is one of the stock market indexes most followed by investors, financial professionals and the media. It measures the daily price movements of 30 large American companies listed on the Nasdaq and the New York Stock Exchange. The Dow Jones Industrial Average is widely viewed as a proxy for general market conditions and even for the economy of the United States.

The Dow Jones Industrial Average dataset is downloaded from [6] and includes 2767 closing records from January 4^{th} 2009 to December 31^{st} 2019. The daily values of 2019, representing about 10% of all observations, make up the test data. The other years are employed to build the LSTM model, with 70% for training and 20% for validation (Figure 4). Every 12 consecutive daily values are used to forecast the 13^{th} point.

*Figure 4. Dow Jones Industrial Average [6]*

With the code adapted from [7, 8, 9], the forecast can be carried out in the following steps. The number of units in LSTM is chosen to be equal to the number of inputs. The RMSprop optimizer and the mean-absolute-error loss function are suitable for our case. RMSprop was proposed by [10]; its main idea is to use a moving average of the squared gradients to normalize the gradient itself. It thus has a damping effect against the vanishing and exploding gradients problem.
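The normalization idea behind RMSprop can be illustrated for a single scalar parameter. This is a sketch of the principle only, not the Keras implementation: the function name is hypothetical, and the decay rate `rho` and the constant `eps` are assumed values.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.0001, rho=0.9, eps=1e-7):
    """One RMSprop update for a single scalar parameter."""
    # Moving average of the squared gradient
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Gradient normalised by the root of that average
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# A tiny and a huge gradient lead to steps of comparable size.
for g in (1e-4, 1e4):
    w_new, _ = rmsprop_step(w=1.0, grad=g, avg_sq=0.0)
    print(g, 1.0 - w_new)
```

Because the gradient is divided by its own recent magnitude, the step size depends mostly on the learning rate rather than on the raw scale of the gradient.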

Furthermore, it can be noticed from Figure 5 that epochs=200 is enough to get convergence.

*Figure 5. Loss function*

#### Step 1 Import libraries

```python
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import RMSprop
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
```

#### Step 2 Load and prepare data

```python
# Load data (squeeze=True is dropped: recent pandas versions no longer accept it
# and the file is used as a DataFrame below)
series_DowJones = pd.read_csv("/Input/Dow_Jones_Industrial_Average.csv",
                              header=0, parse_dates=['Date'], index_col="Date")
series_DowJones.index = pd.to_datetime(series_DowJones.index)

# Convert the 'Dernier' (last price) column from strings with '.' as thousands
# separator and ',' as decimal mark to floats; regex=False makes '.' a literal
# dot rather than a regular-expression wildcard
series = pd.DataFrame(series_DowJones['Dernier']
                      .str.replace('.', '', regex=False)
                      .str.replace(',', '.', regex=False)
                      .astype(float)
                      .sort_index())

# Divide data into train and test sets
train = series.loc[series.index < '2019-01-01']
test = series.loc[series.index >= '2019-01-01']

# Normalize training data
sc = MinMaxScaler(feature_range=(0, 1))
train_scaled = sc.fit_transform(train)

# Create supervised data with 12 inputs and 1 output
n_lag = 12
X_train = []
y_train = []
for i in range(n_lag, len(train)):
    X_train.append(train_scaled[i-n_lag:i, 0])
    y_train.append(train_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape train set to (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
```

#### Step 3 Build and train model

```python
# Initiate model
regressor = Sequential()

# Add one LSTM layer
regressor.add(LSTM(units=n_lag, input_shape=(X_train.shape[1], 1)))

# Add an output layer
regressor.add(Dense(units=1))

# Compile the model (learning_rate replaces the deprecated lr argument)
opt = RMSprop(learning_rate=0.0001)
regressor.compile(optimizer=opt, loss='mae')

# Fit LSTM to the training set with a split for validation
history = regressor.fit(X_train, y_train, validation_split=0.2, epochs=200, batch_size=32)
```

#### Step 4 Forecast future values

```python
# Prepare test set
inputs = series[len(series) - len(test) - n_lag:].values
inputs = inputs.reshape(-1, 1)
inputs = sc.transform(inputs)
X_test = []
for i in range(n_lag, n_lag + len(test)):
    X_test.append(inputs[i-n_lag:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Forecast
predict_scaled = regressor.predict(X_test)
predict = sc.inverse_transform(predict_scaled)
predict = pd.DataFrame(predict)
predict.columns = ['Dernier']
predict.index = test.index
```

The forecast results are shown in Figure 6 for the Dow Jones Industrial Average in 2019. Comparing the outputs of the model against the test data, the root-mean-square error (RMSE) obtained for LSTM is 931, which is approximately 3.53% of the data range. The prediction is thus close to the real values, and the fluctuating trends of the index are well captured by the model.
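For reference, the two quantities quoted above can be computed as follows. The helper names are hypothetical, the numbers are toy values rather than Dow Jones data, and the choice of the series over which the range is taken is an assumption, since the article does not specify it.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def percent_of_range(error, reference):
    """Express an error as a percentage of the range (max - min) of reference."""
    reference = np.asarray(reference, dtype=float)
    return 100.0 * error / float(reference.max() - reference.min())

# Toy numbers for illustration only (not Dow Jones values).
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]
err = rmse(actual, predicted)
print(err, percent_of_range(err, actual))
```

With the variables of Step 4, `actual` would presumably be `test['Dernier']` and `predicted` would be `predict['Dernier']`.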

An ARIMA model is developed with the package statsmodels, and its parameters (p, d, q) = (3, 1, 3) are obtained by a grid search aiming to minimize the Akaike information criterion (AIC). It can be seen from Figure 6 that ARIMA reaches a slightly smaller RMSE (803, about 3.04% of the data range) and is thus more precise than LSTM. However, from a practical point of view, the parameter selection of ARIMA requires a time-consuming grid search and each forecast needs another model fitting with updated data, whereas a single fit suffices for LSTM without any feature engineering.
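The re-fitting scheme mentioned above, in which every forecast requires a new fit on updated data, can be sketched as a walk-forward loop. To keep the example self-contained, the ARIMA fit is replaced by a naive stand-in model. Both function names are hypothetical, and in the ARIMA setting `fit_and_predict` would wrap fitting an order (3, 1, 3) model on `history` and returning its one-step forecast.

```python
def walk_forward_forecast(history, test, fit_and_predict):
    """One-step-ahead forecasts, re-fitting on all data seen so far."""
    history = list(history)
    predictions = []
    for actual in test:
        # Re-fit on the updated history, then predict the next point
        predictions.append(fit_and_predict(history))
        # The true value becomes available and extends the history
        history.append(actual)
    return predictions

def naive_last_value(history):
    """Stand-in model (not ARIMA): predicts the last observed value."""
    return history[-1]

preds = walk_forward_forecast([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], naive_last_value)
print(preds)  # [3.0, 4.0, 5.0]
```

The loop calls the fitting routine once per test point, which is where the extra cost of this scheme comes from.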

*Figure 6. Prediction of Dow Jones Industrial Average*

# Conclusions

This article presented an overview of the LSTM technique through a common use case of time series forecasting. We clarified mathematically, with a set of simplified equations, how LSTM tackle the problem of vanishing and exploding gradients which occurs in standard RNN. An application was detailed at the end of the article, showing an implementation of an LSTM model for the Dow Jones Industrial Average. In this use case, both LSTM and ARIMA give good prediction results when checked against the test set. However, LSTM is more practical for time series forecasting, since it requires a single fit and no grid search for parameter selection.

An alternative architecture to LSTM networks is the Gated Recurrent Unit (GRU) [11]. As reported by [12], on the one hand, GRU have fewer parameters and thus may train faster or need less data to generalize; on the other hand, with enough data, better results could be obtained thanks to the greater expressive power of LSTM.

# References

[1] http://colah.github.io/posts/2015-08-Understanding-LSTMs

[2] https://data.oecd.org/fr/price/inflation-ipc.htm

[3] Rumelhart, Hinton & Williams (1986). Nature. 323:533-536

[4] Elman (1990). Cognitive Science. 14(2): 179-211

[5] Hochreiter & Schmidhuber (1997). Neural Computation. 9(8):1735-1780

[8] https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python

[9] https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting

[10] Hinton, Srivastava & Swersky (?) Coursera Course on Neural Networks

[11] Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk, Bengio (2014). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing

[13] Photo Credits — ©max_776 — stock.adobe.com
