In this project, we’ll be working with data from the S&P500 Index. We will be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

We’ll be working with a csv file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are:

  • Date — The date of the record.
  • Open — The opening price of the day (when trading starts).
  • High — The highest trade price during the day.
  • Low — The lowest trade price during the day.
  • Close — The closing price for the day (when trading is finished).
  • Volume — The number of shares traded.
  • Adj Close — The daily closing price, adjusted retroactively to include any corporate actions.

We’ll be using this dataset to develop a predictive model. We’ll train the model with data from 1950–2012, and try to make predictions from 2013–2015.

import pandas as pd

df = pd.read_csv('sphist.csv')

Now, let's convert the Date column to the datetime format, which will let us use date-based comparisons, filtering, and sorting on that column.

from datetime import datetime

df['Date'] = pd.to_datetime(df['Date'])

The code below filters the dataframe down to the rows whose Date is after 2015-04-01:

df[df['Date'] > datetime(year = 2015, month =4, day=1)]

With the Date column converted to datetime, we can sort the dataframe on the Date column. It's currently in descending order, but we'll want it to be in ascending order for some of the next steps.

df = df.sort_values('Date').reset_index()

Now, let's check the basic information of our dataset with
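A quick way to inspect this is `` together with a null-count check. The snippet below uses a tiny made-up frame with the same columns as sphist.csv (the values are illustrative only); with the real data you would call `` directly.

```python
import pandas as pd

# Hypothetical mini-frame mirroring the sphist.csv columns;
# the numbers are made up, only the inspection calls matter.
prices = pd.DataFrame({
    'Date': pd.to_datetime(['1950-01-03', '1950-01-04']),
    'Open': [16.66, 16.85], 'High': [16.66, 16.85],
    'Low': [16.66, 16.85], 'Close': [16.66, 16.85],
    'Volume': [1260000.0, 1890000.0],
    'Adj Close': [16.66, 16.85],
})

prices.info()  # prints dtypes and non-null counts per column
```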

From the dataset info, we can see that there are no missing values, and all columns are numeric except the Date column, which is of course in datetime format.

Generating indicators

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and we can’t treat them as such.

This means we have to be extra careful to not inject “future” knowledge into past rows when we do training and prediction. Injecting future knowledge will make our model look good when we are training and testing it, but will make it fail in the real world. This is how many algorithmic traders lose money.

Here are some indicators that are interesting to generate for each row:

  • The average price from the past 5 days.
  • The average price for the past 30 days.
  • The average price for the past 365 days.
  • The ratio between the average price for the past 5 days, and the average price for the past 365 days.
  • The standard deviation of the price over the past 5 days.
  • The standard deviation of the price over the past 365 days.
  • The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.

“Days” means trading days, so if we're computing the average of the past 5 days, it should be the 5 most recent trading dates before the current one. Assume that "price" means the Close column. Always be careful not to include the current price in these indicators! Each row's indicators are used to predict that row's price, so they must be built only from earlier prices.

We picked several indicators to compute, generating a separate column for each one.

#Calculate the mean for the past 5, 30, 365 days
df['day_5'] = df['Close'].rolling(5).mean().shift(1)
df['day_30'] = df['Close'].rolling(30).mean().shift(1)
df['day_365'] = df['Close'].rolling(365).mean().shift(1)

#Calculate the STD for the past 5, 365 days
df['std_5'] = df['Close'].rolling(5).std().shift(1)
df['std_365'] = df['Close'].rolling(365).std().shift(1)

#Calculate the mean volume for the past 5, 365 days
df['day_5_volume'] = df['Volume'].rolling(5).mean().shift(1)
df['day_365_volume'] = df['Volume'].rolling(365).mean().shift(1)

#Calculate the STD of the average volume over the past five days
df['5_volume_std'] = df['day_5_volume'].rolling(5).std().shift(1)
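To see why the `rolling().mean().shift(1)` pattern above avoids leakage, here is a toy check on a short series: after the shift, each row's value is built only from strictly earlier observations.

```python
import pandas as pd

# Toy series to confirm that rolling(...).mean().shift(1) uses only
# strictly earlier values -- the current day's price is excluded.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
ma2 = s.rolling(2).mean().shift(1)

# ma2 at position 2 is the mean of s[0] and s[1] -> 1.5, not including s[2]
print(ma2.tolist())  # [nan, nan, 1.5, 2.5, 3.5]
```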

Now, let's check the first 10 and last 10 rows of our dataset.
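One way to view both ends at once is to concatenate `head` and `tail`. The sketch below uses a hypothetical stand-in frame; with the real dataframe you would call `df.head(10)` and `df.tail(10)` directly.

```python
import pandas as pd

# Stand-in frame (made-up values); with the real data, replace
# `demo` with `df` to see the first and last 10 rows together.
demo = pd.DataFrame({'Close': [float(i) for i in range(20)]})
both_ends = pd.concat([demo.head(10), demo.tail(10)])
print(both_ends)
```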


Splitting up the data

Since we're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. We'll need to remove these rows before we split the data.

df = df[df['Date'] >= datetime(year=1951, month=1, day=3)]

Now we have to remove any rows with NaN values too.

Let's check how many NaN values there are in each column.
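`df.isnull().sum()` gives the per-column counts. The demo below reproduces the situation on a small made-up frame: the first rows of a rolling-window indicator are NaN by construction, which is exactly what the date filter and `dropna` step clean up.

```python
import pandas as pd
import numpy as np

# Hypothetical frame standing in for df: a rolling(5) indicator
# followed by shift(1) leaves its first 5 rows as NaN.
demo = pd.DataFrame({'Close': np.arange(10.0)})
demo['day_5'] = demo['Close'].rolling(5).mean().shift(1)

print(demo.isnull().sum())  # Close: 0, day_5: 5
```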


It looks like only 3 columns contain null values, each with the same count. Now let's drop those rows.

df = df.dropna(axis = 0)

Let's check again for null values.


Let's now generate two new dataframes to use in making our algorithm. train should contain any rows in the data with a date less than 2013-01-01. test should contain any rows with a date greater than or equal to 2013-01-01.

train = df[df['Date'] < datetime(year = 2013, month = 1, day = 1)]
test = df[df['Date'] >= datetime(year = 2013, month = 1, day = 1)]

Now let's print their shapes and columns.

train.shape
(15486, 16)

test.shape
(739, 16)

train.columns
Index(['index', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close',
       'day_5', 'day_30', 'day_365', 'std_5', 'std_365', 'day_5_volume',
       'day_365_volume', '5_volume_std'],
      dtype='object')

Making predictions

Now, we can define an error metric, train a model using the train data, and make predictions on the test data.

We are now ready to train the algorithm, make predictions, and calculate the mean squared error. Our target column is Close.

We leave out all of the original columns (Close, High, Low, Open, Volume, Adj Close, Date) when training the model, since they all contain same-day knowledge that we don't want to feed the model as features. The Close column is used only as the target.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ['day_5', 'day_30', 'day_365', 'std_5', 'std_365', 'day_5_volume',
'day_365_volume', '5_volume_std']
target = train['Close']

lr = LinearRegression()[features], target)
predictions = lr.predict(test[features])
mse = mean_squared_error(test['Close'], predictions)
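MSE is in squared price units, which can be hard to interpret. A small sketch of also reporting RMSE and MAE, using made-up actual and predicted values (the real ones depend on the trained model above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical actual/predicted values, only to show the metric calls;
# in the project these come from test['Close'] and lr.predict(...).
actual = np.array([100.0, 101.0, 102.0])
predicted = np.array([99.0, 101.5, 103.0])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # back in the same units as the Close price
mae = mean_absolute_error(actual, predicted)
print(mse, rmse, mae)
```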

Let’s now make a prediction just one day ahead.

train_1 = df.iloc[:-1]
test_1 = df.iloc[-1:][features], train_1['Close'])
predictions_1 = lr.predict(test_1[features])
mse_1 = mean_squared_error(test_1['Close'], predictions_1)

There's still a lot of room for improvement on the indicator side. We can also make significant structural improvements to the algorithm and pull in data from other sources.

  • Accuracy would improve greatly by making predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02, make predictions for 2013-01-03, then train another model using data from 1951-01-03 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what we'd do if we were trading with the algorithm.
  • We can also significantly improve the algorithm itself. Try other techniques, like a random forest, and see if they perform better.
  • We can incorporate outside data, such as the weather in New York City (where most trading happens) the day before, or the amount of Twitter activity around certain stocks.
  • We can make the system real-time by writing an automated script to download the latest data when the market closes and make predictions for the next day.
  • Finally, we can make the system “higher-resolution”. We're currently making daily predictions, but we could make hourly, minute-by-minute, or second-by-second predictions. This will require obtaining more data, though. We could also make predictions for individual stocks instead of the S&P500.
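The first idea above, retraining each day and predicting only one day ahead, can be sketched as a walk-forward loop. The frame and feature column below are made up purely to show the loop structure; with the real data you would iterate over the test-period dates and use the indicator columns as features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic frame: one hypothetical feature plus a noisy linear target.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({'feat': rng.normal(size=n)})
demo['Close'] = 2.0 * demo['feat'] + rng.normal(scale=0.1, size=n)

start = 40  # index of the first day to predict
preds, actuals = [], []
for i in range(start, n):
    history = demo.iloc[:i]          # everything strictly before day i
    today = demo.iloc[i:i + 1]       # the single day being predicted
    model = LinearRegression()
    model.fit(history[['feat']], history['Close'])
    preds.append(model.predict(today[['feat']])[0])
    actuals.append(today['Close'].iloc[0])

mse = mean_squared_error(actuals, preds)
print(mse)
```

Refitting inside the loop is slow on 60+ years of daily data, but it guarantees that each prediction uses only information available before that day.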

You can find the link here: Github

I post articles on Data Science | Machine Learning | Deep Learning. Connect with me on LinkedIn: