Predicting Stock Market

  • Date — The date of the record.
  • Open — The opening price of the day (when trading starts).
  • High — The highest trade price during the day.
  • Low — The lowest trade price during the day.
  • Close — The closing price for the day (when trading is finished).
  • Volume — The number of shares traded.
  • Adj Close — The daily closing price, adjusted retroactively to include any corporate actions.
import pandas as pd

df = pd.read_csv('sphist.csv')
df.head()
from datetime import datetime

df['Date'] = pd.to_datetime(df['Date'])
df[df['Date'] > datetime(year = 2015, month =4, day=1)]
df = df.sort_values('Date').reset_index()
df.head()
df.info()

Generating indicators

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and we can’t treat them as such.

  • The average price from the past 5 days.
  • The average price for the past 30 days.
  • The average price for the past 365 days.
  • The ratio between the average price for the past 5 days, and the average price for the past 365 days.
  • The standard deviation of the price over the past 5 days.
  • The standard deviation of the price over the past 365 days.
  • The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.
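
As a minimal sketch of how indicators like these can be computed without leaking the current day's price, the example below uses a small made-up price series (not the S&P 500 data) and short windows of 2 and 4 rows in place of 5 and 365. The frame name `prices` and the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up closing prices, used only for illustration.
prices = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# rolling(n) includes the current row, so shift(1) moves each value down one
# row: the indicator for a given day then uses only earlier days.
prices["avg_2"] = prices["Close"].rolling(2).mean().shift(1)
prices["avg_4"] = prices["Close"].rolling(4).mean().shift(1)
prices["std_2"] = prices["Close"].rolling(2).std().shift(1)
prices["std_4"] = prices["Close"].rolling(4).std().shift(1)

# Ratios of the short-window statistic to the long-window statistic.
prices["avg_ratio"] = prices["avg_2"] / prices["avg_4"]
prices["std_ratio"] = prices["std_2"] / prices["std_4"]

print(prices)
```

In this toy frame the first four rows have no value for the 4-row indicators, because there are not yet four earlier prices to average.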
#Calculate the mean for the past 5, 30, 365 days
df['day_5'] = df['Close'].rolling(5).mean().shift(1)
df['day_30'] = df['Close'].rolling(30).mean().shift(1)
df['day_365'] = df['Close'].rolling(365).mean().shift(1)

#Calculate the STD for the past 5, 365 days
df['std_5'] = df['Close'].rolling(5).std().shift(1)
df['std_365'] = df['Close'].rolling(365).std().shift(1)

#Calculate the mean volume for the past 5, 365 days
df['day_5_volume'] = df['Volume'].rolling(5).mean().shift(1)
df['day_365_volume'] = df['Volume'].rolling(365).mean().shift(1)

#Calculate the STD of the average volume over the past five days
df['5_volume_std'] = df['day_5_volume'].rolling(5).std().shift(1)
df.head(10)
df.tail(10)

Splitting up the data

Since we’re computing indicators that use historical data, there are some rows where there isn’t enough historical data to generate them. Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don’t have enough historical data to compute all the indicators. We’ll need to remove these rows before we split the data. Note that rolling(365) counts rows (trading days), not calendar days, so some rows after 1951-01-03 can still contain missing values; we drop those as well.
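
A small sketch of why the earliest rows end up incomplete: with a shifted rolling window, the first `window` rows have no full history behind them. The frame `toy` and the window of 3 are hypothetical stand-ins for the real data and the 365-row window.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the price data: 10 consecutive rows.
toy = pd.DataFrame({"Close": np.arange(1.0, 11.0)})

window = 3  # stands in for the 365-row window used above
toy["avg"] = toy["Close"].rolling(window).mean().shift(1)

# The first `window` rows have no complete history behind them.
n_missing = int(toy["avg"].isna().sum())

# dropna removes exactly those rows, whatever dates they fall on.
complete = toy.dropna(axis=0)

print(n_missing, len(complete))
```

Because the window is counted in rows, checking for missing values directly is a more reliable way to find incomplete rows than filtering on a calendar date alone.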

df = df[df['Date'] > datetime(year = 1951, month = 1, day = 3)]
df.head()
df.isnull().sum()
df = df.dropna(axis = 0)
df.isnull().sum()
train = df[df['Date'] < datetime(year = 2013, month = 1, day = 1)]
test = df[df['Date'] >= datetime(year = 2013, month = 1, day = 1)]
train.shape
(15486, 16)
test.shape
(739, 16)
train.columns
Index(['index', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'day_5', 'day_30', 'day_365', 'std_5', 'std_365', 'day_5_volume', 'day_365_volume', '5_volume_std'], dtype='object')

Making predictions

Now, we can define an error metric, train a model using the train data, and make predictions on the test data.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ['day_5', 'day_30', 'day_365', 'std_5', 'std_365', 'day_5_volume',
'day_365_volume', '5_volume_std']
target = train['Close']

lr = LinearRegression()
lr.fit(train[features],target)
predictions = lr.predict(test[features])
mse = mean_squared_error(test['Close'], predictions)
mse
494.6605406939939
train_1 = df.iloc[:-1]
test_1 = df.iloc[-1:]

lr.fit(train_1[features],train_1['Close'])
predictions_1 = lr.predict(test_1[features])
mse_1 = mean_squared_error(test_1['Close'], predictions_1)
mse_1
9.629910554420343
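
The MSE values above are in squared price units, which makes them hard to read. One common way to interpret them is to take the square root (RMSE), which is in the same units as Close. The sketch below only reuses the two numbers printed above; the variable names are assumptions for illustration.

```python
import math

# MSE values copied from the outputs above.
mse_full_test = 494.6605406939939  # model evaluated on rows from 2013 onward
mse_last_row = 9.629910554420343   # model evaluated on the final row only

# RMSE is the square root of MSE, so it is in the same units as Close.
rmse_full_test = math.sqrt(mse_full_test)
rmse_last_row = math.sqrt(mse_last_row)

print(round(rmse_full_test, 2), round(rmse_last_row, 2))
```

Keep in mind that the second figure comes from a single predicted row, so it is one squared error rather than an average over many days.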
  • Accuracy would likely improve greatly by making predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02 and make predictions for 2013-01-03, then train another model using data from 1951-01-03 to 2013-01-03 and make predictions for 2013-01-04, and so on. This more closely simulates what we’d do if we were trading using the algorithm.
  • We can also try to improve the algorithm itself. Try other techniques, like a random forest, and see if they perform better.
  • We can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before, and the amount of Twitter activity around certain stocks.
  • We can also make the system real-time by writing an automated script to download the latest data when the market closes and make predictions for the next day.
  • Finally, we can make the system “higher-resolution”. We’re currently making daily predictions, but we could make hourly, minute-by-minute, or second-by-second predictions. This will require obtaining more data, though. We could also make predictions for individual stocks instead of the S&P 500.
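
The one-day-ahead idea from the first point can be sketched as a walk-forward loop: refit on all rows before a given day, predict that day only, then move forward one row. This is a minimal sketch on synthetic random-walk prices, not the S&P 500 data; the frame `data`, the single `day_5` feature, and `n_test` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic random-walk prices, used only for illustration.
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 120))
data = pd.DataFrame({"Close": close})

# One lagged indicator, shifted so it only uses earlier rows.
data["day_5"] = data["Close"].rolling(5).mean().shift(1)
data = data.dropna().reset_index(drop=True)

features = ["day_5"]
n_test = 20  # hypothetical number of final rows to predict one at a time
start = len(data) - n_test

predictions = []
for i in range(start, len(data)):
    # Train on every row before i, then predict row i only.
    history = data.iloc[:i]
    model = LinearRegression()
    model.fit(history[features], history["Close"])
    predictions.append(model.predict(data.iloc[[i]][features])[0])

actual = data["Close"].iloc[start:].to_numpy()
walk_forward_mse = float(np.mean((actual - np.array(predictions)) ** 2))
print(walk_forward_mse)
```

Refitting on every step is slow for a long test period, so in practice one might refit less often; that trade-off is a design choice rather than something the original analysis specifies.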

Bibek Shah Shankhar


I post articles on Data Science | Machine Learning | Deep Learning. Connect with me on LinkedIn: https://www.linkedin.com/in/bibek-shah-shankhar/