# Predicting the Stock Market

• Date — The date of the record.
• Open — The opening price of the day (when trading starts).
• High — The highest trade price during the day.
• Low — The lowest trade price during the day.
• Close — The closing price for the day (when trading is finished).
• Volume — The number of shares traded.
• Adj Close — The daily closing price, adjusted retroactively to include any corporate actions.
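```python
import pandas as pd
from datetime import datetime

# Load the dataset and preview the first rows
df = pd.read_csv('sphist.csv')
df.head()

# Parse the Date column so rows can be compared against datetime values
df['Date'] = pd.to_datetime(df['Date'])

# Example filter: rows after 2015-04-01
df[df['Date'] > datetime(year=2015, month=4, day=1)]

# Sort oldest first; reset_index() keeps the old index as an 'index' column
df = df.sort_values('Date').reset_index()
df.head()

df.info()
```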

# Generating indicators

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and we can’t treat them as such.

• The average price from the past `5` days.
• The average price for the past `30` days.
• The average price for the past `365` days.
• The ratio between the average price for the past `5` days, and the average price for the past `365` days.
• The standard deviation of the price over the past `5` days.
• The standard deviation of the price over the past `365` days.
• The ratio between the standard deviation for the past `5` days, and the standard deviation for the past `365` days.
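
The two ratio indicators in the list can be derived from the rolling means and standard deviations. The following is a minimal sketch on synthetic prices, using short stand-in windows (2 and 4 rows instead of 5 and 365) so the result is easy to inspect. The column names `mean_ratio` and `std_ratio` are illustrative assumptions, not names from the original analysis.

```python
import pandas as pd

# Synthetic closing prices (made up for illustration only).
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

short_window, long_window = 2, 4  # stand-ins for 5 and 365

# Rolling statistics over previous rows only: shift(1) excludes the current row.
short_mean = toy["Close"].rolling(short_window).mean().shift(1)
long_mean = toy["Close"].rolling(long_window).mean().shift(1)
short_std = toy["Close"].rolling(short_window).std().shift(1)
long_std = toy["Close"].rolling(long_window).std().shift(1)

# Ratios compare recent behaviour with the longer-term baseline.
toy["mean_ratio"] = short_mean / long_mean
toy["std_ratio"] = short_std / long_std
print(toy)
```

A mean ratio above 1 indicates that the recent average sits above the longer-term average; the first rows are NaN because they lack a full long window.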
```python
# Calculate the mean close for the past 5, 30, and 365 days
df['day_5'] = df['Close'].rolling(5).mean().shift(1)
df['day_30'] = df['Close'].rolling(30).mean().shift(1)
df['day_365'] = df['Close'].rolling(365).mean().shift(1)

# Calculate the STD of the close for the past 5 and 365 days
df['std_5'] = df['Close'].rolling(5).std().shift(1)
df['std_365'] = df['Close'].rolling(365).std().shift(1)

# Calculate the mean volume for the past 5 and 365 days
df['day_5_volume'] = df['Volume'].rolling(5).mean().shift(1)
df['day_365_volume'] = df['Volume'].rolling(365).mean().shift(1)

# Calculate the STD of the average volume over the past five days
df['5_volume_std'] = df['day_5_volume'].rolling(5).std().shift(1)

df.head(10)
df.tail(10)
```
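
The `.shift(1)` call in the indicator code matters: without it, each rolling window would include the current day's close, which is the value being predicted. Here is a small sketch on synthetic data (the series and the 3-row window are assumptions chosen for illustration):

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="Close")

# Window ends at the current row, so the current close leaks into the feature.
leaky = close.rolling(3).mean()

# Shifting by one row makes each value depend only on the three previous rows.
safe = close.rolling(3).mean().shift(1)

print(pd.DataFrame({"Close": close, "leaky": leaky, "safe": safe}))
```

At row 3, the unshifted mean averages rows 1 to 3 (including that row's own close), while the shifted version averages rows 0 to 2.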

# Splitting up the data

Since we’re computing indicators that use historical data, there are some rows where there isn’t enough historical data to generate them. Some of the indicators use `365` days of historical data, and the dataset starts on `1950-01-03`. Thus, any rows that fall before `1951-01-03` don't have enough historical data to compute all the indicators. We’ll need to remove these rows before we split the data.
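
One caveat worth checking: in pandas, `rolling(365)` counts rows rather than calendar days, and markets are closed on weekends and holidays, so 365 rows span more than one calendar year. The sketch below counts how many leading rows lack a full window, which suggests why a date filter alone may leave some NaN values for a later `dropna` call to remove. The helper name `rows_without_history` is hypothetical.

```python
import pandas as pd


def rows_without_history(n_rows: int, window: int) -> int:
    """Count rows whose shifted rolling mean is NaN for a given window."""
    values = pd.Series(range(n_rows), dtype="float64")
    indicator = values.rolling(window).mean().shift(1)
    return int(indicator.isna().sum())


# With a 365-row window, the first 365 rows have no complete history.
print(rows_without_history(n_rows=400, window=365))  # prints 365
```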

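```python
# Keep only rows after 1951-01-03
df = df[df['Date'] > datetime(year=1951, month=1, day=3)]
df.head()

# Check for remaining missing values, drop those rows, and confirm none are left
df.isnull().sum()
df = df.dropna(axis=0)
df.isnull().sum()
```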
```python
# Train on everything before 2013; test on 2013 onward
train = df[df['Date'] < datetime(year=2013, month=1, day=1)]
test = df[df['Date'] >= datetime(year=2013, month=1, day=1)]

train.shape  # (15486, 16)
test.shape   # (739, 16)

train.columns
# Index(['index', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close',
#        'day_5', 'day_30', 'day_365', 'std_5', 'std_365', 'day_5_volume',
#        'day_365_volume', '5_volume_std'],
#       dtype='object')
```

# Making predictions

Now, we can define an error metric, train a model using the train data, and make predictions on the test data.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ['day_5', 'day_30', 'day_365', 'std_5', 'std_365',
            'day_5_volume', 'day_365_volume', '5_volume_std']
target = train['Close']

# Fit on the training period and evaluate on the test period
lr = LinearRegression()
lr.fit(train[features], target)
predictions = lr.predict(test[features])

mse = mean_squared_error(test['Close'], predictions)
mse  # 494.6605406939939
```
```python
# Train on every row except the last, then predict only the final row
train_1 = df.iloc[:-1]
test_1 = df.iloc[-1:]

lr.fit(train_1[features], train_1['Close'])
predictions_1 = lr.predict(test_1[features])

mse_1 = mean_squared_error(test_1['Close'], predictions_1)
mse_1  # 9.629910554420343
```
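
For reference, mean squared error (MSE) is the average of the squared differences between actual and predicted values. The following check on made-up numbers compares a manual calculation with scikit-learn's `mean_squared_error`; the arrays are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted closing prices.
actual = np.array([100.0, 102.0, 101.0])
predicted = np.array([101.0, 100.0, 101.0])

# Squared errors are 1, 4, and 0, so the mean is 5 / 3.
manual_mse = np.mean((actual - predicted) ** 2)
sklearn_mse = mean_squared_error(actual, predicted)

print(manual_mse, sklearn_mse)
```

Because MSE is expressed in squared price units, its square root (RMSE) is often easier to read: the test MSE of about 494.66 corresponds to an error of roughly 22.2 index points, and the single-row MSE of about 9.63 to roughly 3.1 points.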
Several directions could improve on this baseline:

• Accuracy would likely improve greatly by making predictions only one day ahead. For example, train a model on data from `1951-01-03` to `2013-01-02` and predict `2013-01-03`; then train another model on data from `1951-01-03` to `2013-01-03` and predict `2013-01-04`, and so on. This more closely simulates what we’d do if we were trading using the algorithm.
• We can also improve the algorithm significantly. Try other techniques, such as a random forest, and see whether they perform better.
• We can also incorporate outside data, such as the previous day’s weather in New York City (where most trading happens) and the amount of Twitter activity around certain stocks.
• We can also make the system real-time by writing an automated script that downloads the latest data when the market closes and makes predictions for the next day.
• Finally, we can make the system “higher-resolution”. We’re currently making daily predictions, but we could make hourly, minute-by-minute, or second-by-second predictions. This would require obtaining more data, though. We could also make predictions for individual stocks instead of the S&P 500.
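
The one-day-ahead idea from the first bullet can be sketched as a walk-forward loop: for each test row, fit on all earlier rows and predict only that row. This is a sketch under stated assumptions, not the original analysis: the function name `walk_forward_predict` and its `make_model` parameter are hypothetical, and the demo uses synthetic random-walk data rather than `sphist.csv`. Passing a different factory (for example, one returning scikit-learn's `RandomForestRegressor`) is one way to try the other models mentioned.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def walk_forward_predict(df, features, target, start, make_model=LinearRegression):
    """Refit on all rows before each position >= start and predict that one row."""
    predictions = []
    for i in range(start, len(df)):
        history = df.iloc[:i]        # strictly earlier rows only
        current = df.iloc[i:i + 1]   # the single row to predict
        model = make_model()
        model.fit(history[features], history[target])
        predictions.append(model.predict(current[features])[0])
    return pd.Series(predictions, index=df.index[start:])


# Synthetic random-walk "prices" (illustrative only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 60).cumsum()})
demo["day_5"] = demo["Close"].rolling(5).mean().shift(1)
demo = demo.dropna().reset_index(drop=True)

preds = walk_forward_predict(demo, ["day_5"], "Close", start=40)
print(mean_squared_error(demo["Close"].iloc[40:], preds))
```

Refitting for every test row costs one model fit per prediction, which is cheap for linear regression but may be slow for heavier models.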
