- Project Description
- Solution Architecture (Google Cloud Platform)
- System in Production
- User Guide
- Model Management (MLFlow)
- Orchestration (Prefect 2.0)
- Monitoring (Evidently AI)
- Dashboard
- Makefile and Dockerfile
- Recommendations
- References
- Annexes
- Exploratory Data Analysis
This project aims to develop and deploy a stock price prediction system using an LSTM model, which predicts stock prices with a 10-day lag based on historical OHLC data and technical indicators such as MACD, RSI, SMA, and EMA. By providing accurate predictions, the system empowers traders to make well-informed decisions, optimize investment strategies, and minimize risks while capitalizing on market fluctuations.
For this project, a cloud-based approach was chosen. By integrating Google Cloud Platform (GCP) and MLflow, the efficiency and scalability of the system are significantly enhanced. Deploying the LSTM model on GCP also enables seamless batch processing of large datasets, ensuring timely predictions. Additionally, MLflow's model tracking and versioning features facilitate efficient monitoring and maintenance of the model's performance over time. Furthermore, the Prefect framework orchestrates the workflow, automating data processing, scheduling, and error handling, resulting in a streamlined and reliable end-to-end solution.
To sum up, this stock price prediction system addresses the specific needs of traders and investors by providing accurate predictions, timely insights, and efficient model management through cloud technology. With this solution, traders can stay ahead in the fast-paced financial markets, make informed decisions, and enhance their overall financial outcomes.
The Introduction, Motivation, Objectives, and Planning Schema are explained in the paper here
Important: Linux (Ubuntu 22.04 LTS) is the recommended development environment.
The LSTM model was deployed as a batch workload using GCP tools such as:
- Cloud Storage
- Cloud Scheduler/Trigger
- Cloud Functions
- BigQuery
- Google Looker Studio (previously known as Data Studio)
As shown above, the workflow solution implemented in Google Cloud Platform (GCP) leverages serverless batch processing with Cloud Storage, Cloud Functions, BigQuery, and Google Looker Studio to predict stock prices based on historical OHLC (Open, High, Low, Close) data.
The process begins when users upload a CSV file containing OHLC price data for their desired stock to Google Cloud Storage. As soon as the file is uploaded, a Cloud Function is triggered to execute the LSTM (Long Short-Term Memory) model, which forecasts the stock's prices for the next 10 days (see the code below).
# This function runs as a Google Cloud Function inside GCP
import pickle

import numpy as np
import pandas as pd
from google.cloud import bigquery, storage


def predict_stock_prices(gs_url: str, bucket_name: str) -> np.ndarray:
    # Load the LSTM model from Google Cloud Storage
    gcs_client = storage.Client()
    bucket = gcs_client.get_bucket(bucket_name)
    blob = bucket.blob("lstm_model.pkl")

    # Read the uploaded OHLC CSV and keep the last 90 closing prices
    data = pd.read_csv(gs_url, sep=",")
    df = data[['Date', 'Close', 'Symbol']]
    symbol = df['Symbol'].iloc[0]
    df = df.set_index('Date')
    df = df.tail(90)

    with blob.open(mode="rb") as f:
        lstm_model = pickle.load(f)

    # Reshape the input data into (samples, time steps, features)
    num_time_steps = 10  # 10 time steps per sample
    num_features = 1     # 1 feature (Close price)
    n = int(len(df) / num_time_steps)
    subset_df = df.tail(n)
    X = np.reshape(df['Close'].values, (len(subset_df), num_time_steps, num_features))
    # Repeat the Close column along the feature axis to match the model's expected input width
    X = np.repeat(X, 9, axis=-1)

    # Perform predictions using the LSTM model
    predictions = lstm_model.predict(X)
    # Rescale the normalized outputs back to the price range of the input window
    min_value = np.min(df['Close'])
    max_value = np.max(df['Close'])
    predictions = predictions[1]
    predictions = predictions * (max_value - min_value) + min_value
    y_predicted = np.squeeze(predictions)

    # Build a date index for the forecast horizon (days after the last known date)
    start_date = df.index[-1]
    dates = pd.date_range(start=pd.to_datetime(start_date) + pd.DateOffset(days=1),
                          periods=len(y_predicted),
                          freq='D')
    new_data = pd.DataFrame({'Close': y_predicted, 'Symbol': symbol}, index=dates)
    data_result = pd.concat([df, new_data])
    data_result.index = pd.to_datetime(data_result.index)
    data_result.index = data_result.index.tz_localize(None)
    print(data_result)

    # Save the predictions to BigQuery
    bq_client = bigquery.Client()
    dataset_id = 'stock_output'
    table_id = 'predicted_prices'
    table_ref = bq_client.dataset(dataset_id).table(table_id)
    table_string = f"{table_ref.project}.{table_ref.dataset_id}.{table_ref.table_id}"
    data_result = data_result.reset_index(drop=False)
    data_result.to_gbq(destination_table=table_string,
                       project_id='ambient-decoder-391319',
                       if_exists='replace')
    return predictions
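For context, the prediction function above needs an entry point that Cloud Functions invokes when a CSV lands in the bucket. The sketch below shows one possible wiring, assuming a 1st-gen background function triggered by a GCS finalize event; the entry-point name and the CSV filter are illustrative assumptions, not the project's exact configuration.

# Hypothetical Cloud Functions entry point (1st-gen GCS trigger); names are illustrative.
def on_csv_upload(event, context):
    """Triggered when an object is finalized in the input bucket."""
    bucket_name = event["bucket"]   # bucket that received the upload
    file_name = event["name"]       # name of the uploaded object
    if not file_name.endswith(".csv"):
        return  # ignore anything that is not an OHLC CSV
    gs_url = f"gs://{bucket_name}/{file_name}"
    # Delegate to the prediction function defined above
    predict_stock_prices(gs_url=gs_url, bucket_name=bucket_name)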
The results of the LSTM model are then stored in a dedicated BigQuery table named predicted_prices. The data in this table feeds into a Looker Studio dashboard, where users can visualize and explore the forecasted stock prices in a user-friendly and intuitive manner.
The integration of Cloud Functions, BigQuery, and Looker Studio streamlines the entire process, making it a powerful and efficient tool for stock market analysis and decision-making.
Important: Before following this guide, ensure you have been authorized to connect to the GCP project.
To install the app, you need Anaconda, Docker, and Docker Compose installed on your system. A complete installation guide is available at this link
NOTICE: After installing Anaconda, the Google Cloud SDK must also be installed on your system. Learn more here
If you are dealing with issues in the Anaconda environment, you may find some of these tips useful
To download the Linux 64-bit archive file, at the command line, run:
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-440.0.0-linux-x86_64.tar.gz
Once the Google Cloud SDK has been installed, your account must be authenticated (previously authorized by an admin). To do so, run this at the command line:
gcloud auth application-default login
git clone https://github.com/jblanco89/MLOps_zoomcamp_course.git
The bash file setup.sh
must be made executable; run the following commands:
cd ~/MLOps_zoomcamp_course
sudo chmod +x setup.sh
Finally, run the bash file:
sudo bash setup.sh
1. Go to the main repo directory:
cd ~/MLOps_zoomcamp_course
2. Install dependencies
pip install -r requirements.txt
3. Run the main Python file with ticker and end date arguments.
For example, suppose we want to predict Microsoft stock prices for the 10 days following July 22nd, 2023. We just need to run this command:
python ./src/main.py MSFT 2023-07-22
This system uses the Yahoo Finance API to get stock price data (a sketch of this download step is shown after the examples below). Check the list of supported tickers:
Some examples:
python src/main.py META 2023-07-25 #Meta
python src/main.py MMM 2023-07-25 #3M
python src/main.py ETH-USD 2023-07-25 #Ethereum
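For reference, here is a minimal sketch of how the ticker and end-date arguments can drive a Yahoo Finance download. It uses the yfinance package, and the function names, lookback window, and output path are illustrative assumptions rather than the exact implementation in src/main.py.

# Illustrative sketch only; the actual src/main.py may differ.
import sys
from datetime import datetime, timedelta

import pandas as pd
import yfinance as yf


def download_ohlc(ticker: str, end_date: str, lookback_days: int = 365) -> pd.DataFrame:
    """Download daily OHLC data for `ticker` up to `end_date` (YYYY-MM-DD)."""
    end = datetime.strptime(end_date, "%Y-%m-%d")
    start = end - timedelta(days=lookback_days)
    data = yf.download(ticker, start=start, end=end, interval="1d")
    data["Symbol"] = ticker
    return data


if __name__ == "__main__":
    ticker, end_date = sys.argv[1], sys.argv[2]   # e.g. MSFT 2023-07-22
    df = download_ohlc(ticker, end_date)
    df.to_csv(f"./data/{ticker}_ohlc.csv")        # later uploaded to Cloud Storage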
4. Check the results in the dashboard
After a couple of seconds, the data should have been processed by the Google Cloud Function. The dashboard is available here
Do not forget to update the data by clicking here:
On a Virtual Machine instance in Google Cloud Platform, you should use a command like this:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://postgres:password@sql_private_ip:5432/mlflow --default-artifact-root gs://storage_bucket_name
Experiment tracking results after 152 simulation runs:
MLflow was deployed in GCP (scenario 5 according to the MLflow documentation). You may check MLflow experiment tracking and the model registry by clicking this Link.
Once the best model has been identified, we can register and tag it in the MLflow Model Registry:
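This registration step can also be done programmatically with the MLflow client, as in the minimal sketch below; the run ID, model name, tag values, and tracking URI are placeholders, not the project's actual values.

# Sketch only: run ID, model name, and tracking URI are placeholders.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://<mlflow_server_ip>:5000")  # MLflow server on the GCP VM

run_id = "<best_run_id>"                 # taken from the experiment tracking UI
model_uri = f"runs:/{run_id}/model"

# Register the best run's model in the Model Registry
result = mlflow.register_model(model_uri=model_uri, name="lstm_stock_model")

# Tag the new version and promote it to Production
client = MlflowClient()
client.set_model_version_tag(name="lstm_stock_model",
                             version=result.version,
                             key="validation_status",
                             value="approved")
client.transition_model_version_stage(name="lstm_stock_model",
                                      version=result.version,
                                      stage="Production")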
Orchestration management has been set up in Prefect Cloud. To authenticate, run this command on your local or virtual machine:
prefect cloud login
After the best model was chosen with MLflow, we built the workflow using Prefect 2.0. Here is the result:
Run the agent with the default work queue:
prefect agent start --pool agent --work-queue default
If the deployed workflow has a scheduled run, the agent detects it and runs the workflow automatically. In this system, orchestration is scheduled once a day.
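A minimal sketch of what such a daily-scheduled Prefect 2.0 deployment can look like is shown below; the task, flow, and deployment names are illustrative assumptions, and the schedule import path may differ between Prefect 2.x minor versions.

# Illustrative Prefect 2.0 flow and daily deployment; names are assumptions.
from prefect import flow, task
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule  # path may vary by Prefect 2.x version


@task
def load_data(ticker: str, end_date: str):
    ...  # pull OHLC data, e.g. from Yahoo Finance


@task
def run_prediction(data):
    ...  # score the registered LSTM model and push results downstream


@flow(name="stock-prediction-flow")
def stock_prediction_flow(ticker: str = "MSFT", end_date: str = "2023-07-22"):
    data = load_data(ticker, end_date)
    run_prediction(data)


if __name__ == "__main__":
    # Build and apply a deployment that the agent above can pick up once a day
    deployment = Deployment.build_from_flow(
        flow=stock_prediction_flow,
        name="daily-stock-prediction",
        work_queue_name="default",
        schedule=CronSchedule(cron="0 8 * * *", timezone="UTC"),
    )
    deployment.apply()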
The workflow generates a data drift JSON report (DataDriftPreset) using the Evidently library.
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset


def generate_report(Y_pred, Y_test):
    """Build an Evidently data drift report comparing predictions against test data."""
    report = Report(metrics=[DataDriftPreset()])
    # Reference data: observed Close prices; current data: model predictions
    Y_pred_flattened = np.ravel(Y_pred)
    current = pd.DataFrame({'Close': Y_pred_flattened}).reset_index(drop=True)
    reference = pd.DataFrame({'Close': Y_test}).reset_index(drop=True)
    report.run(reference_data=reference, current_data=current)
    # Persist the report as JSON so the dashboard can read the drift score
    return report.save_json("./reports/dataReport.json")
As a result, we can check the drift score every day in the dashboard.
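A minimal sketch of how the drift score could be pulled out of that JSON file is shown below; the exact key names depend on the Evidently version, so treat them as assumptions.

# Sketch: read the saved Evidently report and print the dataset-level drift numbers.
import json

with open("./reports/dataReport.json") as f:
    report_json = json.load(f)

# With DataDriftPreset, one of the metrics is the dataset-level drift summary;
# the keys below are typical for recent Evidently versions but may differ.
for metric in report_json.get("metrics", []):
    result = metric.get("result", {})
    if "share_of_drifted_columns" in result:
        print("Dataset drift detected:", result.get("dataset_drift"))
        print("Share of drifted columns:", result.get("share_of_drifted_columns"))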
Once stock prices have been uploaded to the BigQuery table, technical indicators are calculated using SQL:
WITH price_data AS (
SELECT
Symbol,
index,
Close,
Close - LAG(Close) OVER (ORDER BY index) AS price_diff
FROM `ambient-decoder-391319.stock_output.predicted_prices`
),
rsi_data AS (
SELECT
Symbol,
index,
Close,
CASE WHEN price_diff > 0 THEN price_diff ELSE 0 END AS gain,
CASE WHEN price_diff < 0 THEN ABS(price_diff) ELSE 0 END AS loss
FROM price_data
),
histogram_data AS (
SELECT
Symbol,
index,
Close,
gain,
loss,
NTILE(5) OVER (ORDER BY Close) AS bucket_number
FROM rsi_data
)
SELECT
Symbol,
index,
Close,
CASE
WHEN avg_gain IS NULL OR avg_loss IS NULL THEN NULL
ELSE 100 - (100 / (1 + (NULLIF(avg_gain, 0) / NULLIF(avg_loss, 0))))
END AS RSI_14_periods,
-- MACD (approximated as the difference of 12- and 26-period averages)
ema_12 - ema_26 AS MACD_5_periods,
-- SMA (5 periods)
AVG(Close) OVER (ORDER BY index ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS SMA_5_periods,
-- SMA (15 periods)
AVG(Close) OVER (ORDER BY index ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS SMA_15_periods,
bucket_number,
COUNT(*) OVER (PARTITION BY bucket_number) AS bucket_count
FROM (
SELECT
Symbol,
index,
Close,
AVG(Close) OVER (ORDER BY index ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS ema_12,
AVG(Close) OVER (ORDER BY index ROWS BETWEEN 25 PRECEDING AND CURRENT ROW) AS ema_26,
AVG(gain) OVER (ORDER BY index ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_gain,
AVG(loss) OVER (ORDER BY index ROWS BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_loss,
bucket_number
FROM histogram_data
)
ORDER BY index DESC;
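The query above can also be run programmatically, for example with the official BigQuery Python client. The sketch below is illustrative only; the SQL file path and the destination table name are assumptions rather than the project's actual setup.

# Illustrative sketch: run the indicator SQL and materialize the result in a table.
from google.cloud import bigquery

client = bigquery.Client(project="ambient-decoder-391319")

with open("sql/technical_indicators.sql") as f:   # hypothetical path to the query above
    query = f.read()

job_config = bigquery.QueryJobConfig(
    destination="ambient-decoder-391319.stock_output.technical_indicators",  # assumed table
    write_disposition="WRITE_TRUNCATE",
)
rows = client.query(query, job_config=job_config).result()
print(f"Indicator rows written: {rows.total_rows}")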
Using BigQuery, we get a data table like this:
As a result, we'll be able to see the following dashboard:
# This Dockerfile sets up a Python 3.8 environment,
# installs necessary dependencies (libpq-dev), and
# installs MLflow along with its required packages.
# It then copies the local files into the container's working directory.
# Finally, it runs the MLflow server with specific configurations,
# using PostgreSQL as the backend store and Google Cloud Storage for
# artifact storage.
# syntax=docker/dockerfile:1
FROM python:3.8-slim-buster
RUN apt-get update && apt-get install -y libpq-dev
WORKDIR /MLOps_zoomcamp_course
RUN python -m ensurepip --default-pip && pip install --no-cache-dir --upgrade pip
RUN pip install psycopg2-binary
RUN pip install mlflow
COPY . .
CMD ["mlflow", "server", "-h", "0.0.0.0", "-p", "5000", "--backend-store-uri", "postgresql://postgres:1234@10.28.192.5:5432/mlflow", "--default-artifact-root", "gs://lstm_model_test"]
# This Makefile defines rules for building and running a Docker container
# for an LSTM application. It includes commands to create a data directory,
# set its permissions, build a Docker image, and run the Docker container.
# The .PHONY target ensures these rules are always executed,
# regardless of existing files with the same names as the targets.
DOCKER_IMAGE_NAME = lstm_app_image
CURRENT_DIR := $(shell pwd)
DATA_DIR = data
# Create the directory if it does not exist
create_data_directory:
	@if [ ! -d "$(DATA_DIR)" ]; then \
		mkdir -p "$(DATA_DIR)"; \
	fi

set_data_directory_permissions:
	chmod -R 777 "$(DATA_DIR)"

build_docker_image:
	docker build -t $(DOCKER_IMAGE_NAME) .

run_docker_container:
	docker run -p 5000:5000 $(DOCKER_IMAGE_NAME)
.PHONY: create_data_directory set_data_directory_permissions build_docker_image run_docker_container
Alternatively, the process could be made much more robust with the following workload:
- Alla, S., & Adari, S. K. (2021). What Is MLOps? In: Beginning MLOps with MLFlow. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-6549-9_3
- Bhandari, H. N., Rimal, B., Pokhrel, N. R., Rimal, R., Dahal, K. R., & Khatri, R. K. (2022). Predicting stock market index using LSTM. Machine Learning with Applications, 9, 100320. https://doi.org/10.1016/j.mlwa.2022.100320
- Moghar, A., & Hamiche, M. (2020). Stock market prediction using LSTM recurrent neural network. Procedia Computer Science, 170, 1168-1173. https://doi.org/10.1016/j.procs.2020.03.049
- Ghosh, A., Bose, S., Maji, G., Debnath, N., & Sen, S. (2019, September). Stock price prediction using LSTM on the Indian share market. In Proceedings of the 32nd International Conference (Vol. 63, pp. 101-110). https://doi.org/10.29007/qgcz
- Machine Learning to Predict Stock Prices: Utilizing a Keras LSTM model to forecast stock trends (2019). Artificial Intelligence in Finance. https://towardsdatascience.com/predicting-stock-prices-using-a-keras-lstm-model-4225457f0233
- Stock Market Predictions with LSTM in Python (2020). DataCamp Tutorial. https://www.datacamp.com/tutorial/lstm-python-stock-market
- Run Prefect on Google Cloud Platform (2022). https://medium.com/@mariusz_kujawski/run-prefect-on-google-cloud-platform-7cc9f801d454
- Running a serverless batch workload on GCP with Cloud Scheduler, Cloud Functions, and Compute Engine. https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25
Our approach is mainly based on the LSTM architecture. The model is initialized as a sequential stack of layers. In this case, an LSTM layer with 200 units (or neurons) is added, ensuring that the output is returned for each timestep in the input sequence.
A dropout layer with a dropout rate of 0.2 is added to prevent overfitting. Lastly, a dense layer with a single neuron and the ReLU activation function is added.
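A minimal sketch of this architecture, assuming the Keras Sequential API, is shown below; the input shape, optimizer, and loss are assumptions based on the rest of the document, not an exact copy of the training code.

# Illustrative Keras sketch of the architecture described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

num_time_steps = 10   # 10-day input window
num_features = 9      # Close price plus technical indicators (assumed feature count)

model = Sequential([
    # LSTM layer with 200 units, returning the output for every timestep
    LSTM(200, return_sequences=True, input_shape=(num_time_steps, num_features)),
    # Dropout (rate 0.2) to reduce overfitting
    Dropout(0.2),
    # Single-neuron dense output with the ReLU activation function
    Dense(1, activation='relu'),
])
model.compile(optimizer='adam', loss='mean_squared_error')  # optimizer/loss assumed
model.summary()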
- RMSE vs Epochs
- RMSE vs Hidden Units
- RMSE vs Learning Rate
- Hidden Units vs Learning Rate
Here is how the Cloud Function looks in GCP: