kbzh2558/Social_Speculation_for_Harnessing_Reddit_to_Forecast_Bitcoin_Fluctuations

📝🔗 Social Speculation: Harnessing Reddit to Forecast Bitcoin Fluctuations

⚡ETL, Sentiment Analysis, Multi-Layer Perceptron Modelling, and Dropout Regularization⚡

Note

This project was conducted by Kaibo Zhang, a student at the Desautels Faculty of Management at McGill University. The study used publicly available data from the Reddit and Alpha Vantage APIs and was supervised as part of the course INSY-336-001 DataHandl&Coding for Analytics. The API links dataset can be found here.

Overview

This project investigates the role of Reddit sentiment as an indicator for predicting Bitcoin (BTC) price fluctuations. Recognizing the complex, sentiment-driven nature of cryptocurrency, we explore models that leverage sentiment analysis on Reddit comments to forecast BTC's closing price. Starting from a baseline regression, we refine our model by employing a multi-layer perceptron (MLP) with dropout regularization, achieving a solution that balances complexity and generalization.

Repo Structure

Social_Speculation/
├── README.md
├── ETL_workflow.ipynb           # Data extraction, transformation, and loading scripts
├── Model_prediction.ipynb       # Training and evaluation of MLP model with dropout regularization
├── credentials.py               # Access credentials for API access
├── linear_model.py              # Baseline linear regression model implementation
├── mlp_model.py                 # One-layered MLP implementation
├── mlp.py                       # Two-layered MLP implementation
├── mlp_dropout.py               # Two-layered MLP implementation with dropout regularization
├── best_model.pth               # Saved state of the best-performing model during early stopping
├── mlp_trained_model.pth        # Saved state of the final trained MLP model
├── reddit.db                    # Database of Reddit comments and BTC data
├── report.md                    # Detailed report
└── image.png                    # Visual depiction of model structure or results

Step-by-Step Breakdown

  1. Data Collection and ETL Process
    • Data for this study was obtained from Reddit, focusing on BTC-related posts and comments. Sentiment analysis was then applied to quantify public opinion.
    • Key preprocessing steps included:
      • Filtering and structuring Reddit data to ensure relevance and consistency.
      • Calculating sentiment polarity scores for each comment to assess public sentiment on BTC.

    NOTE: This process streamlined data preparation for subsequent analysis, ensuring that only relevant, clean data entered the modeling pipeline.
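The filter, score, and load steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in ETL_workflow.ipynb: the toy word lists, the `comments` table schema, and the helper names `is_relevant`, `polarity`, and `load_comments` are all hypothetical, and a real pipeline would likely use a full sentiment model rather than a hand-written lexicon.

```python
import re
import sqlite3

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"bullish", "moon", "gain", "buy", "great"}
NEGATIVE = {"bearish", "crash", "loss", "sell", "scam"}
BTC_PATTERN = re.compile(r"\b(btc|bitcoin)\b", re.IGNORECASE)


def is_relevant(text):
    """Keep only comments that mention BTC or Bitcoin."""
    return bool(BTC_PATTERN.search(text))


def polarity(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def load_comments(conn, comments):
    """Filter, score, and insert (timestamp, body) pairs; return rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(created_utc INTEGER, body TEXT, polarity REAL)"
    )
    rows = [(ts, body, polarity(body)) for ts, body in comments if is_relevant(body)]
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up comments; an in-memory database stands in for reddit.db.
raw = [
    (1700000000, "Bitcoin is going to the moon, time to buy"),
    (1700000060, "My cat is great"),
    (1700000120, "BTC looks bearish, expect a crash"),
]
conn = sqlite3.connect(":memory:")
inserted = load_comments(conn, raw)
```

Only the two BTC-related comments reach the table, each stored alongside its polarity score, so downstream modeling reads from a single clean source.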

  2. Baseline Model: Simple Regression
    • A foundational regression model was implemented to map sentiment scores to BTC's closing price directly.

    • This model assumed an immediate impact of sentiment on price, offering a straightforward but limited approach, primarily useful as a benchmark against more complex models.

    • Challenges: This approach showed limitations in handling intricate relationships and temporal dependencies.
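As a rough sketch of what such a baseline looks like, the snippet below fits a one-feature ordinary least squares line from sentiment to closing price. It is not the code in linear_model.py; the function names and the numbers are made up for illustration.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to each sentiment value."""
    return [slope * xi + intercept for xi in x]


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted prices."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Made-up values, not project data.
sentiment = [-0.5, 0.0, 0.5, 1.0]
close = [29000.0, 30000.0, 31000.0, 32000.0]

slope, intercept = fit_line(sentiment, close)
error = mse(close, predict(sentiment, slope, intercept))
```

Because each prediction depends only on the sentiment value at the same time step, a model of this form cannot represent lagged or nonlinear effects, which is the limitation noted above.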

  3. Transition to Multi-Layer Perceptron (MLP) with Dropout
    • Because a single-layer MLP underfit the data, a two-layer MLP with dropout regularization was introduced to increase model capacity while mitigating overfitting.

    • Dropout layers were added between hidden layers, randomly deactivating neurons during training to prevent over-reliance on specific nodes, which allowed the model to generalize better.

    • Results: This MLP structure demonstrated improved ability to capture sentiment-related patterns in BTC price fluctuations, balancing model complexity and generalization.
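The dropout mechanism described above can be illustrated without a deep learning framework. The sketch below is not the code in mlp_dropout.py: it uses plain Python lists, hand-picked weights, and hypothetical helper names (`dropout`, `dense`, `relu`, `forward`). It implements inverted dropout, in which surviving activations are rescaled by 1/(1 - p) during training so that no rescaling is needed at evaluation time; PyTorch's `nn.Dropout` follows the same convention.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p while training
    and scale survivors by 1 / (1 - p); act as the identity otherwise."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def relu(values):
    return [max(0.0, v) for v in values]


def dense(values, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def forward(x, params, p, training, rng):
    """Two hidden layers with dropout applied between them."""
    h1 = relu(dense(x, params["w1"], params["b1"]))
    h1 = dropout(h1, p, training, rng)
    h2 = relu(dense(h1, params["w2"], params["b2"]))
    return dense(h2, params["w3"], params["b3"])[0]


# Hand-picked weights, for illustration only.
params = {
    "w1": [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], "b1": [0.1, 0.0, 0.2],
    "w2": [[0.6, -0.1, 0.3], [0.2, 0.7, -0.5]], "b2": [0.0, 0.1],
    "w3": [[1.0, -1.0]], "b3": [0.05],
}
x = [0.5, -0.2]
rng = random.Random(0)

train_out = forward(x, params, p=0.5, training=True, rng=rng)   # stochastic
eval_out = forward(x, params, p=0.5, training=False, rng=rng)   # deterministic
```

In training mode the output varies from call to call because different hidden units are dropped each time; in evaluation mode every unit is active and the output is fixed for a given input.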

  4. Limitations and Potential Improvements
    • Limitations: While the model performed well on short-term price fluctuations, it struggled to capture broader directional trends in BTC’s price due to the lack of temporal awareness in the current structure.

    • Future Work: To address this, a Convolutional Neural Network (CNN) approach for time-series data could be explored to capture both short- and long-term trends in BTC price by treating sentiment scores as one-dimensional sequences.

    • Recommendation: Implementing a CNN could enhance the model's capacity to recognize temporal patterns, improving prediction accuracy for long-term trends.
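To make the idea of treating sentiment scores as a one-dimensional sequence concrete, here is a minimal sketch of the sliding-window operation at the core of a 1-D convolutional layer. It is illustrative only: the kernel is hand-picked rather than learned, the sentiment values are made up, and `conv1d` is a hypothetical helper, not part of this repository.

```python
def conv1d(sequence, kernel):
    """Valid-mode 1-D cross-correlation, the operation CNN layers apply:
    slide the kernel along the sequence and take a weighted sum per window."""
    k = len(kernel)
    if k == 0 or k > len(sequence):
        raise ValueError("kernel must be non-empty and no longer than the sequence")
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


# Made-up daily average sentiment scores.
daily_sentiment = [0.1, 0.2, 0.4, 0.3, -0.1, -0.3]

# Hand-picked kernel: positive when sentiment rises across a 3-day window,
# negative when it falls. A trained CNN would learn its kernels from data.
trend_kernel = [-1.0, 0.0, 1.0]

trend = conv1d(daily_sentiment, trend_kernel)
```

Each output value summarizes a window of consecutive days, so the feature depends on the order of the observations. That ordering information is what a plain MLP fed with one score at a time does not see.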

Key Takeaways

The refined MLP model is a workable approach to capturing short-term, sentiment-driven fluctuations in Bitcoin's price. However, incorporating a CNN architecture in future work could strengthen predictive performance by modeling temporal dependencies, making the approach more useful for forecasting in the volatile cryptocurrency market.

For more details, please refer to report.md for in-depth analyses and modeling insights.

About

Data Extraction and Predictive Modelling on BTC and Reddit Data
