docs(backtesting-sp500): fix some audit questions and some of the readme sentences
MSilva95 committed Nov 27, 2023
1 parent c523775 commit ab5e280
Showing 2 changed files with 12 additions and 8 deletions.
10 changes: 5 additions & 5 deletions subjects/ai/backtesting-sp500/README.md
@@ -2,7 +2,7 @@

## SP500 data preprocessing

The goal of this project is to perform a Backtest on the SP500 constituents. The SP500 is an index the 500 biggest capitalization in the US.
The goal of this project is to perform a Backtest on the SP500 constituents, which represent the 500 largest companies by market capitalization in the United States.

## Data

@@ -12,12 +12,12 @@ The input files are:

- [`stock_prices.csv`](./data/stock_prices.csv): contains the close prices for
all the companies that had been in the SP500. It contains a lot of missing
data.
data.

The adjusted close price may be unavailable for three main reasons:

- The company doesn't exist at date `d`
- The company is not public
- The company is not publicly traded
- Its close price hasn't been reported

_Note: The quality of this data set is not good: some prices are wrong, there are some price spikes, and there are some price adjustments (share split, dividend distribution); the price adjustment is corrected in the adjusted close. This data is not provided for this project to let you understand what is bad quality data and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems._
@@ -73,11 +73,11 @@ There are four parts:

## 2. Data wrangling and preprocessing

- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
- Create a Jupyter Notebook to analyze the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:

- Missing values analysis
- Outliers analysis (there are a lot of outliers)
- One of average price for companies for all variables (save the plot with the images).
- Visualize and analyze the average price for companies over time or compare the price consistency across different companies within the dataset. Save the plot as an image.
- Describe at least 5 outliers ('ticker', 'date', 'price'). Put them, with these 3 fields, in an `outliers.txt` file in the `results` folder.

_Note: create functions that generate the plots and save them in the `images` directory. Add a parameter `plot` with a default value `False` which doesn't return the plot. This will be useful for the correction to let people run your code without overriding your plots._
10 changes: 7 additions & 3 deletions subjects/ai/backtesting-sp500/audit/README.md
@@ -38,7 +38,9 @@ project

###### Does the notebook contain a Histogram of average price for companies for all variables (saved the plot with the images)? This is required only for **prices.csv** data.

###### Does the notebook describe at least 5 outliers ('ticker', 'date', price) ? To check the outliers it is simple: Search the historical stock price on Google at the given date and compare. The price may fluctuate a bit. The goal here is not to match the historical price found on Google but to detect a huge difference between the price in our data and the real historical one.
###### Does the notebook identify and describe at least 5 outliers ('ticker', 'date', 'price') by cross-referencing historical stock prices from an external source such as Google?

> Note: This approach aims not to precisely match the historical price but to detect substantial differences between our dataset and verified historical data.
##### Notes:

@@ -57,7 +59,7 @@ project

##### 2. preprocessing.py

###### Is the data agregated on a monthly period and only the last element is kept?
###### Is the data aggregated on a monthly period and only the last element is kept?

###### Are the outliers filtered out by removing all prices bigger than 10k$ or smaller than 0.1$?
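
A minimal sketch of the two checks above, assuming a long-format DataFrame with hypothetical columns `date`, `ticker` and `price` (the project's actual layout may differ). The function names are illustrative only, and keeping NaN prices at this stage is an assumption made so that a later fill step can handle them:

```python
import pandas as pd


def monthly_last(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the last observation of each month for every ticker."""
    df = df.sort_values(["ticker", "date"])
    month = df["date"].dt.to_period("M").rename("month")
    # tail(1) keeps the literal last row of each (ticker, month) group,
    # even when its price is NaN.
    return df.groupby(["ticker", month]).tail(1).reset_index(drop=True)


def drop_price_outliers(df: pd.DataFrame, low: float = 0.1,
                        high: float = 10_000.0) -> pd.DataFrame:
    """Remove rows whose price is above `high` or below `low`."""
    in_range = df["price"].between(low, high)
    # Assumption: missing prices are kept here and handled later.
    return df[in_range | df["price"].isna()].reset_index(drop=True)


# Hypothetical sample data, not taken from the project files.
raw = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-01-31", "2020-02-28",
                            "2020-01-31", "2020-02-28"]),
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "price": [10.0, 11.0, 20_000.0, 0.05, 5.0],
})

monthly = monthly_last(raw)
clean = drop_price_outliers(monthly)
```

Here `monthly` holds one row per ticker and month, and `clean` drops the 20,000 and 0.05 prices.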

@@ -67,7 +69,9 @@ project

###### Are outlier returns set to NaN, except for returns in the years 2008 and 2009? The filters are: return > 1 or return < -0.5.
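
One way this check could be implemented, sketched under the assumption of a DataFrame with hypothetical `date` and `return` columns. The function name and its parameters are illustrative and are not part of the project:

```python
import numpy as np
import pandas as pd


def mask_return_outliers(df: pd.DataFrame, upper: float = 1.0,
                         lower: float = -0.5,
                         exempt_years=(2008, 2009)) -> pd.DataFrame:
    """Set outlier returns to NaN, except in the exempt years."""
    out = df.copy()
    is_outlier = (out["return"] > upper) | (out["return"] < lower)
    is_exempt = out["date"].dt.year.isin(list(exempt_years))
    out.loc[is_outlier & ~is_exempt, "return"] = np.nan
    return out


# Hypothetical sample data, not taken from the project files.
returns = pd.DataFrame({
    "date": pd.to_datetime(["2007-06-29", "2008-10-31",
                            "2010-01-29", "2011-03-31"]),
    "return": [1.5, -0.6, 0.1, -0.7],
})

masked = mask_return_outliers(returns)
```

In this sample the 2007 and 2011 outliers become NaN, while the -0.6 return from 2008 is left untouched.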

###### Are the missing values filled using the last value available **for the company**. **df.fillna(method='ffill')** is wrong because the previous value can be the return or price of another company.
###### Are the missing values filled using the last value available **for the company**?

Note: Simply applying `df.fillna(method='ffill')` might be insufficient if the DataFrame isn't properly grouped or sorted by company/ticker. Ensure that the missing values are filled within each company's data independently to avoid potential mixing of information from different companies.

###### Are the missing values that can't be filled using the previous existing value dropped?
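
The last two checks can be illustrated with a short sketch, assuming a long-format DataFrame with hypothetical `date`, `ticker` and `price` columns. The helper name `fill_and_drop` is made up for this example:

```python
import numpy as np
import pandas as pd


def fill_and_drop(df: pd.DataFrame, value_col: str = "price") -> pd.DataFrame:
    """Forward-fill within each ticker, then drop what is still missing."""
    out = df.sort_values(["ticker", "date"]).copy()
    # Grouping by ticker stops a value leaking from one company to the next.
    out[value_col] = out.groupby("ticker")[value_col].ffill()
    return out.dropna(subset=[value_col]).reset_index(drop=True)


# Hypothetical sample data, already sorted by ticker and date.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29",
                            "2020-01-31", "2020-02-29"]),
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "price": [10.0, np.nan, np.nan, 20.0],
})

# Naive fill over the whole column: BBB's first price is taken from AAA.
naive = prices["price"].ffill()

# Per-ticker fill: BBB's first price has no predecessor, so the row is dropped.
result = fill_and_drop(prices)
```

Comparing `naive` with `result` shows the difference: the naive fill assigns AAA's price to BBB's first row, while the grouped version leaves it missing and then drops it.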
