diff --git a/subjects/ai/backtesting-sp500/README.md b/subjects/ai/backtesting-sp500/README.md
index 6ff25b8000..54fca3fb91 100644
--- a/subjects/ai/backtesting-sp500/README.md
+++ b/subjects/ai/backtesting-sp500/README.md
@@ -2,7 +2,7 @@
 
 ## SP500 data preprocessing
 
-The goal of this project is to perform a Backtest on the SP500 constituents. The SP500 is an index the 500 biggest capitalization in the US.
+The goal of this project is to perform a Backtest on the SP500 constituents, which represents the 500 largest companies by market capitalization in the United States.
 
 ## Data
 
@@ -12,12 +12,12 @@
 The input files are:
 
 - [`stock_prices.csv`](./data/stock_prices.csv): contains the close prices for all the companies that had been in the SP500. It contains a lot of missing
-  data.
+  data.
 
   The adjusted close price may be unavailable for three main reasons:
 
   - The company doesn't exist at date `d`
-  - The company is not public
+  - The company is not publicly traded
   - Its close price hasn't been reported
 
 _Note: The quality of this data set is not good: some prices are wrong, there are some prices spikes, there are some prices adjustments (share split, dividend distribution) - the price adjustment is corrected in the adjusted close. This data is not provided for this project to let you understand what is bad quality data and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems._
@@ -73,11 +73,11 @@ There are four parts:
 
 ## 2. Data wrangling and preprocessing
 
-- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
+- Create a Jupyter Notebook to analyze the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
 
   - Missing values analysis
   - Outliers analysis (there are a lot of outliers)
-  - One of average price for companies for all variables (save the plot with the images).
+  - Visualize and analyze the average price for companies over time or compare the price consistency across different companies within the dataset. Save the plot as an image.
   - Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in `outliers.txt` file with the 3 fields on the folder `results`.
 
 _Note: create functions that generate the plots and save them in the `images` directory. Add a parameter `plot` with a default value `False` which doesn't return the plot. This will be useful for the correction to let people run your code without overriding your plots._
diff --git a/subjects/ai/backtesting-sp500/audit/README.md b/subjects/ai/backtesting-sp500/audit/README.md
index 04475ae83c..be69537f3a 100644
--- a/subjects/ai/backtesting-sp500/audit/README.md
+++ b/subjects/ai/backtesting-sp500/audit/README.md
@@ -38,7 +38,9 @@ project
 
 ###### Does the notebook contain a Histogram of average price for companies for all variables (saved the plot with the images)? This is required only for **prices.csv** data.
 
-###### Does the notebook describe at least 5 outliers ('ticker', 'date', price) ? To check the outliers it is simple: Search the historical stock price on Google at the given date and compare. The price may fluctuate a bit. The goal here is not to match the historical price found on Google but to detect a huge difference between the price in our data and the real historical one.
+###### Does the notebook identify and describe at least 5 outliers ('ticker', 'date', 'price') by cross-referencing historical stock prices from an external source such as Google?
+
+> Note: This approach aims not to precisely match the historical price but to detect substantial differences between our dataset and verified historical data.
 
 ##### Notes:
 
@@ -57,7 +59,7 @@ project
 
 ##### 2. preprocessing.py
 
-###### Is the data agregated on a monthly period and only the last element is kept?
+###### Is the data aggregated on a monthly period and only the last element is kept?
 
 ###### Are the outliers filtered out by removing all prices bigger than 10k$ and smaller than 0.1$?
 
@@ -67,7 +69,9 @@ project
 
 ###### Are the outliers in the returns data set to NaN for all returns not in the years 2008 and 2009? The filters are: return > 1 and return < -0.5.
 
-###### Are the missing values filled using the last value available **for the company**. **df.fillna(method='ffill')** is wrong because the previous value can be the return or price of another company.
+###### Are the missing values filled using the last value available **for the company**?
+
+Note: Simply applying `df.ffill()` might be insufficient if the DataFrame isn't properly grouped or sorted by company/ticker. Ensure that the missing values are filled within each company's data independently to avoid potential mixing of information from different companies.
 
 ###### Are the missing values that can't be filled using a the previous existing value dropped?