This project predicts automobile prices from a dataset of vehicle attributes. It covers data analysis, preprocessing, modeling, and model interpretability, with the primary goal of developing a regression model that predicts price accurately.
The dataset used in this project is 'autos.csv'. It contains information about automobiles, with columns such as 'price' (the prediction target), 'yearOfRegistration', 'powerPS', 'kilometer', 'model', 'vehicleType', 'gearbox', and more. The dataset is used to train and evaluate the regression model.
The project begins with data analysis, where we explore the dataset to understand its structure. This includes examining the shape, columns, data types, and summary statistics of the dataset.
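The exploration step described above might look roughly like the sketch below. Because 'autos.csv' is not bundled here, the snippet builds a small stand-in DataFrame: the column names come from this README, but every value and the `summarize` helper are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for autos.csv. With the real file you would
# start from pd.read_csv("autos.csv") instead (encoding and separator are
# not stated in this README, so check them first).
df = pd.DataFrame(
    {
        "price": [4500, 12000, 800, 23000],
        "yearOfRegistration": [2005, 2012, 1998, 2016],
        "powerPS": [105, 150, 60, 190],
        "kilometer": [150000, 80000, 150000, 30000],
        "gearbox": ["manual", "automatic", None, "automatic"],
    }
)


def summarize(frame):
    """Collect the structural facts inspected during data analysis."""
    return {
        "shape": frame.shape,                          # (rows, columns)
        "columns": list(frame.columns),                # column names
        "dtypes": frame.dtypes.astype(str).to_dict(),  # data type per column
        "missing": frame.isna().sum().to_dict(),       # missing values per column
        "describe": frame.describe(),                  # stats for numeric columns
    }


summary = summarize(df)
print(summary["shape"])
print(summary["dtypes"])
print(summary["missing"])
print(summary["describe"])
```

Counting missing values at this stage is a convenience added here; it makes the later decision about how to handle missing categorical values easier to justify.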
Data preprocessing involves several steps:
- Handling missing values in categorical features.
- Extracting date-related features from 'dateCreated', 'dateCrawled', and 'lastSeen'.
- Handling outliers in specific columns.
- Feature selection to remove constant and quasi-constant features.
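The four preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the project's exact pipeline: the data is invented, the `"unknown"` placeholder, the 1.5 * IQR outlier rule, the 0.99 quasi-constant threshold, and the `nrOfPictures` column are all choices made for this example.

```python
import pandas as pd

# Hypothetical rows; column names follow the README where possible,
# values are invented for illustration.
df = pd.DataFrame(
    {
        "price": [4500, 12000, 800, 23000, 9_999_999, 6500],
        "powerPS": [105, 150, 60, 190, 120, 90],
        "gearbox": ["manual", "automatic", None, "automatic", "manual", None],
        "dateCreated": ["2016-03-24", "2016-03-25", "2016-04-01",
                        "2016-04-02", "2016-03-30", "2016-03-28"],
        "nrOfPictures": [0, 0, 0, 0, 0, 0],  # assumed constant column
    }
)

# 1. Missing categorical values: fill with an explicit placeholder category.
cat_cols = ["gearbox"]
df[cat_cols] = df[cat_cols].fillna("unknown")

# 2. Date features: parse the timestamp, keep year, month, and weekday.
created = pd.to_datetime(df["dateCreated"])
df["created_year"] = created.dt.year
df["created_month"] = created.dt.month
df["created_weekday"] = created.dt.dayofweek  # Monday = 0
df = df.drop(columns=["dateCreated"])

# 3. Outliers: keep rows whose price lies inside the 1.5 * IQR fences.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True)


# 4. Feature selection: drop constant and quasi-constant features.
def quasi_constant_columns(frame, threshold=0.99):
    """Columns whose most frequent value covers at least `threshold` of rows."""
    return [
        col
        for col in frame.columns
        if frame[col].value_counts(normalize=True, dropna=False).iloc[0] >= threshold
    ]


dropped = quasi_constant_columns(df.drop(columns=["price"]))
df = df.drop(columns=dropped)

print("dropped:", dropped)
print(df)
```

In this toy data every listing was created in the same year, so `created_year` is removed alongside `nrOfPictures`; on real data the threshold deserves tuning, since a rare but informative value can hide inside a quasi-constant column.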
The project implements and evaluates several regression models, including:
- Linear Regression
- Decision Tree Regression
- Random Forest Regression
- XGBoost Regression
The models are trained to predict automobile prices based on the dataset's features.
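As a rough sketch of how the listed models could be trained side by side, the snippet below fits each one on synthetic data. The features, the price formula, and all hyperparameters are assumptions made for illustration; XGBoost is included only when the `xgboost` package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed dataset (not real autos.csv data).
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(0, 20, n)            # years since registration
power = rng.uniform(50, 250, n)        # engine power in PS
km = rng.uniform(5_000, 200_000, n)    # mileage
X = np.column_stack([age, power, km])
# Assumed price formula plus noise, used only to have something to learn.
y = 20_000 - 700 * age + 60 * power - 0.05 * km + rng.normal(0, 500, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# XGBoost is an optional dependency; skip it if the package is missing.
try:
    from xgboost import XGBRegressor

    models["XGBoost"] = XGBRegressor(
        n_estimators=200, learning_rate=0.1, random_state=0
    )
except ImportError:
    pass

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on held-out data
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to add or swap estimators while reusing the same train/test split for a fair comparison.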
The models' performance is evaluated on a test dataset. Metrics like R-squared (R2), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are used to assess model accuracy and generalization.
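To make the three metrics concrete, here is a small NumPy sketch that computes them from their definitions on invented numbers. The `regression_metrics` helper is hypothetical; scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # mean squared error
    rmse = np.sqrt(mse)                              # same units as the price
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Invented prices: every prediction is off by exactly 1000.
y_true = [10_000, 15_000, 20_000, 25_000]
y_pred = [11_000, 14_000, 21_000, 24_000]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is often the easiest of the three to communicate because it is expressed in the same unit as the price, while R-squared is unitless and describes the share of price variation the model explains.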
To gain insights into the models' decision-making processes, SHAP (SHapley Additive exPlanations) values are used for interpretability. SHAP values help understand feature importances and the impact of each feature on predictions.
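To illustrate what a Shapley value measures, the sketch below computes exact Shapley values by brute force for a toy three-feature price function. It does not use the `shap` library and is not how the project computes its values: `toy_price`, the input, and the baseline are invented, and treating an "absent" feature as its baseline value is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        # Evaluate the model with only the features in `subset` set to x.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_price(z):
    """Hypothetical linear price model over (age, powerPS, kilometer)."""
    age, power, km = z
    return 20_000 - 700 * age + 60 * power - 0.05 * km


x = [5, 150, 80_000]           # car being explained (invented)
baseline = [10, 100, 120_000]  # reference car (invented)

phi = shapley_values(toy_price, x, baseline)
for name, contribution in zip(["age", "powerPS", "kilometer"], phi):
    print(f"{name}: {contribution:+.1f}")
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that makes SHAP values readable as a per-feature breakdown of a single prediction. Enumerating every feature subset grows exponentially with the number of features, which is why practical tools rely on model-specific or approximate algorithms instead.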
You can use this project to:
- Predict automobile prices based on given attributes.
- Analyze the importance of different features in predicting prices.
- Customize and improve the regression models.
Feel free to adapt and extend the code and analysis for your specific needs.