Merge pull request #734 from siddhant4ds/hotel-booking-demand
Hotel Booking Demand Prediction
abhisheks008 authored Nov 14, 2024
2 parents e15277c + 4efab84 commit 537ba2e
Showing 18 changed files with 119,535 additions and 0 deletions.
7 changes: 7 additions & 0 deletions Hotel Booking Demand Prediction/Dataset/README.md
@@ -0,0 +1,7 @@
# Hotel Booking Demand Dataset

**Source**: [Hotel Booking Demand - Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

**Description**:
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
All personally identifying information has been removed from the data.
119,391 changes: 119,391 additions & 0 deletions Hotel Booking Demand Prediction/Dataset/hotel_bookings.csv

Large diffs are not rendered by default.

102 changes: 102 additions & 0 deletions Hotel Booking Demand Prediction/Model/README.md
@@ -0,0 +1,102 @@
# Hotel Booking Demand Prediction

## 🎯 **Goal**

Predict booking cancellations from booking data in order to estimate demand for hotel rooms.

## 🧵 **Dataset**

[Hotel Booking Demand Dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

## 🧾 **Description**

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. The problem is binary classification of cancellation status to estimate hotel booking demand.

## 🧮 **What I have done**

1. Exploratory analysis of features: cleaning, preprocessing, and data visualization.
2. Feature engineering:
   * re-categorizing categorical features based on target splits
   * target-encoding high-cardinality categorical features
   * discretizing numerical features with a low number of unique values
3. Feature selection:
   * statistical tests: Pearson correlation, mutual information scores, ANOVA F-test, chi-squared test of independence
   * model-based feature importances using Extremely Randomized Trees
4. Created a holdout set for testing using stratified sampling to preserve the class imbalance ratio.
5. Trained and validated: Logistic Regression, Naive Bayes, K-Nearest Neighbours, Decision Tree, Random Forest, AdaBoost, Multi-Layer Perceptron, and gradient-boosted trees (XGBoost, CatBoost, LightGBM).
6. Ensembled models by averaging their predictions under different configurations.
7. Tuned and evaluated models on ROC-AUC score instead of accuracy, since the target classes are imbalanced.
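
The target-encoding step in the feature engineering list can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the helper names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, the toy rows, and the use of `country` and `is_canceled` as column names are all assumptions made for the example. The mapping is fitted on training rows only, so the holdout target never leaks into the encoded feature.

```python
import pandas as pd


def fit_target_encoding(train, column, target, smoothing=10.0):
    """Return (category -> smoothed target mean, global target mean).

    Hypothetical helper: rare categories are shrunk towards the global mean.
    """
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed.to_dict(), global_mean


def apply_target_encoding(frame, column, mapping, default):
    """Encode a column; categories unseen during fitting get the default."""
    return frame[column].map(mapping).fillna(default)


# Toy rows with assumed column names, for illustration only.
train = pd.DataFrame(
    {
        "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA"],
        "is_canceled": [1, 1, 0, 0, 0, 1],
    }
)
test = pd.DataFrame({"country": ["PRT", "ESP"]})

mapping, global_mean = fit_target_encoding(
    train, "country", "is_canceled", smoothing=2.0
)
train_encoded = apply_target_encoding(train, "country", mapping, global_mean)
test_encoded = apply_target_encoding(test, "country", mapping, global_mean)
```

In this toy data the unseen category `ESP` falls back to the global cancellation rate, while `PRT` is pulled from its raw mean towards that rate. In practice an out-of-fold variant is often preferred for the training rows themselves.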

## 🚀 **Models Implemented**

* Logistic Regression
* Naive Bayes: Gaussian
* K-Nearest Neighbours
* Decision Tree
* Random Forest
* AdaBoost
* Neural network: Multi-layer Perceptron
* Gradient-boosting models: XGBoost, CatBoost, LightGBM
* Model Ensembling: Simple/Power/Weighted averaging
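
The three averaging schemes named in the last item can be sketched with NumPy. This is an illustrative sketch only: the function names, the exponent used for power averaging, the weights, and the toy probabilities are assumptions, since the README does not state the configurations that were actually used.

```python
import numpy as np


def simple_average(preds):
    """Unweighted mean of per-model probabilities (rows are models)."""
    return np.mean(preds, axis=0)


def power_average(preds, power=2.0):
    """Mean of probabilities raised to a power; power > 1 favours confident scores."""
    return np.mean(np.power(preds, power), axis=0)


def weighted_average(preds, weights):
    """Weighted mean; weights are normalized internally by np.average."""
    return np.average(preds, axis=0, weights=weights)


# Toy predicted cancellation probabilities: 3 models x 3 bookings.
preds = np.array(
    [
        [0.9, 0.2, 0.6],
        [0.8, 0.4, 0.5],
        [0.7, 0.3, 0.1],
    ]
)
weights = [0.5, 0.3, 0.2]  # assumed weights, e.g. chosen from validation scores

simple = simple_average(preds)
powered = power_average(preds, power=2.0)
weighted = weighted_average(preds, weights)
```

Power-averaged outputs are no longer calibrated probabilities, but ROC-AUC depends only on how the scores rank the samples, so they can still be compared on that metric.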

## 📚 **Libraries Needed**

* Pandas
* Numpy
* Scikit-learn
* XGBoost
* CatBoost
* LightGBM
* Matplotlib
* Seaborn

## 📊 **Exploratory Data Analysis Results**

**Feature distributions**
![Image](../Images/featdist_leadtime.png)
![Image](../Images/featdist_arrivalweek.png)
![Image](../Images/featdist_arrivaldayofmonth.png)
![Image](../Images/featdist_staysweekend.png)
![Image](../Images/featdist_staysweekday.png)
![Image](../Images/featdist_totalstay.png)
![Image](../Images/featdist_adults.png)
![Image](../Images/featdist_adr.png)

**Feature selection**:
Correlation between features:
![Image](../Images/featselect_corrfeatures.png)
Correlation with target:
![Image](../Images/featselect_corrtarget.png)
Mutual Information:
![Image](../Images/featselect_mutualinfo.png)
Model-based feature importances:
![Image](../Images/featselect_modelfimp.png)
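
A rough sketch of how scores like those plotted above can be computed with scikit-learn is shown here. It uses synthetic data from `make_classification` rather than the hotel bookings, and the feature names, sample size, class weights, and number of trees are placeholders chosen for the example, not values taken from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic, mildly imbalanced stand-in data; with shuffle=False the first
# three columns are the informative ones.
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    weights=[0.65, 0.35],
    shuffle=False,
    random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]

# Statistical scores: mutual information and the ANOVA F-statistic.
mi = mutual_info_classif(X, y, random_state=0)
f_stat, p_values = f_classif(X, y)

# Model-based importances from Extremely Randomized Trees.
trees = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

scores = pd.DataFrame(
    {
        "mutual_info": mi,
        "anova_f": f_stat,
        "extra_trees_importance": trees.feature_importances_,
    },
    index=names,
).sort_values("mutual_info", ascending=False)
print(scores)
```

Comparing several rankings side by side helps, because a feature that scores low on one linear statistic may still carry signal that a tree model picks up.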

## 📈 **Performance of the Models**

Models were evaluated on ROC-AUC score because of the imbalanced class ratio.

| Model configuration | ROC-AUC Score
|:-----|:-----:
| Logistic Regression | 0.8470
| Gaussian Naive Bayes | 0.7944
| K-Nearest Neighbours | 0.8810
| Decision Tree | 0.8820
| Random Forest | 0.8958
| AdaBoost | 0.8959
| Multi-layer Perceptron | 0.9039
| XGBoost | 0.9138
| LightGBM | 0.9146
| CatBoost | 0.9154
| Simple averaging | 0.9108
| Power averaging | 0.9062
| **Weighted averaging** | **0.9159**
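
The evaluation setup behind a table like this, a stratified holdout scored with ROC-AUC, might look like the sketch below. It runs on synthetic data with an assumed 80/20 class split and a plain logistic regression, so the numbers it produces are unrelated to the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the booking features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities of the positive class,
# while accuracy is computed from hard class labels.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"positive rate train={y_train.mean():.3f} test={y_test.mean():.3f}")
print(f"ROC-AUC={auc:.4f} accuracy={accuracy:.4f}")
```

With imbalanced classes, a model that always predicts the majority class can reach high accuracy while ranking samples no better than chance, which is why ROC-AUC is the more informative metric here.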

## 📢 **Conclusion**

A variety of models were trained and then combined into ensembles using averaging methods. Because the classes are imbalanced, ROC-AUC score was used for evaluation. The weighted-averaging ensemble performed best, with a ROC-AUC of 0.9159.

## ✒️ **Your Signature**

Siddhant Tiwari
([GitHub](https://www.github.com/siddhant4ds) - [Kaggle](https://www.kaggle.com/sid4ds) - [LinkedIn](https://www.linkedin.com/in/siddhant-tiwari-ds/))

26 changes: 26 additions & 0 deletions Hotel Booking Demand Prediction/README.md
@@ -0,0 +1,26 @@
# Hotel Booking Demand Prediction

## Project structure

.
├── Dataset
│   ├── hotel_bookings.csv
│   └── README.md
├── Images
│   ├── featdist_adr.png
│   ├── featdist_adults.png
│   ├── featdist_arrivaldayofmonth.png
│   ├── featdist_arrivalweek.png
│   ├── featdist_leadtime.png
│   ├── featdist_staysweekday.png
│   ├── featdist_staysweekend.png
│   ├── featdist_totalstay.png
│   ├── featselect_corrfeatures.png
│   ├── featselect_corrtarget.png
│   ├── featselect_modelfimp.png
│   └── featselect_mutualinfo.png
├── Model
│   ├── eda_modeling_ensembling.ipynb
│   └── README.md
├── requirements.txt
└── README.md
8 changes: 8 additions & 0 deletions Hotel Booking Demand Prediction/requirements.txt
@@ -0,0 +1,8 @@
pandas==2.2.1
numpy==1.26.4
matplotlib==3.8.4
seaborn==0.13.2
scikit-learn==1.5.0
xgboost==2.1.0
catboost==1.2.5
lightgbm==4.5.0
