This project was developed on data collected from InsideAirBnB, which is separated into listings and reviews (comments on the listings).
- data_analysis: contains notebooks with the analysis of the data.
- dataset: contains the data and some custom classes to work with them.
- listings: contains the data related to the listings.
- comments: contains the data related to the listings' reviews (comments).
- embeddings: contains the processed data, stored as pickles, produced in the intermediate steps of the data processing.
- model/models: contains the custom neural network models developed for the project.
- processing: contains notebooks used to preprocess the data, cleaning it or generating embeddings using sentence models.
- utils: contains various utilities to process the data; of special note are the amenities, a special field present in the listings that is processed using clustering (see the sketch after this list).
- visualization: contains custom modules to visualize the data.
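For reference, the amenities clustering mentioned in utils could look roughly like the sketch below. This is a minimal illustration assuming sentence-transformers and scikit-learn; the model name, the amenity strings, and the number of clusters are placeholders, not the values used in the actual code.

```python
# Minimal sketch: group similar amenity strings by clustering their
# sentence embeddings (illustrative; the real logic lives in utils).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

amenities = ["Wifi", "Fast wifi", "Kitchen", "Full kitchen"]  # placeholder values

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
embeddings = model.encode(amenities)

kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0)
labels = kmeans.fit_predict(embeddings)
print(dict(zip(amenities, labels)))  # e.g. wifi variants vs. kitchen variants
```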
This code has been developed on Python 3.11.3; we recommend using a matching major version.
Environment setup
- Using Pip
pip install -r requirements.txt
- Using Conda
conda env create -f environment.yml
Place the datasets downloaded from InsideAirBnB inside the dataset folders:
- The listings.csv must be placed inside the dataset/listings folder; you can place more than one, since all the CSV files in the folder will be used (already provided in the folder).
- The reviews.csv must be placed inside the dataset/comments folder; you can place more than one, since all the CSV files in the folder will be used (already provided in the folder). See the loading sketch below.
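Since every CSV in each folder is picked up, loading the raw data amounts to something like this sketch (pandas assumed; folder names as above):

```python
# Sketch: read and concatenate every CSV found in the dataset folders.
from pathlib import Path
import pandas as pd

listings = pd.concat(
    (pd.read_csv(p) for p in Path("dataset/listings").glob("*.csv")),
    ignore_index=True,
)
reviews = pd.concat(
    (pd.read_csv(p) for p in Path("dataset/comments").glob("*.csv")),
    ignore_index=True,
)
```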
Run the processing steps in order (illustrative sketches follow the list):
- step1_merge_listings_comments.ipynb: joins the listings and reviews together.
- step2_process_columns.ipynb: generates the processed ordinal and numeric data, e.g., prices or listing type (Apartment, Home, etc.), as well as embeddings for the listings' textual columns.
- step3_process_comments.ipynb: generates the embeddings for the reviews (comments) of the listings.
- step4_extraction_of_test_set.ipynb: merges the embeddings and the pre-processed data and generates the train and test dataset files.
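Conceptually, the first three steps join each review to its listing and embed the free text with a sentence model. The sketch below is an outline under stated assumptions, not the notebooks' exact code: the column names follow the public InsideAirBnB schema (id, listing_id, comments), and the model choice and pickle path are hypothetical.

```python
# Sketch of steps 1-3: merge reviews with listings, embed review text,
# and pickle the intermediate result for later steps.
import pandas as pd
from sentence_transformers import SentenceTransformer

listings = pd.read_csv("dataset/listings/listings.csv")
reviews = pd.read_csv("dataset/comments/reviews.csv")

# step 1: connect each review (comment) to its listing.
merged = reviews.merge(
    listings, left_on="listing_id", right_on="id", suffixes=("_review", "")
)

# steps 2-3: embed the review text with a sentence model.
model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice
embeddings = model.encode(merged["comments"].fillna("").tolist())
pd.to_pickle(embeddings, "dataset/embeddings/comment_embeddings.pkl")  # hypothetical path
```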
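Step 4's extraction of a test set is conceptually a standard hold-out split. A minimal sketch with scikit-learn follows; the file names and split ratio are illustrative:

```python
# Sketch of step 4: split the assembled dataset into train and test files.
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_pickle("dataset/embeddings/merged_dataset.pkl")  # hypothetical file
train, test = train_test_split(dataset, test_size=0.2, random_state=42)
train.to_pickle("dataset/embeddings/train.pkl")
test.to_pickle("dataset/embeddings/test.pkl")
```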
Simply open any notebook in data_analysis and run it; just remember that the analysis requires the embeddings to be computed, so you need to run the data pre-processing first.
You can simply run any notebook to evaluate our models on the data provided; the experiments are separated as follows: