Data Mining

Data recovery and classification techniques using Pandas, scikit-learn, Keras and PyTorch.

This repo contains two projects completed for the Data Mining course:

The first project uses the Red Wine Quality dataset to classify the quality of red wines. Each wine has a set of attributes that can be used to estimate its quality. Quality is scored on a scale from 0 (bad) to 10 (excellent).
The second project uses the Onion or not dataset to classify news headlines as fake or legitimate. The goal is to find patterns in each headline, using the words (or tokens) of all the headlines.

The repo is organised as follows:

Red Wine Quality Project
- data-mining-part-a-svm-simple.ipynb: Jupyter notebook that uses an SVM to classify the quality of the wines in the Red Wine dataset
- data-mining-part-a-svm-with-preprocessing.ipynb: Jupyter notebook that, in addition to classification, preprocesses the data first to help the SVM classifier
- data-mining-part-a-svm-without-pH.ipynb: Jupyter notebook where the pH column is dropped first, and then an SVM classifies the quality column as before
- data-mining-part-a-svm-mean-completion.ipynb: Jupyter notebook where the pH column is distorted. The corrupt cells are then filled in with the mean value of the column in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
- data-mining-part-a-logistic.ipynb: Jupyter notebook where the pH column is distorted. Logistic regression is then used to predict the values of the corrupt cells in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
- data-mining-part-a-kmeans.ipynb: Jupyter notebook where the pH column is distorted. The k-means clustering algorithm is then used to fill in the corrupt cells in an attempt to restore the integrity of the data. Finally, an SVM classifies the red wine quality
- winequality-red.csv: The dataset of the project
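
The data recovery idea behind the mean-completion and k-means notebooks can be sketched roughly as follows. This is a minimal illustration and not the notebooks' actual code: it runs on synthetic data rather than winequality-red.csv, the binary label, the number of clusters, the share of corrupted cells and the helper `svm_accuracy` are all assumptions, and filling each cell with its cluster's mean is only one plausible way to use k-means for imputation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine table: 300 rows, 4 numeric features.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # hypothetical binary quality label

# "Distort" one column (standing in for pH) by blanking about a third of its cells.
PH = 3
missing = rng.random(300) < 0.33
X_corrupt = X.copy()
X_corrupt[missing, PH] = np.nan

# Option 1: fill every missing cell with the mean of the observed values.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_corrupt)

# Option 2: cluster the rows on the intact columns, then fill each missing
# cell with the mean of the observed values in the same cluster.
other = np.delete(X_corrupt, PH, axis=1)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(other)
X_kmeans = X_corrupt.copy()
for c in np.unique(labels):
    in_cluster = labels == c
    observed = X_corrupt[in_cluster & ~missing, PH]
    # Fall back to the global mean if a cluster has no observed values.
    fill = observed.mean() if observed.size else np.nanmean(X_corrupt[:, PH])
    X_kmeans[in_cluster & missing, PH] = fill


def svm_accuracy(features, target):
    """Train a scaled RBF-kernel SVM and return its accuracy on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.25, random_state=0
    )
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_mean = svm_accuracy(X_mean, y)
acc_kmeans = svm_accuracy(X_kmeans, y)
print(f"mean completion: {acc_mean:.3f}, k-means completion: {acc_kmeans:.3f}")
```

For brevity the sketch imputes before splitting the data. A stricter evaluation would fit the imputer on the training rows only, so that no information from the test rows leaks into the filled values.
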

Onion or not Project
- data-mining-part-b-preprocess-data.ipynb: Jupyter notebook where the data is preprocessed. Techniques such as stemming, stopword removal and tokenization prepare the data and make it compatible with the classifier that follows
- Classifiers implemented in two different frameworks for academic purposes:
  - data-mining-part-b-nn-keras.ipynb: Jupyter notebook where a neural network is implemented in Keras to classify the preprocessed data
  - data-mining-part-b-nn-pytorch.ipynb: Jupyter notebook where a neural network is implemented in PyTorch to classify the preprocessed data
- Due to RAM limitations, the classifiers were re-implemented and merged with the data preprocessing. The file exported after data preprocessing was 4 GB in size. Memory measurements showed that merging the two parts into one made the overall code lighter. The merged notebooks are:
  - combined-v.01-heavy.ipynb: Jupyter notebook where a neural network is implemented in PyTorch to classify the preprocessed data. This neural network was heavier than needed, which motivated an attempt to make the classifier lighter
  - combined-v.01-light.ipynb: Jupyter notebook with a lighter implementation of the neural network
- onion-or-not.csv: The dataset of the project
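
The preprocessing steps named above (tokenization, stopword removal, stemming) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebooks' code: the stopword list is a tiny hand-written sample, `crude_stem` is a toy suffix stripper standing in for a real stemmer, the two headlines are invented and not taken from onion-or-not.csv, and all function names are hypothetical. The `batches` generator shows one possible way to feed a classifier without exporting a large preprocessed file first.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "is", "are", "with"}


def tokenize(text):
    """Lowercase the headline and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


def build_vocab(token_lists):
    """Map every distinct token to a column index (alphabetical order)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i for i, tok in enumerate(sorted(counts))}


def vectorize(tokens, vocab):
    """Turn a token list into a bag-of-words count vector."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


def batches(texts, vocab, size):
    """Yield count vectors batch by batch, so the full matrix is never built."""
    for start in range(0, len(texts), size):
        yield [vectorize(preprocess(t), vocab) for t in texts[start:start + size]]


# Invented example headlines, for illustration only.
headlines = [
    "Scientists Discover New Planet Orbiting Nearby Star",
    "Area Man Wins Argument With The Television",
]
vocab = build_vocab(preprocess(h) for h in headlines)
for batch in batches(headlines, vocab, size=1):
    print(batch)
```

Each vector produced this way has one entry per vocabulary token, so a dense export grows with both the number of headlines and the vocabulary size. Vectorizing in batches right before training keeps only one batch in memory at a time.
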

The remaining files in the repo are mainly of use to the developer.

AndreasKaratzas/data_mining

Folders and files

Latest commit

History

Repository files navigation

Data Mining

Data recovery and classification techniques using Pandas, Scikit, Keras and Pytorch.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages