Install Python 3 and Jupyter Notebook on your system, copy all three directories to your machine, navigate to a directory in your command prompt, and run jupyter notebook. Make sure all the necessary Python libraries, such as scikit-learn and Yellowbrick, are installed.
There are three tasks, each consisting of a dataset, and our objective is to make accurate predictions using classifiers. Proper preprocessing should be done before building a model, and the model should make predictions that are accurate without overfitting the data.
- t1dectree.ipynb
- t1.naivebayes.ipynb
- t1randfor.ipynb
- t1svm.ipynb
- version2.ipynb
- version3.ipynb
- bayes.ipynb
- sgd.ipynb
- nearcent.ipynb
- dectree.ipynb
- knear.ipynb
- randfor.ipynb
- svm.ipynb
- labprop.ipynb
- c3bayes.ipynb
- c3dectree.ipynb
- c3knear.ipynb
Files: Training.txt, label_training.txt
Consists of sparse data, stored by omitting the rows whose value is 0.
Data Format: Training.txt
- Information Id
- Feature Id
- Value of the feature
Data Format: label_training.txt
- Classes: 1, -1
- 1 -> information
- -1 -> misinformation
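The triplet format above (information id, feature id, value) can be loaded into a sparse matrix; a minimal sketch, with a toy in-memory array standing in for the contents of Training.txt (the variable names are our own, not from the project code):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy stand-in for Training.txt: one (info_id, feature_id, value) row per entry.
triplets = np.array([
    [0, 1, 3.0],
    [0, 4, 1.0],
    [1, 2, 5.0],
])
rows = triplets[:, 0].astype(int)
cols = triplets[:, 1].astype(int)
vals = triplets[:, 2]

# Entries not listed in the file are implicitly zero.
X = coo_matrix((vals, (rows, cols)), shape=(2, 5)).toarray()
```

For the real file, `np.loadtxt("Training.txt")` would produce the same triplet array, and the labels in label_training.txt load the same way.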
Our goal was to fit the data using classification models, compute the accuracy of the predictions, and determine the optimal parameters.
- Large number of features: n(features) > n(records).
- Sparse data: most values are zero.
- Using pd.get_dummies.
- Using pd.pivot_table.
Used pd.pivot_table and filled the NaN values with zeros (since the data is sparse).
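The pivot-and-fill step can be sketched as follows, on a toy long-format frame (the column names here are illustrative, not the dataset's actual ones):

```python
import pandas as pd

# Toy long-format data: one row per (record, feature, value) entry.
long_df = pd.DataFrame({
    "info_id":    [0, 0, 1],
    "feature_id": [1, 4, 2],
    "value":      [3.0, 1.0, 5.0],
})

# One row per record, one column per feature; absent cells become NaN...
wide = long_df.pivot_table(index="info_id", columns="feature_id", values="value")
# ...which are filled with zeros, since the data is sparse.
wide = wide.fillna(0)
```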
- Truncated Singular Value Decomposition (SVD): a matrix factorization technique that factors a matrix M into the three matrices U, Σ, and V, returning the principal components (features) of the input dataframe or matrix.
- Number of features returned: the number specified in the function.
- Suitable for sparse data with a large number of features.
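A minimal sketch of Truncated SVD on a sparse matrix with scikit-learn; the matrix here is random toy data and the component count is illustrative, not the value used in the project:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Toy sparse matrix: 100 records, 50 mostly-zero features.
X = sparse_random(100, 50, density=0.05, random_state=0)

# Reduce to 10 components; unlike PCA, TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)
```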
- Decision Tree Classifier
  - Parameters tuned: max_depth = 4, min_samples_split = 276, criterion = 'gini'
- Support Vector Machine Classifier (SVM)
  - Parameter tuned: kernel = 'linear'
- Naive Bayes (Bernoulli and Gaussian)
  - Parameters tuned: none
- Random Forest Classifier
  - Parameters tuned using RandomizedSearchCV
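A sketch of the classifiers with the parameters listed above, plus a RandomizedSearchCV over the random forest; the search ranges and the synthetic data are assumptions for illustration, not the project's actual values:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the preprocessed training matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Decision tree and SVM with the tuned parameters from the list above.
tree = DecisionTreeClassifier(max_depth=4, min_samples_split=276, criterion="gini")
tree.fit(X, y)
svm = SVC(kernel="linear").fit(X, y)

# Random forest tuned via randomized search (ranges here are assumed).
param_dist = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8, None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
best_rf = search.best_estimator_
```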
Data was provided by the NDRC.
- f-fin
- hr_rad_temp_pr_hu_drt-10
Our goal was to predict the solar radiation level at the Mojave Desert Shrub site. We used data such as temperature, air pressure, soil temperature, and time to predict solar radiation levels. Credit goes to the NDRC (Nevada Desert Research Center) for providing the data.
- Many factors that affect solar radiation are hard to quantify in data (e.g., clouds and humidity).
- Discrepancies between sites due to factors such as elevation.
- Accounting for night time in our data (temperature vs. solar radiation) made a huge difference in accuracy.
- Looking for features that have some relation to solar radiation (temperature vs. humidity vs. tipping bucket).
- We don't want to overfit, but we also don't want to exclude valuable information.
- Account for day and night (multiple approaches).
- Used Excel to remove unwanted rows and columns.
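One of the possible day/night approaches can be sketched as a simple hour-of-day mask in pandas; this is an assumption for illustration (the actual handling was done in Excel and may have differed), and the column names are made up:

```python
import pandas as pd

# Toy hourly readings; "solar_rad" drops to zero at night.
df = pd.DataFrame({
    "hour":      [0, 6, 12, 18, 23],
    "temp":      [10.0, 12.0, 25.0, 18.0, 11.0],
    "solar_rad": [0.0, 50.0, 900.0, 100.0, 0.0],
})

# Keep only daytime rows (here assumed to be 6:00-18:00, inclusive),
# so night-time zeros do not drag down the regression fit.
day = df[(df["hour"] >= 6) & (df["hour"] <= 18)]
```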
- Linear Regression
Alternatives:
- Neural Network
- Support Vector Machine
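A minimal linear-regression sketch for this kind of prediction task, on synthetic data (the features stand in for readings such as temperature, air pressure, and soil temperature; none of these numbers come from the NDRC dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 3))                    # stand-ins for the weather features
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)   # synthetic "solar radiation" target

# Hold out a test split and score the fit with R^2.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
```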
Identify harmful radiation:
- UV is identified as the main cause of sunburn.
- However, it does not encompass the whole electromagnetic spectrum presented by solar radiance.
- IARC Working Group on the Evaluation of Carcinogenic Risk to Humans. Radiation. Lyon (FR): International Agency for Research on Cancer; 2012. (IARC Monographs on the Evaluation of Carcinogenic Risks to Humans, No. 100D.) Solar and Ultraviolet Radiation. Available from: Link
Data: Used Auto Purchase Dataset
- EXP REL Custom (dictionary with column names and associated information)
- UsedAutoRELEVATEfirst10000-noLatLong
Our goal was to predict:
- Vehicle Type
- Customer Type
- Data consisting of values of different types: float, string, DateTime, etc.
- Irrelevant features: 280 columns
- Multiclass problem
- Missing values
- Skimmed through the features and removed irrelevant ones.
- Eliminated columns with more than 5166 missing values.
- Dealt with the remaining missing values using the following methods: filling with the mean, with preceding column values, and with zeros.
- Selected columns excluding type float and transformed them to numbers; methods used: Label Encoding, One-Hot Encoding.
- Eliminated features by going through the remaining features of the dataframe.
- Applied the SelectKBest algorithm to select the features that would most accurately predict the labels.
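The cleaning steps above can be sketched on a toy frame; the column names, thresholds, and k value are illustrative stand-ins, not the real dataset's:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy frame with a numeric gap, a string column, and a label.
df = pd.DataFrame({
    "price":   [10000.0, None, 15000.0, 12000.0],
    "mileage": [50000.0, 60000.0, None, 40000.0],
    "make":    ["ford", "toyota", "ford", "honda"],
    "label":   [0, 1, 0, 1],
})

# Fill numeric gaps with the column mean (one of the methods listed above).
for col in ["price", "mileage"]:
    df[col] = df[col].fillna(df[col].mean())

# One-hot encode the non-numeric column.
df = pd.get_dummies(df, columns=["make"])

# Keep the k features most associated with the label.
X = df.drop(columns="label")
y = df["label"]
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)
```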
- Decision Tree Classifier
- K-Nearest Neighbors Classifier
- Gaussian Naive Bayes Classifier
- Support Vector Machine Classifier
- Logistic Regression Classifier
- Label Propagation Classifier
- Multi-Layer Perceptron (Neural Network)
- Random Forest Classifier
- Nearest Centroid Classifier
- Stochastic Gradient Descent Classifier
Visualized results for the best two classifiers.
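One way the top two classifiers might be selected before visualizing them (e.g., with Yellowbrick's report or confusion-matrix visualizers) is a cross-validation comparison; this is a hedged sketch on synthetic data, not the project's actual selection code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Score each candidate with 5-fold cross-validation.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "naive bayes":   GaussianNB(),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

# Keep the two best-scoring models for visualization.
best_two = sorted(scores, key=scores.get, reverse=True)[:2]
```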