The goals of this week are to:
- understand the overall workflow of a machine learning project
- use scikit-learn to implement a supervised classifier for your project
- evaluate your approach on your labeled dataset
Today we will extract some features from our data and perform an initial classification experiment.
See the starter notebook: https://github.com/tapilab/elevate-osna-starter/tree/master/notebooks/W2L1.ipynb
Continue working on your notebook from the last lab. Do the following:
- Use CountVectorizer to create a matrix of all terms. Experiment with the following to see the effect on accuracy:
  - min_df: [1, 2, 5, 10]
  - max_df: [1.0, .95, .8]
  - ngram_range: [(1,1), (1,2), (1,3)]
- Experiment with different regularization settings for LogisticRegression:
  - C: [.1, 1, 5, 10]
  - penalty: [l1, l2]
- Summarize your results with a table for each setting, like this:
C | Accuracy |
---|---|
.1 | xxx |
1 | xxx |
- Vary one parameter at a time, while using the defaults for the rest. Let the defaults be (min_df=2, max_df=1.0, ngram_range=(1,1), C=1, penalty=l2).
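One possible way to organize these experiments is sketched below. It is not the starter notebook's code: the helper names `evaluate` and `sweep`, the toy `texts`/`labels`, and the use of 5-fold cross-validation are all assumptions made for illustration. `max_df` is written as the float 1.0 because scikit-learn treats an integer `max_df` as an absolute document count, and `solver='liblinear'` is chosen because the default solver does not support the l1 penalty.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Default settings from the lab instructions (max_df as a float proportion).
DEFAULTS = dict(min_df=2, max_df=1.0, ngram_range=(1, 1), C=1, penalty='l2')


def evaluate(texts, labels, cv=5, **overrides):
    """Mean cross-validated accuracy for the defaults plus any overrides."""
    p = {**DEFAULTS, **overrides}
    vec = CountVectorizer(min_df=p['min_df'], max_df=p['max_df'],
                          ngram_range=p['ngram_range'])
    # liblinear handles both 'l1' and 'l2' penalties.
    clf = LogisticRegression(C=p['C'], penalty=p['penalty'], solver='liblinear')
    # The pipeline refits the vectorizer inside each training fold.
    scores = cross_val_score(make_pipeline(vec, clf), texts, labels,
                             cv=cv, scoring='accuracy')
    return scores.mean()


def sweep(texts, labels, name, values, cv=5):
    """Vary one parameter, keep the rest at the defaults, print a table."""
    print(f'{name} | Accuracy |')
    print('---|---|')
    results = {}
    for value in values:
        results[value] = evaluate(texts, labels, cv=cv, **{name: value})
        print(f'{value} | {results[value]:.3f} |')
    return results


# Toy stand-in for your labeled dataset (hypothetical data).
texts = ['good great movie'] * 10 + ['bad awful movie'] * 10
labels = [1] * 10 + [0] * 10

results = sweep(texts, labels, 'C', [.1, 1, 5, 10])
```

Calling `sweep` once per parameter (for example `sweep(texts, labels, 'ngram_range', [(1, 1), (1, 2), (1, 3)])`) yields one table per setting while the other parameters stay at their defaults.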
Today we will (1) determine the best version of the classifier that we can find, (2) fit the classifier, (3) load it in the web app, (4) classify the tweets.
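A minimal sketch of steps (2) through (4) follows, assuming the model is saved with Python's `pickle` module and that the vectorizer and classifier are stored together as one pipeline. The toy training data, the file name `clf.pkl`, and the example tweets are hypothetical; the starter web app may load its model differently.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled tweets (hypothetical data).
train_texts = ['good great movie'] * 10 + ['bad awful movie'] * 10
train_labels = [1] * 10 + [0] * 10

# (2) Fit the chosen configuration on all labeled data.
# The penalty is left at its default, which is l2.
model = make_pipeline(
    CountVectorizer(min_df=2, max_df=1.0, ngram_range=(1, 1)),
    LogisticRegression(C=1, solver='liblinear'),
)
model.fit(train_texts, train_labels)

# Save the vectorizer and classifier together so the web app
# applies exactly the same feature extraction at prediction time.
model_path = os.path.join(tempfile.gettempdir(), 'clf.pkl')  # hypothetical path
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

# (3) In the web app: load the fitted pipeline once at startup.
with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

# (4) Classify new tweets directly from their raw text.
new_tweets = ['a great movie', 'an awful movie']
predictions = loaded.predict(new_tweets)
probabilities = loaded.predict_proba(new_tweets)
```

Pickled scikit-learn models should be loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from sources you trust.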