Get the Data
Create Isolated Environment
conda create --name learning      # create a new isolated environment
conda info --envs                 # list all environments
conda activate learning           # switch into the new environment
conda list                        # list the packages installed in it
conda activate base               # switch back to the base environment
conda env remove -n learning      # delete the environment when done
conda install jupyter matplotlib numpy pandas scipy scikit-learn
Get the Data
Download the Data
Write a small function to:
Download a single compressed file, housing.tgz, which contains the data as a comma-separated values (CSV) file
Decompress the file and extract the CSV
It is useful to have a function for these repeatable steps:
if you need to install the dataset on multiple machines
if the data changes regularly
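A minimal sketch of such functions, assuming the dataset lives in the handson-ml2 GitHub repository (adjust DOWNLOAD_ROOT and the paths to your own setup):

import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"  # assumed location
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Download housing.tgz and extract housing.csv into housing_path
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    # Load the extracted CSV into a Pandas DataFrame
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))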
Get the Data
Take a Quick Look at the Data Structure
https://www.youtube.com/watch?v=P4F3PzCMrtk
1. Take a look at the top 5 rows: housing.head()
2. Get a quick description of the data: housing.info()
3. Explore categorical attributes: housing["ocean_proximity"].value_counts()
4. Explore numerical attributes: housing.describe()
std: measures how dispersed the values are;
it is the square root of the variance, which is the average of the squared deviations from the mean
when a feature has a bell-shaped distribution, the "68, 95, 99.7" rule applies:
about 68% of values fall within mean ± std, 95% within mean ± 2·std, and 99.7% within mean ± 3·std
percentiles: indicate the value below which a given percentage of observations fall
5. Plot a histogram for each numerical attribute
y-axis: number of instances
x-axis: given value range
Plot a histogram for a specific attribute: housing['median_house_value'].hist(bins=50, figsize=(20,15))
Plot histograms for all numerical attributes: housing.hist(bins=50, figsize=(20,15))
Notes
1. The median income attribute is:
scaled: expressed in units of roughly $10,000 (so 3 means about $30,000)
capped: at 15 for higher median incomes (actually 15.0001)
and at 0.5 for lower median incomes (actually 0.49999)
2. housing_median_age and median_house_value are also capped
The Machine Learning algorithm may learn that prices never go beyond that limit
If this is a problem, you have 2 options:
a. Collect proper labels for the districts whose labels were capped
b. Remove those districts from the training and test sets
3. The attributes have very different scales; we will explore feature scaling later.
4. Many histograms are tail-heavy; we will try transforming these attributes later to get more bell-shaped distributions, since a tail-heavy shape makes it harder for some Machine Learning algorithms to detect patterns.
Hopefully you now have a better understanding of the kind of data you are dealing with,
but that's enough for now: you need to create a test set, put it aside, and never look at it.
Get the Data
Create a Test Set
Data Snooping Bias
- Your brain is an amazing pattern detection system, which makes it highly prone to overfitting: if you look at the test set,
- you may stumble upon an interesting pattern in the test data that leads you to select a specific model,
- and then the generalization error measured on the test set will be too optimistic
Create a Test Set
Theoretically it is quite simple: just pick some instances randomly, typically 20% of the dataset, and set them aside:
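A minimal sketch of such a random split (split_train_test is an illustrative helper name; housing is the DataFrame loaded earlier):

import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle the row indices, then carve off the first test_ratio portion
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)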
This works, but it is not perfect:
when you run the program again, it will generate a different test set! Over many runs you get to see the whole dataset, which is exactly what you want to avoid.
Solutions
1. Save the test set on the first run, then load it in subsequent runs
2. Set the random number generator's seed: np.random.seed(42)
These work, but they are still not perfect:
both solutions will break the next time you fetch an updated dataset.
Solutions
1. Use each instance's identifier to decide whether or not it should go in the test set
2. Compute a hash of each instance's identifier and put the instance in the test set if the hash is lower than or equal to 20% of the maximum hash value
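A minimal sketch of the hash-based split, assuming the row index is used as the identifier (the helper names are illustrative):

import numpy as np
from zlib import crc32

def test_set_check(identifier, test_ratio):
    # The instance goes to the test set if its hash falls in the lowest
    # test_ratio slice of the 32-bit hash space
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column to act as id
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

This keeps the test set stable across runs: as long as identifiers never change, a refreshed dataset still yields a test set of 20% of the instances that never includes an instance previously seen in training.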
Discover and Visualize the Data to Gain Insights
Visualizing Geographical Data
- Draw the geographical information:
Create a scatterplot of all districts to visualize the data
Play with alpha to make it easier to see the places with a high density of data points
Let the radius of each circle represent the district's population
Let the color represent the price, ranging from blue (low values) to red (high prices)
https://www.youtube.com/watch?v=P4F3PzCMrtk
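A sketch of such a plot (the column names are those of the housing DataFrame; alpha, the population scaling factor, and the colormap are all tunable choices):

import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population", figsize=(10, 7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.show()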
Findings
1. Housing prices are very much related to the location and to the population density
2. We could use a clustering algorithm to detect the main clusters and use that as a new input feature, although prices in the north are not too high
3. We could use the ocean proximity attribute as well
Discover and Visualize the Data to Gain Insights
Looking for Correlations
The correlation coefficient ranges from -1 to 1
Pearson's r tells you whether y tends to go up or down when x goes up
1: strong positive linear correlation
-1: strong negative linear correlation
0: no linear correlation
A scatter matrix plots every numerical attribute against every other numerical attribute
Use it to find the most promising attributes, i.e., those that seem most correlated with the median housing value
The main diagonal would just show each attribute against itself (a straight line), so Pandas draws a histogram of the attribute there instead
median_income is the most promising attribute to predict median_house_value
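A sketch of both steps, assuming housing is the training DataFrame (the attribute list is just one reasonable choice):

from pandas.plotting import scatter_matrix

# Pearson correlation of every numerical attribute with median_house_value
# (numeric_only=True needs pandas >= 1.5; older versions skip non-numeric columns automatically)
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Scatter matrix restricted to the most promising attributes
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))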
Findings
1. The correlation is indeed very strong
2. The upward trend is clear, and the points are not too dispersed
3. The price cap at $500,000 that appeared in the histogram shows up as a horizontal line
4. There are other, fainter horizontal lines around $450,000, $350,000, and $280,000
Discover and Visualize the Data to Gain Insights
Experimenting with Attribute Combinations
You should combine attributes with each other and re-examine the correlations, searching for new relations:
total_rooms/households, total_bedrooms/total_rooms, population/households
Why could these proposed combinations be helpful?
- houses with a lower bedrooms-per-rooms ratio tend to be more expensive
- rooms per household is more informative than the total number of rooms in a district
- larger houses are more expensive
This kind of exploration, combination, and interpretation needs to be an iterative process.
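A sketch of these combinations, using the column names from the housing DataFrame:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Re-examine the correlations with the new attributes included
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))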
Prepare the Data for Machine Learning Algorithm
- Separate the predictors and the labels, since we don't necessarily want to apply the same transformations to both:
housing = strat_train_set.drop('median_house_value', axis=1)
housing_labels = strat_train_set['median_house_value'].copy()
Prepare the Data for Machine Learning Algorithm
Data Cleaning
- Take care of the missing values in total_bedrooms, as most ML algorithms cannot work with missing features. We have 3 options:
1. Remove the corresponding districts
2. Remove the whole attribute
3. Set the missing values to some value (zero, the mean, the median, etc.)
- You need to compute the median on the training set and use it to fill in the missing values
- Save the computed median to use on the test set when you evaluate the model, and on the live system when new data comes in
- Scikit-Learn provides a handy class, SimpleImputer, to fill in missing values with whatever strategy you choose
- Since the median can only be computed on numerical attributes, we need to drop ocean_proximity, as it is categorical
- SimpleImputer computes the median of each attribute and stores the results in its statistics_ attribute
- It is safer to apply the imputer to all numerical attributes
- Call the fit() method to compute the median of each numerical attribute
- Call the transform() method to replace missing values with the medians
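A minimal sketch of the imputation steps described above:

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # numerical attributes only

imputer.fit(housing_num)            # computes the median of each attribute
print(imputer.statistics_)          # the stored medians

X = imputer.transform(housing_num)  # plain NumPy array with missing values filled
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)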
Prepare the Data for Machine Learning Algorithm
Handling Text and Categorical Attributes
- Most Machine Learning algorithms prefer to work with numbers
- Use fit_transform() from OrdinalEncoder to convert text categories to numbers
- Check the categories_ attribute to list all categories
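A sketch with OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]       # double brackets: the encoder expects 2D input
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)               # one array of categories per feature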
Findings
- The Machine Learning algorithm will assume that two nearby values are more similar than two distant values
- This may be fine in some cases (e.g., for ordered categories such as bad, average, good, excellent)
- But it is not the case for ocean_proximity
One-Hot Encoding
Create a vector whose length equals the number of categories
Set 1 (hot) at the index corresponding to the category
Set 0 (cold) everywhere else
Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors
fit_transform() from OneHotEncoder returns a SciPy sparse matrix
https://developpaper.com/instructions-for-using-python-scipy-sparse-matrix/
This is much more efficient in terms of memory usage:
it stores only the locations of the non-zero elements, instead of storing wasteful zeros
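A sketch with OneHotEncoder (housing_cat is the categorical DataFrame from above):

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # SciPy sparse matrix
housing_cat_1hot.toarray()   # convert to a dense NumPy array if you really need one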
Prepare the Data for Machine Learning Algorithm
Custom Transformers
You will need to write your own transformers for tasks such as custom cleanup operations or combining specific attributes.
To make custom transformers work seamlessly with Scikit-Learn functionalities (such as pipelines), you need to implement fit() (returning self), transform(), and fit_transform().
You get fit_transform() for free by adding TransformerMixin as a base class.
You enable automatic hyperparameter tuning by adding BaseEstimator as another base class, which provides get_params() and set_params().
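A sketch of such a transformer along the lines described above (the column indices assume the numerical attributes are laid out as in the housing dataset):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column indices into the NumPy array of attributes (assumed layout)
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):   # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self                                   # nothing to learn
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)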
We create one hyperparameter, add_bedrooms_per_room, set to True by default
It is good to provide sensible defaults
Hyperparameters are parameters of the learning algorithm itself, not parameters learned from the data
You can add a hyperparameter to gate any data preparation step that you are not 100% sure about
The more you automate, the more likely you are to find a great combination, saving you a lot of time
Prepare the Data for Machine Learning Algorithm
Feature Scaling
Most Machine Learning algorithms don't perform well when the numerical input attributes have very different scales.
Feature scaling needs to be applied to the inputs to fix such differences: total_rooms ranges from 6 to 39,320 while median_income only ranges from 0 to 15
Scaling is generally not required for the output/target values
There are two common ways to scale features:
- min-max scaling
- Standardization
Min-Max Scaling (Normalization)
Values are simply shifted and rescaled so that they end up ranging from 0 to 1.
We do that by computing (x - min(X)) / (max(X) - min(X))
Scikit-Learn provides a MinMaxScaler transformer, with a feature_range hyperparameter that lets you change the output range
Standardization
We do that by computing (x - mean(X)) / std(X)
- The resulting values always have zero mean
- The resulting distribution has unit variance (std = 1)
- Standardization is much less affected by outliers
- It does not bound values to a specific range (sometimes a problem, e.g., neural networks often expect inputs ranging from 0 to 1)
- Scikit-Learn provides a StandardScaler transformer
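A sketch of both scalers (housing_num is the numerical-attributes DataFrame from the imputation step; in practice run these after imputation, e.g. inside a pipeline, since the scalers reject missing values):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))   # (0, 1) is the default range
housing_minmax = min_max_scaler.fit_transform(housing_tr)

std_scaler = StandardScaler()
housing_std = std_scaler.fit_transform(housing_tr)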
Prepare the Data for Machine Learning Algorithm
Transformation Pipelines
As we have seen, there are many data transformation steps that need to be executed in the right order (imputer, attributes adder, standard scaler)
The Pipeline constructor takes a list of (name, transformer) tuples defining a sequence of steps
All steps but the last must be transformers (i.e., they must have a fit_transform() method)
When the pipeline's fit() method is called, all the transformers' fit_transform() methods are called sequentially, passing the output of each call as the input to the next
The final estimator just gets its fit() method called
The pipeline exposes the same methods as its final estimator
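A sketch of a numerical pipeline chaining the three steps just mentioned (CombinedAttributesAdder is the custom transformer sketched earlier):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)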
Scikit-Learn provide ColumnTransformer to handle numerical columns separately from categorical columns
- The constructor requires a list of tuples (name, transformer or "drop" or "passthrough", list of column names or indices) specifying which columns each transformer should apply to
- Each transformer must return the same number of rows, but it may return a different number of columns
- When the result mixes sparse and dense matrices, ColumnTransformer estimates the density of the final matrix (the ratio of non-zero cells)
- It returns a sparse matrix if the density is below sparse_threshold (default 0.3), and a dense matrix otherwise
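A sketch of a full pipeline combining both column types (num_pipeline is the pipeline sketched above):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)      # the numerical column names
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)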
Select and Train a Model
We framed the problem
Got the data and explored it
Sampled training and test sets
Wrote transformation pipelines to clean up and prepare the data
And now we are ready to select and train a model
Select and Train a Model
Training and Evaluating on the Training Set
Thanks to all the previous steps, things are much simpler than you might think
Training a first Linear Regression model takes just three lines of code:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Done! Let's try it out
Select and Train a Model
Training and Evaluating on the Training Set
It works, but the predictions are not exactly accurate
Let's measure the RMSE on the whole training set
Scikit-Learn provides a mean_squared_error function
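A sketch of the RMSE computation:

import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)   # about 68,628 on this data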
Select and Train a Model
Training and Evaluating on the Training Set
Most districts’ median_housing_values range between $120,000 and $265,000
A prediction error of $68,628 is not very satisfying
But what exactly is the problem?
Underfitting
- The features do not provide enough information to make good predictions
- The model is not powerful enough
But first, let's try a more complex model
Select and Train a Model
Training and Evaluating on the Training Set
DecisionTreeRegressor is a powerful model, capable of finding complex nonlinear relationships in the data
Let's train the model and evaluate it on the training set
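A sketch of training and evaluating it, mirroring the Linear Regression steps:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
print(tree_rmse)   # 0.0 on the training set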
No error at all? Could this model really be absolutely perfect?
It is much more likely that the model has badly overfit the data
How can we make sure, without touching the test set?
- Use part of training set for training
- And part for model Validation
Select and Train a Model
Better Evaluation Using Cross-Validation
One way to evaluate the Decision Tree would be to use the train_test_split function:
Split the training set into a smaller training set and a validation set
Then train your models against the smaller training set and evaluate them against the validation set
It's a bit of work, but nothing too difficult, and it would work fairly well
An alternative is to use Scikit-Learn's K-fold cross-validation feature:
It randomly splits the training set into 10 distinct subsets called folds
Then it trains the model on 9 folds and validates on the remaining 1, rotating through all 10 folds
The result is an array of 10 evaluation scores
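A sketch of 10-fold cross-validation on the tree trained above (cross_val_score expects a utility function, so the scoring is negative MSE):

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # negate before taking the square root
print(tree_rmse_scores.mean(), tree_rmse_scores.std())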
Select and Train a Model
Better Evaluation Using Cross-Validation
The Decision Tree looks worse than the Linear Regression model!
Cross-validation gives you not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation)
The Decision Tree scores approximately 71,407, generally ±2,439
Cross-validation comes at the cost of training the model several times, so it is not always feasible
Comparing the mean errors confirms that the Decision Tree is overfitting badly
Select and Train a Model
Better Evaluation Using Cross-Validation
Let's try one last model: RandomForestRegressor
Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions
Ensemble Learning: building a model on top of many other models
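A sketch of training and cross-validating it:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)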
Select and Train a Model
Better Evaluation Using Cross-Validation
Findings
Random Forests look very promising
However, the score on the training set is still much lower than on the validation sets
So the model is still overfitting the training set
Overfitting
- Simplify the model or constrain it (i.e., regularize it)
- Get more training data
Select and Train a Model
Better Evaluation Using Cross-Validation
Try out other models from various categories, without spending too much time tweaking the hyperparameters
The goal is to shortlist the promising models
Select and Train a Model
Better Evaluation Using Cross-Validation
You should save every model you experiment with,
saving both the hyperparameters and the trained parameters
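A sketch using joblib (the file name is arbitrary):

import joblib

joblib.dump(forest_reg, "forest_reg.pkl")       # hyperparameters + trained parameters
# ... later, possibly in another session:
forest_reg_restored = joblib.load("forest_reg.pkl")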
Fine-Tune Your Model
We now have a shortlist of promising models.
We need to fine-tune them.
Let’s look at a few ways you can do that.
Fine-Tune Your Model
Grid Search
You could fiddle with the hyperparameters manually until you find a great combination of hyperparameter values
That is tedious work, and you may not have time to explore many combinations
Scikit-Learn's GridSearchCV can do the search for you:
You tell it which hyperparameters you want it to experiment with
and which values to try out,
and it will evaluate all the possible combinations of hyperparameter values, using cross-validation
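A sketch of a grid search over the Random Forest (the particular grid values are just an example):

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)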
Fine-Tune Your Model
Grid Search
GridSearchCV is initialized with refit=True by default:
once it finds the best estimator, it retrains it on the whole training set
That is a good idea, as feeding the model more data will likely improve its performance
Fine-Tune Your Model
Randomized Search
When the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV
It evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration
This approach has two main benefits:
- If you let the search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach)
- You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
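A sketch of a randomized search (the distributions here are illustrative):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error')
rnd_search.fit(housing_prepared, housing_labels)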
Fine-Tune Your Model
Analyze the Best Models and Their Errors
We can gain good insights into the problem by inspecting the best models
RandomForestRegressor can report the relative importance of each attribute for making accurate predictions:
>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
Fine-Tune Your Model
Analyze the Best Models and Their Errors
We may want to try dropping some of the less useful features (e.g., ocean_proximity)
We should also look at the specific errors that our system makes,
try to understand why it makes them, and what could fix the problem:
adding extra features, getting rid of uninformative ones, cleaning up outliers, etc.
Fine-Tune Your Model
Evaluate Your System on the Test Set
It is time to evaluate the final model on the test set.
Get the predictors and the labels from your test set
Run your full_pipeline to transform the data
Call transform(), not fit_transform(): you do not want to fit the test set!
Evaluate the final model on the test set
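A sketch of the final evaluation (strat_test_set is the test set that was put aside earlier):

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)   # transform(), NOT fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))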
Fine-Tune Your Model
Evaluate Your System on the Test Set
We want to have an idea of how precise this estimate is.
For this, we can compute a 95% confidence interval for the generalization error using scipy.stats.t.interval():
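A sketch of the interval computation on the squared errors of the final predictions from the previous step:

import numpy as np
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))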
If you did a lot of hyperparameter tuning, the performance will usually be slightly worse than what you measured using cross-validation
When this happens, you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set:
any improvement would be unlikely to generalize to new data