In Chapter 3 we construct a spam filter based on the data in the folder:
ML_for_Hackers/03-Classification/data/spam
In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:
head(spam.df[with(spam.df, order(-occurrence)),])
| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 2122 | html | 377 | 0.005665595 | 0.338 |
| 538 | body | 324 | 0.004869105 | 0.298 |
| 4313 | table | 1182 | 0.017763217 | 0.284 |
| 1435 | email | 661 | 0.009933576 | 0.262 |
| 1736 | font | 867 | 0.013029365 | 0.262 |
| 1942 | head | 254 | 0.003817138 | 0.246 |
When I run the code directly, the output does not match the book; email is at the top instead:
| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 7781 | email | 813 | 0.005853680 | 0.566 |
| 18809 | please | 425 | 0.003060042 | 0.508 |
| 14720 | list | 409 | 0.002944840 | 0.444 |
| 27309 | will | 828 | 0.005961681 | 0.422 |
| 3060 | body | 379 | 0.002728837 | 0.408 |
| 9457 | free | 539 | 0.003880853 | 0.390 |
This seems to be explained by how the document vectors are processed with the `removePunctuation` setting. Punctuation is removed without leaving a separator, so terms that were divided only by punctuation are joined into a single new term. For example, `<html><head>` becomes `htmlhead`. As a result, instead of html being counted as a common term across many of the emails, we get many low-frequency combinations of html with other HTML tag keywords.
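
The merging effect described above can be reproduced in a few lines of base R. This is only a sketch: it assumes that `removePunctuation` behaves roughly like deleting every run of `[[:punct:]]` characters, and the helper names `strip_punct` and `split_punct` as well as the sample string are made up for illustration, not taken from the book's code.

```r
# Assumption: approximates removePunctuation by deleting punctuation outright.
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

# Hypothetical contrast: replace punctuation with a space instead of deleting it.
split_punct <- function(x) gsub("[[:punct:]]+", " ", x)

# Made-up snippet of HTML markup, similar to what appears in the spam emails.
raw <- "<html><head><title>Offer</title></head><body>"

# Deleting punctuation fuses adjacent tag names into one long term.
merged <- strip_punct(raw)
print(merged)
# [1] "htmlheadtitleOffertitleheadbody"

# Replacing punctuation with whitespace keeps the tag names as separate terms.
tokens <- strsplit(trimws(split_punct(raw)), "\\s+")[[1]]
print(tokens)
# [1] "html"  "head"  "title" "Offer" "title" "head"  "body"
```

With the first variant, html never appears as a term of its own for this input, which would be consistent with html dropping out of the top of the occurrence table.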