Chapter 3: Contents of spam.df don't match output in book #35

ChrisHowlin · 2015-10-17T15:54:17Z

In Chapter 3 we construct a spam filter based on the data in the folder:

ML_for_Hackers/03-Classification/data/spam

In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:

head(spam.df[with(spam.df, order(-occurrence)),])

	term	frequency	density	occurrence
2122	html	377	0.005665595	0.338
538	body	324	0.004869105	0.298
4313	table	1182	0.017763217	0.284
1435	email	661	0.009933576	0.262
1736	font	867	0.013029365	0.262
1942	head	254	0.003817138	0.246

When I run the code directly, my output does not match the book; I get email at the top instead:

	term	frequency	density	occurrence
7781	email	813	0.005853680	0.566
18809	please	425	0.003060042	0.508
14720	list	409	0.002944840	0.444
27309	will	828	0.005961681	0.422
3060	body	379	0.002728837	0.408
9457	free	539	0.003880853	0.390

This seems to be explained by the way the document vectors are processed with the removePunctuation setting. Punctuation is removed, and any terms it separated are merged into a new term. For example, <html><head> becomes htmlhead. The result is that instead of html being listed as a common term in many of the emails, we get lots of low-frequency combinations of html with other HTML tag keywords.
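To illustrate the merging effect, here is a minimal base-R sketch. It does not call tm itself; it assumes removePunctuation simply deletes punctuation characters, and mimics that with gsub. The sample string and the space-replacement variant are made up for illustration and are not taken from the book's code.

```r
# Base-R stand-in for deleting punctuation from a raw HTML snippet
# (assumption: punctuation is dropped outright, not replaced by a space).
raw <- "<html><head><title>Offer</title></head>"

# Deleting punctuation fuses adjacent tags into one long token.
deleted <- gsub("[[:punct:]]+", "", raw)

# Hypothetical alternative: replace punctuation with a space so that
# each tag name survives as its own term.
spaced <- gsub("[[:punct:]]+", " ", raw)
tokens <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]

print(deleted)
print(tokens)
```

If something like the space-replacement variant were applied before building the term-document matrix, html would be counted as its own term again, though I have not checked whether that alone reproduces the book's table.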

IbrahimZamit · 2016-05-08T17:12:25Z

@ChrisHowlin and what seems to be the solution for this issue, so that we can obtain the same results as the book?

NumberOne925 · 2017-12-11T00:07:45Z

Do you guys know why I get different results than in the book? Why is this happening?

pythonandr · 2017-12-28T07:26:56Z

@NumberOne925
I got the same result as you.
I think it is normal.
