Add multilingual spam classifier #33
base: master
Conversation
chunk = chunk[KEPT_FIELDS].dropna()
chunk_spams = chunk[chunk["spam"] == True]
chunk_spams["description"] = chunk_spams["description"].map(parse_description)
Using .map() here could be parallelized to make use of multithreading or multiprocessing.
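For concreteness, a minimal sketch of that suggestion with a multiprocessing pool (illustrative only: the helper name parallel_map and the worker count are assumptions, and parse_description must be a top-level, picklable function for a process pool to work):

import pandas as pd
from multiprocessing import Pool, cpu_count

def parallel_map(series: pd.Series, func, n_workers: int = cpu_count()) -> pd.Series:
    # Apply `func` to each element of a Series using a pool of worker processes,
    # then rebuild a Series with the original index so assignment stays aligned.
    with Pool(processes=n_workers) as pool:
        results = pool.map(func, series.tolist())
    return pd.Series(results, index=series.index)

# Possible drop-in replacement for the .map() call above (on platforms that
# spawn processes, this must run under `if __name__ == "__main__":`):
# chunk_spams["description"] = parallel_map(chunk_spams["description"], parse_description)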
1. Go to the Zenodo Open Metadata record at <https://doi.org/10.5281/zenodo.787062> to access all dataset versions.
We can keep the reference to the dataset DOI, to make it easier to get hold of the (test) data for training.
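As an illustration, a rough sketch of listing a record's files through Zenodo's REST API (this is an assumption about the API layout, not code from the PR; the record id 787062 is read off the DOI above and should resolve to the latest version):

import requests

record = requests.get("https://zenodo.org/api/records/787062", timeout=30).json()
for f in record.get("files", []):
    # Field names follow Zenodo's public JSON API; adjust if the schema differs.
    print(f.get("key"), f.get("links", {}).get("self"))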
src/utils/datasets_utils.py (Outdated)
    Returns:
        pd.DataFrame, pd.DataFrame: Train and test sets.
    """
    dataset = dataset.sample(frac=1, random_state=SEED).reset_index(drop=True)
This needs a closer look to keep the ham/spam distribution in the split sets.
Oh yes! This is indeed not correct. It's not keeping exactly the same distribution. We noticed it when pushing the code to our GitHub classroom repository and forgot to push the change to this one.
Here is the correct version of the function:
def split_train_test(
    dataset: pd.DataFrame, test_size=0.2
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split the dataset into train and test. It will keep exactly the same
    distribution of classes in both train and test.

    Args:
        dataset (pd.DataFrame): Dataset to split.
        test_size (float, optional): Percentage of the dataset to use for the
            test set.

    Returns:
        pd.DataFrame, pd.DataFrame: Train and test sets.
    """
    spams = dataset[dataset["label"] == 1]
    hams = dataset[dataset["label"] == 0]
    spams = spams.sample(frac=1, random_state=SEED)
    hams = hams.sample(frac=1, random_state=SEED)
    spams_train = spams[: int(len(spams) * (1 - test_size))]
    spams_test = spams[int(len(spams) * (1 - test_size)) :]
    hams_train = hams[: int(len(hams) * (1 - test_size))]
    hams_test = hams[int(len(hams) * (1 - test_size)) :]
    train = pd.concat([spams_train, hams_train])
    test = pd.concat([spams_test, hams_test])
    train = train.sample(frac=1, random_state=SEED)
    test = test.sample(frac=1, random_state=SEED)
    return train, test
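As a side note, an equivalent stratified split could also be delegated to scikit-learn (a sketch only, assuming scikit-learn is available in the project; SEED as defined in this module):

from sklearn.model_selection import train_test_split

# `stratify` preserves the ham/spam ratio in both resulting frames,
# which is what the hand-rolled version above does by hand.
train, test = train_test_split(
    dataset, test_size=0.2, stratify=dataset["label"], random_state=SEED
)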
While pushing to our GitHub classroom repository, we noticed some minor issues that we fixed. We forgot to push them to your repository. I just pushed a new commit with those fixes. This commit includes 4 small changes:
We've tested this last commit and it doesn't change the results (accuracy, F1-score, ...).
This PR adds the production code for the multilingual spam classifier. The new README explains the scripts and should probably be reviewed first to understand the rest:

- ./src,
- requirements.txt to reflect the changes,
- README.md to reflect the change and explain how to train and use,
- Makefile to reflect the changes,
- .gitignore.

This PR is based on my previous one (#32) to facilitate the merging. If that previous PR isn't going to be merged, let me know, and I will remove it from this one.