doc: Update Readme.
andreassot10 committed May 11, 2021
1 parent 5abb316 commit f93025f
Showing 2 changed files with 20 additions and 73 deletions.
93 changes: 20 additions & 73 deletions README.md
@@ -1,93 +1,40 @@
# positive_about_change_text_mining

## Project description
Nottinghamshire Healthcare NHS Foundation Trust hold patient feedback that is currently manually labelled by our "coders" (i.e. the staff who read the feedback and decide what it is about). As we hold thousands of patient feedback records, we (the [Data Science team](https://cdu-data-science-team.github.io/team-blog/about.html)) are running this project to aid the coders with a text classification pipeline that will semi-automate the labelling process. We are also working in partnership with other NHS trusts who hold patient feedback text. Read more [here](https://involve.nottshc.nhs.uk/blog/new-nhs-england-funded-project-in-our-team-developing-text-mining-algorithms-for-patient-feedback-data/) and [here](https://cdu-data-science-team.github.io/team-blog/posts/2020-12-14-classification-of-patient-feedback/).

This project will build and benchmark a number of text classification models using state-of-the-art Machine Learning (ML) packages in [`Python`](https://www.python.org/) and [`R`](https://www.r-project.org/). The final products will be the following:

1. An interactive dashboard that will make the findings of complex ML models accessible to non-technical audiences.
2. Open-source code that other NHS trusts will be able to use for analysing their own patient feedback records.

__We are working openly by open-sourcing the analysis code and data where possible to promote replication, reproducibility and further developments (pull requests are more than welcome!). We are also automating common steps in our workflow by shipping the pipeline as a [`Python`](https://www.python.org/) package broken down into sub-modules and helper functions to improve usability and documentation.__
## Pipeline

## Technical
A few avenues are currently being explored with `R` and `Python`:
The pipeline is built with `Python`'s [`Scikit-learn`](https://scikit-learn.org/stable/index.html) (Pedregosa et al., 2011) with Machine Learning models that are able to efficiently handle large sparse matrices ("bag-of-words" approach). The pipeline performs a random grid search ([`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)) to identify the best-performing learner and (hyper)parameter values. The process also involves a few pre- and post-fitting steps:

1. Python's [`scikit-learn`](https://scikit-learn.org/stable/index.html). At this point, the script benchmarks ML models able to efficiently handle large sparse matrices ("bag-of-words" approach). We intend to expand this to other approaches, e.g. [`Keras`](https://keras.io/)/[`BERT`](https://pypi.org/project/keras-bert/), [`spaCy`](https://spacy.io/), [`TextBlob`](https://textblob.readthedocs.io/en/dev/quickstart.html#words-inflection-and-lemmatization) etc.
2. Benchmarking of different algorithms with R package [`mlr3`](https://github.com/mlr-org/mlr3).
3. Facebook's [StarSpace](https://github.com/facebookresearch/StarSpace) with R package [`ruimtehol`](https://github.com/bnosac/ruimtehol).
4. [`Quanteda`'s](https://quanteda.io/index.html) implementation of Multinomial Naive Bayes (https://tutorials.quanteda.io/machine-learning/nb/).
1. Data load and split into training and test sets ([`factory_data_load_and_split.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/develop/factories/factory_data_load_and_split.py)).

The data is [here](https://github.com/ChrisBeeley/naturallanguageprocessing/blob/master/cleanData.Rdata) and that's where it will stay until GitHub stops crashing when I try to upload it to this project!
2. Text pre-processing (e.g. remove special characters, whitespaces and line breaks) and tokenization, token lemmatization, calculation of Term Frequency–Inverse Document Frequencies (TF-IDFs), up-balancing of rare classes, feature selection, pipeline training and learner benchmarking ([`factory_pipeline.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/develop/factories/factory_pipeline.py)).

### Preliminary findings
3. Evaluation of pipeline performance on test set, production of evaluation metrics (Accuracy score, [Class Balance Accuracy](https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd) (Mosley, 2013), [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) (Guyon et al., 2015, Kelleher et al., 2015) or [Matthews Correlation Coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html) (Baldi et al., 2000, Matthews, 1975)) and plots, and fitting of best performer on whole dataset ([`factory_model_performance.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/develop/factories/factory_model_performance.py)).

The learners in `Python` are immensely more efficient than their `R` counterparts and building pipelines with `scikit-learn` is pretty straightforward. Moreover, `Python` offers a much wider range of options for text preprocessing and mining.
4. Writing the results: fitted pipeline, tuning results, predictions, accuracy per class, model comparison bar plot, training data index, and test data index ([`factory_write_results.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/develop/factories/factory_write_results.py)).
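Taken together, the four numbered steps above boil down to a fairly standard `Scikit-learn` workflow. The sketch below is illustrative only: the file name, column names and (hyper)parameter values are made up for the example, and the real implementations live in the linked `factory_*.py` modules.

```python
# Minimal, illustrative sketch of the four factory steps above.
# Assumption: a CSV with a free-text column "feedback" and a label column "label".
import joblib
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def spacy_lemmatizer(text):
    """Tokenize and lemmatize with spaCy (the pipeline can also use NLTK/Wordnet)."""
    return [token.lemma_.lower() for token in nlp(text) if not token.is_space]

# 1. Data load and split into training and test sets
df = pd.read_csv("datasets/patient_feedback.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["feedback"], df["label"], test_size=0.33, stratify=df["label"], random_state=42
)

# 2. Pre-processing, TF-IDFs and a learner in one pipeline, tuned with a random search
#    (the real pipeline also up-balances rare classes and performs feature selection)
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=spacy_lemmatizer)),
    ("clf", SGDClassifier()),  # a linear model standing in for the benchmarked learners
])
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3, 1e-2],
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=5,
                            scoring="balanced_accuracy", n_jobs=-1)
search.fit(X_train, y_train)

# 3. Evaluation of pipeline performance on the test set
preds = search.predict(X_test)
print("Balanced accuracy:", balanced_accuracy_score(y_test, preds))
print("Matthews correlation:", matthews_corrcoef(y_test, preds))

# 4. Write the results (e.g. the fitted pipeline as a SAV file in the results folder)
joblib.dump(search.best_estimator_, "results/pipeline_label.sav")
```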

#### A first pipeline
For a starter, we built a simple pipeline with learners that can efficiently handle large sparse matrices ("bag-of-words" approach). We used an exhaustive grid search ([`GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) to assess the impact of different (hyper)parameter combinations on model performance. (For efficiency, we switched to [`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) later on.)
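For illustration only (this is not the project's actual grid), an exhaustive grid search multiplies up quickly even over a small, made-up (hyper)parameter grid:

```python
# Illustrative exhaustive grid search; the project's real grids live in factory_pipeline.py.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__min_df": [1, 5, 10],
    "clf__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
}
# Every combination is refitted for every fold: 3 * 3 * 4 combinations * 5 folds = 180 fits,
# which is why exhaustive search becomes expensive as the grid grows.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="balanced_accuracy", n_jobs=-1)
# search.fit(X_train, y_train)
```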
There are a few helper functions and classes available in the [helpers](https://github.com/CDU-data-science-team/positive_about_change_text_mining/tree/develop/helpers) folder that the aforementioned factories make use of.

The pipeline (as well as subsequent pipelines) does some preprocessing (text tokenization/lemmatization, word frequencies etc.) and benchmarks learners with 5-fold cross-validation, using either a score that is appropriate for imbalanced datasets ([Class Balance Accuracy](https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd), [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) or [Matthews Correlation Coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html)) or the standard Accuracy score (the proportion of records whose class is correctly predicted). Fitting the pipeline with Class Balance Accuracy as the scorer, we find that the best model is a Linear SVC classifier:
The factories are brought together in a single function [`text_classification_pipeline.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/tree/develop/pipelines) that runs the whole process. This function can be run in a user-made script such as [`test.py`](https://github.com/CDU-data-science-team/positive_about_change_text_mining/tree/develop/execution). The text dataset is loaded either as CSV from folder [datasets](https://github.com/CDU-data-science-team/positive_about_change_text_mining/tree/develop/datasets) or is loaded directly from the database. The former practice is _not_ recommended, because `Excel` can cause all sorts of issues with text encodings. The [results](https://github.com/CDU-data-science-team/positive_about_change_text_mining/tree/develop/results) folder always contains a SAV of the fitted model and a PNG of the learner comparison bar plot. Results in tabular form are typically saved in the database, unless the user chooses to write them as CSV files in the "results" folder. All results files have a "_target_variable_name" suffix, for example "tuning_results_label" if the dependent variable is `label`.
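As a rough illustration, a `test.py`-style driver could look like the sketch below. The module and function names come from the folders linked above, but the argument names are assumptions rather than the real signature; see `pipelines/text_classification_pipeline.py` for the actual interface.

```python
# Hypothetical driver script: the argument names below are assumptions, not the
# real signature of pipelines/text_classification_pipeline.py.
from pipelines.text_classification_pipeline import text_classification_pipeline

text_classification_pipeline(
    filename="datasets/patient_feedback.csv",  # or read directly from the database
    target="label",                            # results files then carry a "_label" suffix
    n_iter=100,                                # random-search repetitions
    write_csv=False,                           # keep tabular results in the database
)
```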

![](p_compare_models_bar_first_pipeline.png)
Here is a visual display of the process:

The optimal (hyper)parameter values for the best model and rest of learners, as well as other metrics (fit time, scores per cross-validation fold etc.) are in [tuning_results_first_pipeline.csv](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/master/tuning_results_first_pipeline.csv).
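Those columns are essentially `Scikit-learn`'s `cv_results_` attribute. Here is a sketch of how such a file, and a quick comparison bar plot, can be produced, assuming a fitted `GridSearchCV`/`RandomizedSearchCV` object called `search` as in the sketches above:

```python
# `search` is assumed to be an already fitted GridSearchCV/RandomizedSearchCV object.
import pandas as pd

tuning_results = pd.DataFrame(search.cv_results_)  # fit times, per-fold scores, parameters
tuning_results.to_csv("tuning_results_first_pipeline.csv", index=False)

# Quick bar plot of the mean cross-validation score for each candidate
ax = tuning_results["mean_test_score"].plot.bar()
ax.set_ylabel("mean CV score")
ax.figure.savefig("p_compare_models_bar_first_pipeline.png")
```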
![](text_classification_package_structure.png)

**A few things to note:**
## More to come...
More to come...

1. We used a custom tokenizer in [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) that allows the user to choose between `spaCy` and `Wordnet` (see [`NLTK`](https://www.nltk.org/)) for tokenization and lemmatization. The `spaCy` algorithm is faster and also led to marginally higher classifier performance. There is therefore no reason to have the pipeline switch between `spaCy` and `Wordnet`, so future improvements to the pipeline will have `spaCy`'s tokenizer/lemmatizer as the default.
2. More often than not, we ran into convergence issues with [`LinearSVC()`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), even with `max_iter=10000`. There is an ongoing discussion [here](https://github.com/scikit-learn/scikit-learn/issues/11536) (see _hermidalc_'s comment on 20 April 2020). As a safety measure, we will not be considering this learner in subsequent runs.
3. We may need to reconsider the performance metric. Unlike scorers that account for class imbalances, the Accuracy score is simple and easily communicated. The downside is that it may be inflated by a few classes that the model predicts correctly most of the time. But it is perfectly suited to situations where the aim is to correctly predict the tags for as many feedback records as possible, regardless of their tag. We could combine this scorer with _human-in-the-loop_ ML. Something to think about.
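Of the scorers mentioned above, Class Balance Accuracy is the only one not shipped with `Scikit-learn`, so it has to be supplied as a custom scorer. A minimal sketch, based on our reading of Mosley (2013), in which each class contributes its diagonal confusion-matrix count divided by the larger of its row and column totals:

```python
# Sketch of Class Balance Accuracy (Mosley, 2013) as a scikit-learn scorer:
# per class, diagonal count / max(row total, column total), averaged over classes.
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

def class_balance_accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), cm.sum(axis=0))
    return per_class.mean()

cba_scorer = make_scorer(class_balance_accuracy)
# e.g. RandomizedSearchCV(pipe, param_distributions, scoring=cba_scorer, cv=5)
```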
## References
Baldi P., Brunak S., Chauvin Y., Andersen C.A.F. & Nielsen H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. _Bioinformatics_ 16(5):412–424.

#### Improving the pipeline
Guyon I., Bennett K., Cawley G., Escalante H.J., Escalera S., Ho T.K., Macià N., Ray B., Saeed M., Statnikov A.R. & Viegas E. (2015). [Design of the 2015 ChaLearn AutoML Challenge](https://ieeexplore.ieee.org/document/7280767). International Joint Conference on Neural Networks (IJCNN).

A major disadvantage of the grid search ([`GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) approach for pipeline tuning used in the first pipeline is that it takes hours to fit, without necessarily resulting in a model with notably better performance than the rest.
Kelleher J.D., Mac Namee B. & D’Arcy A. (2015). [Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies](https://mitpress.mit.edu/books/fundamentals-machine-learning-predictive-data-analytics). MIT Press.

We therefore switched to a random (or randomized) search ([`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)). With random search, the algorithm randomly chooses (hyper)parameter combinations; the number of combinations to try is set by the user. This significantly reduces tuning time with minimal impact on model performance. See a comparison of grid search and random search [here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py) and [here](https://jmlr.csail.mit.edu/papers/v13/bergstra12a.html).
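For illustration (again, not the project's actual search space), the switch amounts to swapping `GridSearchCV()` for `RandomizedSearchCV()`, optionally replacing lists of values with distributions to sample from, and setting `n_iter`:

```python
# Illustrative random search: n_iter controls how many (hyper)parameter combinations
# are sampled, instead of fitting every combination exhaustively.
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],  # lists are sampled uniformly
    "clf__alpha": loguniform(1e-5, 1e-1),            # distributions are sampled directly
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=100, cv=5,
                            scoring="balanced_accuracy", n_jobs=-1, random_state=42)
# search.fit(X_train, y_train)
```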
Matthews B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. _Biochimica et Biophysica Acta (BBA) - Protein Structure_ 405(2):442–451.

Here are the results for a random search with 100 repetitions, using `spaCy` for tokenization and lemmatization, and with `LinearSVC()` switched off ([see earlier comments](#a-first-pipeline)):

![](p_compare_models_bar_second_pipeline.png)

Performance metrics for the optimal and all other learners are [here](https://github.com/CDU-data-science-team/positive_about_change_text_mining/blob/andreas/tuning_results_class_balance_accuracy_second_pipeline.csv).

#### `R` is not a good option
We soon concluded that ML pipelines built in `R` would be incomplete and inefficient:

1. **`mlr3`.** The text tokenizer pipe operator in `mlr3pipelines` is [slow](https://github.com/mlr-org/mlr3pipelines/issues/511) and the readily available models in `mlr3learners` and `mlr3extralearners` are very inefficient with sparse matrices.
2. **`ruimtehol`.** The accuracy of the StarSpace model does not exceed 59% in either supervised or semi-supervised settings.
3. **`Quanteda`.** The Multinomial Naive Bayes model in `quanteda.textmodels` is extremely fast. However, `Quanteda` does not (yet?) offer options for a fully automated pipeline that would deal with issues such as train-test data leakage etc.

Therefore, the `R` scripts are, and will probably remain, experimental, so don't be surprised if chunks of the code contain errors, are cryptic or don't work at all, or if the models aren't appropriate or don't perform _that_ great.
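On the train-test leakage point raised for `Quanteda` above: in the `Python` pipeline this is handled by keeping the vectorizer inside the `Pipeline`, so that cross-validation re-fits it on each training fold only. A minimal illustration:

```python
# Because the TF-IDF vectorizer sits inside the Pipeline, cross_val_score re-fits it on
# each training fold, so nothing from the held-out fold leaks into the fitted vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
# scores = cross_val_score(pipe, texts, labels, cv=5, scoring="balanced_accuracy")
```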

Further details in the Appendix.

## Appendix
### `mlr3` (R)
The pipeline performs data pre-processing (e.g. one-hot encoding of dates, if a date column is used; text tokenization and word frequencies; etc.) and then benchmarks a number of classification algorithms. The following algorithms were considered:

| Model | Issues | Verdict |
| :------------- | :---------- | ----------- |
| Generalized Linear Models with Elastic Net (GLM NET) | Something goes wrong inside the pipeline. It seems like it gets confused because GLM NET drops out irrelevant features during training, so the pipeline throws an error when it finds these variables in the test set. | Investigate issue and consider implementing the model. |
| Naive Bayes | `mlr3learners` implements `e1071::naiveBayes` which can be [terribly slow](https://stackoverflow.com/questions/54427001/naive-bayes-in-quanteda-vs-caret-wildly-different-results) with sparse data like text data. I may try to add `quanteda.textmodels::textmodel_nb` to `mlr3extralearners`, because it is a freakishly fast multinomial Naive Bayes model that is designed for text data. | Don't implement, **unless** I manage to add `quanteda.textmodels::textmodel_nb` to `mlr3extralearners`. Alternatively, some fast implementation of a multinomial or kernel-based Naive Bayes model may be a reasonable alternative? |
| Random Forest | It appears that the use of Random Forest with sparse data can be problematic. See [this video](https://www.youtube.com/watch?v=Sz8RB_fPYOk) (54' 10'') and [this resource](https://stats.stackexchange.com/questions/28828/is-there-a-random-forest-implementation-that-works-well-with-very-sparse-data). | Don't implement. |
| XGBoost | The most popular boosted tree algorithm nowadays. Can handle sparse data. | Implement. |
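The table above is about the `R`/`mlr3` learners. For comparison, these sparse-data concerns largely disappear in `Python`: for example, `Scikit-learn`'s multinomial Naive Bayes and the `xgboost` package both accept sparse TF-IDF matrices directly. A small sketch with toy data:

```python
# Both learners below accept scipy.sparse matrices, so a TF-IDF "bag of words"
# can be passed in without densifying it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier  # assumes the xgboost package is installed

texts = ["food was cold", "staff were lovely", "waited too long"]

X = TfidfVectorizer().fit_transform(texts)  # scipy.sparse CSR matrix
MultinomialNB().fit(X, ["food", "staff", "waiting"])
XGBClassifier().fit(X, [0, 1, 2])           # xgboost expects numeric labels
```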

The script that runs the whole process (from data loading and prep to model benchmarking and results evaluation) is `mlr3_run_pipeline.R` and consists of four lines of code. Run each line of this code individually to familiarize yourselves with the process.

As a starter, the answers to the prompts in `mlr3_prepare_test_and_training_tasks.R` should be as follows:

1. pipeline_data
2. nfspf
3. super
4. 0.67

The answers to the prompts in `mlr3_pipeline_optimal_defaults.R` should be as follows:

1. cv
2. 2
3. classif.mbrier
4. 3

You can always change these values, but note that more CV folds and evaluations would mean more computation time and memory usage.

### StarSpace `ruimtehol` (R)
As a starter, script `starspace.R` prepares the data in the appropriate format and builds a simple supervised model from which embeddings and other useful information (e.g. word clouds for each tag) can be extracted. The script also produces a rough model accuracy metric with the test data, as well as a T-SNE plot to visually assess how well the model performs on unseen data.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). [Scikit-learn: Machine Learning in Python](https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html). _Journal of Machine Learning Research_ 12:2825–2830.
Binary file added text_classification_package_structure.png
