Here I attempt to reproduce the results of "A first perturbome of Pseudomonas aeruginosa: Identification of core genes related to multiple perturbations by a machine learning approach" using the Tidymodels framework instead of caret.
The original code for the paper can be found here: Molina-Mora et al., 2021
In the paper, three models were built to identify the top genes: a Random Forest, a Support Vector Machine (SVM), and a k-nearest neighbors (KNN) classifier.
Rather than caret::varImp, Tidymodels typically relies on the vip package to calculate variable importance. vip can compute model-specific feature importance, which I used for the random forest model.
For models such as the SVM and KNN, vip also offers model-agnostic methods. Here I used FIRM, the variance-based measure described by Greenwell et al. (2018).
The large number of features in this dataset can make these calculations slow.
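
To make the idea behind FIRM concrete, here is a minimal base-R sketch: for each feature, compute its partial dependence curve and use the standard deviation of that curve as the importance score. This is not the vip implementation and not the code used in this repository. The simulated data, the logistic regression model, and the helper names `partial_dependence` and `firm_score` are all assumptions made for illustration.

```r
# Simulated data (an assumption for illustration): x1 has a strong effect,
# x2 a weak one, and x3 none.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x1 + 0.5 * dat$x2))

# Any model with a predict() method would do; logistic regression keeps
# the sketch free of extra packages.
fit <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

# Partial dependence of one feature: set it to each grid value for every
# row, predict, and average the predicted probabilities.
partial_dependence <- function(model, data, feature, grid_size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  vapply(grid, function(value) {
    data[[feature]] <- value  # modifies a local copy only
    mean(predict(model, newdata = data, type = "response"))
  }, numeric(1))
}

# FIRM-style score: standard deviation of the partial dependence curve.
# A flat curve (feature has no effect on predictions) scores near zero.
firm_score <- function(model, data, features) {
  scores <- vapply(features, function(f) {
    sd(partial_dependence(model, data, f))
  }, numeric(1))
  sort(scores, decreasing = TRUE)
}

scores <- firm_score(fit, dat, c("x1", "x2", "x3"))
print(scores)
```

Each feature requires `grid_size` passes of `predict()` over the whole dataset, so the cost grows linearly with the number of features, which is why a dataset with many genes is slow to score this way.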
- accuracy: 0.773
- roc_auc: 0.906

SVM seems more likely to predict "Perturbation".
- accuracy: 0.773
- roc_auc: 0.739
The multiple partition method is executed with classid_ind_tidy.R. Currently it only uses the Random Forest model, with 6 replicas (the paper uses 100 replicas).
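
As a rough illustration of repeated partitioning in general (not the contents of classid_ind_tidy.R), the sketch below draws several random train/test splits, ranks features on each training set, and counts how often each feature lands in the top set. Everything here is assumed for the example: the simulated data, the 80/20 split, the logistic regression standing in for the Random Forest, the absolute z statistic standing in for variable importance, and the helper name `top_features_once`.

```r
# Simulated data (an assumption): g1 and g2 carry signal, g3 and g4 do not.
set.seed(42)
n <- 120
dat <- data.frame(g1 = rnorm(n), g2 = rnorm(n), g3 = rnorm(n), g4 = rnorm(n))
dat$label <- rbinom(n, 1, plogis(2 * dat$g1 + 1.5 * dat$g2))
features <- c("g1", "g2", "g3", "g4")

n_replicas <- 6  # same count as mentioned above; the paper uses 100
top_k <- 2       # hypothetical number of top features kept per replica

# One replica: draw a random training partition, fit a model, and return
# the names of the top_k features by a placeholder importance score.
top_features_once <- function(data, features, prop = 0.8, top_k = 2) {
  train_rows <- sample(nrow(data), size = floor(prop * nrow(data)))
  train <- data[train_rows, ]
  fit <- glm(reformulate(features, response = "label"),
             data = train, family = binomial)
  # Placeholder importance: absolute z statistic of each coefficient.
  z <- abs(coef(summary(fit))[features, "z value"])
  names(sort(z, decreasing = TRUE))[seq_len(top_k)]
}

# Repeat over independent partitions and tally the selections.
selected <- unlist(lapply(seq_len(n_replicas), function(i) {
  top_features_once(dat, features, top_k = top_k)
}))
selection_counts <- sort(table(factor(selected, levels = features)),
                         decreasing = TRUE)
print(selection_counts)
```

Features that are selected in most replicas are less likely to owe their ranking to one particular split, which is the usual motivation for increasing the number of replicas.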