diff --git a/joss.05996/10.21105.joss.05996.jats b/joss.05996/10.21105.joss.05996.jats new file mode 100644 index 0000000000..4fa056f5d7 --- /dev/null +++ b/joss.05996/10.21105.joss.05996.jats @@ -0,0 +1,1088 @@ + + +
Journal of Open Source Software (JOSS)
ISSN: 2475-9066
Publisher: Open Journals

Article: 5996
DOI: 10.21105/joss.05996

Machine Learning Validation via Rational Dataset Sampling with astartes

Authors:
Jackson W. Burns* (https://orcid.org/0000-0002-0657-9426)
Kevin A. Spiekermann (https://orcid.org/0000-0002-9484-9253)
Himaghna Bhattacharjee (https://orcid.org/0000-0002-6598-3939)
Dionisios G. Vlachos (https://orcid.org/0000-0002-6795-8403)
William H. Green (https://orcid.org/0000-0003-2603-9694)

Affiliations:
Center for Computational Science and Engineering, Massachusetts Institute of Technology
Department of Chemical Engineering, Massachusetts Institute of Technology, United States
Department of Chemical and Biomolecular Engineering, University of Delaware, United States

* E-mail:

Date: 3 4 2023
Volume 8, Issue 91, Article 5996

License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0)
Copyright: 2022, The article authors

Keywords: Python, machine learning, sampling, interpolation, extrapolation, data splits, cheminformatics

Summary

Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model’s capacity to interpolate. Testing errors from random splits may be overly optimistic when the model is later given new data that is dissimilar to the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. Separate from astartes, users can then use these splits to better assess out-of-sample performance with any ML model of choice. This publication focuses on use-cases within cheminformatics. However, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).

Statement of Need

Machine learning has sparked an explosion of progress in chemical kinetics (Komp et al., 2022; Spiekermann et al., 2022a), drug discovery (Bannigan et al., 2021; X. Yang et al., 2019), materials science (Wei et al., 2019), and energy storage (Jha et al., 2023) as researchers use data-driven methods to accelerate steps in traditional workflows within some acceptable error tolerance. To facilitate adoption of these models, researchers must think critically about several topics, such as comparing model performance to relevant baselines, operating on user-friendly inputs, and reporting performance on both interpolative and extrapolative tasks (Spiekermann, Stuyver, et al., 2023). astartes aims to make it straightforward for machine learning scientists and researchers to focus on two important points: rigorous hyperparameter optimization and accurate performance evaluation.

First, astartes’ key function train_val_test_split returns splits for training, validation, and testing sets using an sklearn-like interface. These splits can then separately be used with any chosen ML model. This partitioning is crucial since best practices in data science dictate that, in order to minimize the risk of hyperparameter overfitting, one must only optimize hyperparameters with a validation set and use a held-out test set to accurately measure performance on unseen data (Géron, 2019; Huyen, 2022; Lakshmanan et al., 2020; Ramsundar et al., 2019; Wang et al., 2020). Unfortunately, many published papers only mention training and testing sets but do not mention validation sets, implying that they optimize the hyperparameters to the test set, which would be blatant data leakage that leads to overly optimistic results. For researchers interested in quickly obtaining preliminary results without using a validation set to optimize hyperparameters, astartes also implements an sklearn-compatible train_test_split function.
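The validation-set practice described above can be illustrated with a minimal, standard-library-only sketch. It does not use the astartes API: the helper names (three_way_split, knn_predict, evaluate_mae), the toy data, and the 80/10/10 fractions are all assumptions made for illustration. The hyperparameter (the number of neighbors k) is chosen on the validation set only, and the test set is touched once at the end.

```python
import random


def three_way_split(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices and cut them into train/validation/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k training points nearest to x_query."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)


def evaluate_mae(x_train, y_train, x_eval, y_eval, k):
    """Mean absolute error of the k-NN model on an evaluation set."""
    errors = [abs(knn_predict(x_train, y_train, x, k) - y)
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)


# Hypothetical toy data: a noisy quadratic.
noise = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [x * x + noise.gauss(0, 0.5) for x in xs]

train_idx, val_idx, test_idx = three_way_split(len(xs))
x_train, y_train = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
x_val, y_val = [xs[i] for i in val_idx], [ys[i] for i in val_idx]
x_test, y_test = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

# Select the hyperparameter using the validation set only.
candidate_ks = (1, 3, 5, 9)
best_k = min(candidate_ks,
             key=lambda k: evaluate_mae(x_train, y_train, x_val, y_val, k))

# Report performance once, on the held-out test set.
test_mae = evaluate_mae(x_train, y_train, x_test, y_test, best_k)
print(f"best k = {best_k}, test MAE = {test_mae:.3f}")
```

Because the test set plays no part in choosing k, the reported test error is not biased by the hyperparameter search.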

Second, it is crucial to evaluate model performance in both interpolation and extrapolation settings so future users are informed of any potential limitations. Although random splits are frequently used in the cheminformatics literature, this simply measures interpolation performance. However, given the vastness of chemical space (Ruddigkeit et al., 2012) and its often unsmooth nature (e.g. activity cliffs), it seems unlikely that users will want to be restricted to exclusively operate in an interpolation regime. Thus, to encourage adoption of these models, it is crucial to measure performance on more challenging splits as well. The general workflow is: (1) convert each molecule into a vector representation; (2) cluster the molecules based on similarity; (3) train the model on some clusters and then evaluate performance on unseen clusters that should be dissimilar to the clusters used for training. Although measuring performance on chemically dissimilar compounds/clusters is not a new concept (Bilodeau et al., 2023; Durdy et al., 2022; Heinen et al., 2021; Jorner et al., 2021; Meredig et al., 2018; Stuyver & Coley, 2022; Terrones et al., 2023; Tricarico et al., 2022), there are a myriad of choices for the first two steps; our software incorporates many popular representations and similarity metrics to give users freedom to easily explore which combination is suitable for their needs.
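The three-step workflow above can be sketched in plain Python. This is a simplified stand-in, not the astartes implementation: the small k-means routine, the helper names (kmeans_labels, split_by_cluster), and the hand-written 2-D vectors that stand in for molecular representations are assumptions for illustration. The key property is that whole clusters go to either the training or the testing set, so no cluster is shared between them.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_labels(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(pt, centers[c]))
                  for pt in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def split_by_cluster(labels, test_fraction, seed=0):
    """Assign whole clusters to the test set until it reaches test_fraction."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    order = sorted(clusters)
    random.Random(seed).shuffle(order)
    target = test_fraction * len(labels)
    train_idx, test_idx = [], []
    for label in order:
        if len(test_idx) < target:
            test_idx.extend(clusters[label])
        else:
            train_idx.extend(clusters[label])
    return sorted(train_idx), sorted(test_idx)


# Step 1 (assumed done): each sample is already a vector. Hypothetical 2-D data.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
          (10.0, 0.1), (10.2, 0.0), (9.9, 0.2)]

# Step 2: cluster by similarity (Euclidean distance here).
labels = kmeans_labels(points, k=3)

# Step 3: train on some clusters, test on clusters unseen during training.
train_idx, test_idx = split_by_cluster(labels, test_fraction=0.3)
print("train:", train_idx)
print("test: ", test_idx)
```

Since clusters are never divided, the achieved test fraction only approximates the requested one.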

Example Use-Case in Cheminformatics

To demonstrate the difference in performance between interpolation and extrapolation, astartes is used to generate interpolative and extrapolative data splits for two relevant cheminformatics datasets. The impact of these data splits on model performance could be analyzed with any ML model. Here, we train a modified version of Chemprop (K. Yang et al., 2019), a deep message passing neural network, to predict the regression targets of interest. We use the hyperparameters reported by Spiekermann et al. (2022a) as implemented in the barrier_prediction branch, which is publicly available on GitHub (Spiekermann, Pattanaik, et al., 2023). First is property prediction with QM9 (Ramakrishnan et al., 2014), a dataset containing approximately 133,000 small organic molecules, each with 12 relevant chemical properties calculated at B3LYP/6-31G(2df,p). We train a multi-task model to predict all properties, with the testing errors averaged over all targets tabulated below. Second is a single-task model to predict a reaction’s barrier height using the RDB7 dataset (Spiekermann et al., 2022b, 2022c). This reaction database contains a diverse set of 12,000 organic reactions calculated at CCSD(T)-F12 that is relevant to the field of chemical kinetics.

For each dataset, a typical interpolative split is generated using random sampling. We also create two extrapolative splits for comparison. The first uses the cheminformatics-specific Bemis-Murcko scaffold (Bemis & Murcko, 1996) as calculated by RDKit (Landrum & others, 2006). The second uses the more general-purpose K-means clustering based on the Euclidean distance of Morgan (ECFP4) fingerprints using 2048-bit hashing and a radius of 2 (Morgan, 1965; Rogers & Hahn, 2010). The QM9 and RDB7 datasets were organized into 100 and 20 clusters, respectively. For each split, we create 5 different folds (by changing the random seed) and report the mean ± one standard deviation of the mean absolute error (MAE) and root-mean-squared error (RMSE).
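The reporting scheme above (MAE and RMSE per fold, then mean ± one standard deviation across folds) can be sketched with the standard library. The per-fold values below are hypothetical placeholders, not results from the paper. The source does not say whether the standard deviation is the sample or population form, so the use of statistics.stdev (sample form) is an assumption.

```python
import math
import statistics


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical (true, predicted) targets for 5 folds, one per random seed.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 1.8, 3.3, 4.2]),
    ([1.0, 2.0, 3.0, 4.0], [0.7, 2.1, 3.0, 4.6]),
    ([1.0, 2.0, 3.0, 4.0], [1.2, 2.4, 2.9, 3.9]),
    ([1.0, 2.0, 3.0, 4.0], [1.0, 1.5, 3.5, 4.1]),
    ([1.0, 2.0, 3.0, 4.0], [0.9, 2.2, 2.6, 4.3]),
]

fold_maes = [mae(y_true, y_pred) for y_true, y_pred in folds]
fold_rmses = [rmse(y_true, y_pred) for y_true, y_pred in folds]

# Assumption: sample standard deviation across the folds.
for name, values in (("MAE", fold_maes), ("RMSE", fold_rmses)):
    print(f"{name}: {statistics.mean(values):.2f} ± {statistics.stdev(values):.2f}")
```

RMSE weights large errors more heavily than MAE, so on any fold it is at least as large as the MAE.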

Table 1: Average testing errors for predicting the 12 regression targets from QM9 (Ramakrishnan et al., 2014).

Split       MAE            RMSE
Random      2.02 ± 0.06    3.63 ± 0.21
Scaffold    2.20 ± 0.27    3.46 ± 0.49
K-means     2.48 ± 0.33    4.47 ± 0.81

Table 2: Testing errors in kcal/mol for predicting a reaction’s barrier height from RDB7 (Spiekermann et al., 2022b).

Split       MAE            RMSE
Random      3.87 ± 0.05    6.81 ± 0.28
Scaffold    6.28 ± 0.43    9.49 ± 0.50
K-means     5.47 ± 1.14    8.77 ± 1.85

Table 1 and Table 2 show the expected trend in which the average testing errors are generally higher for the extrapolation tasks than they are for the interpolation task; the one exception, the scaffold-split RMSE on QM9, lies within one standard deviation of the random-split value. The results from random splitting are informative if the model is primarily used in interpolation settings. However, these errors are likely unrealistically low if the model is intended to make predictions on new molecules that are chemically dissimilar to those in the training set. Performance is worse on the extrapolative data splits, which present a more challenging task, but these errors should be more representative of evaluating a new sample that is out-of-scope. Together, these tables demonstrate the utility of astartes in allowing users to better understand the likely performance of their model in different settings.
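To put the trend in numbers, the short sketch below computes the relative change of each extrapolative split against the random split, using only the mean values reported in Table 1 and Table 2. The helper name relative_change is an assumption for illustration. The calculation ignores the reported standard deviations, so small differences should not be over-interpreted.

```python
# Mean testing errors copied from Table 1 (QM9) and Table 2 (RDB7, kcal/mol).
reported = {
    "QM9": {
        "MAE": {"Random": 2.02, "Scaffold": 2.20, "K-means": 2.48},
        "RMSE": {"Random": 3.63, "Scaffold": 3.46, "K-means": 4.47},
    },
    "RDB7": {
        "MAE": {"Random": 3.87, "Scaffold": 6.28, "K-means": 5.47},
        "RMSE": {"Random": 6.81, "Scaffold": 9.49, "K-means": 8.77},
    },
}


def relative_change(extrapolative, interpolative):
    """Fractional change of an extrapolative error relative to the random split."""
    return (extrapolative - interpolative) / interpolative


for dataset, metrics in reported.items():
    for metric, by_split in metrics.items():
        baseline = by_split["Random"]
        for split in ("Scaffold", "K-means"):
            change = relative_change(by_split[split], baseline)
            print(f"{dataset} {metric} {split}: {change:+.1%} vs. random")
```

A negative value marks a case where the extrapolative split's mean error is lower than the random split's.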

Several approaches could be taken to further reduce the errors presented here. One could pre-train on additional data or fine-tune with experimental values. Ensembling is another established method to improve model predictions.

Related Software and Code Availability

In the machine learning space, astartes functions as a drop-in replacement for the ubiquitous train_test_split from scikit-learn (Pedregosa et al., 2011). Transitioning existing code to use this new methodology is as simple as running pip install astartes, modifying an import statement at the top of the file, and then specifying an additional keyword parameter. astartes has been designed for maximum interoperability with other packages: it uses few dependencies, supports all platforms, and has validated support for Python 3.7 through 3.11. Specific tutorials on this transition are provided in the online documentation for astartes, which is available on GitHub.

Here is an example workflow using train_test_split taken from the scikit-learn documentation (Pedregosa et al., 2011):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(10).reshape((5, 2)), range(5)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

To switch to using astartes, from sklearn.model_selection import train_test_split becomes from astartes import train_test_split, and the call to split the data is nearly identical, differing only in the extensions that astartes provides:

    import numpy as np
    from astartes import train_test_split

    X, y = np.arange(10).reshape((5, 2)), range(5)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, sampler="kmeans", random_state=42)

With this small change, an extrapolative sampler based on k-means clustering will be used.

Inside cheminformatics, astartes makes use of all molecular featurization options implemented in AIMSim (Bhattacharjee et al., 2023), which includes those from virtually all popular descriptor generation tools used in the cheminformatics field.

The codebase itself has a clearly defined contribution guideline and thorough, easily accessible documentation. astartes uses GitHub Actions for Continuous Integration testing including unit tests, functional tests, and regression tests. To emphasize the reliability and reproducibility of astartes, the data splits used to generate Table 1 and Table 2 are included in the regression tests. Test coverage currently sits at >99%, and all proposed changes are subjected to a coverage check and merged only if they cover all existing and new lines added as well as satisfy the regression tests.

Acknowledgements

The authors thank all users who participated in beta testing and release candidate testing throughout the development of astartes. Authors Kevin Spiekermann and William Green gratefully acknowledge financial support from BASF under award number 88803720. Authors Jackson Burns and William Green gratefully acknowledge financial support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0023112. The contributions of authors Himaghna Bhattacharjee and Dionisios Vlachos were primarily supported by the National Science Foundation under Grant No. 2134471.

Disclaimer

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

+
+ + + + + + + PedregosaF. + VaroquauxG. + GramfortA. + MichelV. + ThirionB. + GriselO. + BlondelM. + PrettenhoferP. + WeissR. + DubourgV. + VanderplasJ. + PassosA. + CournapeauD. + BrucherM. + PerrotM. + DuchesnayE. + + Scikit-learn: Machine learning in Python + Journal of Machine Learning Research + 2011 + 12 + 2825 + 2830 + + + + + + GéronAurélien + + Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems + O’Reilly Media, Inc. + 2019 + + + + + + RamsundarBharath + EastmanPeter + WaltersPatrick + PandeVijay + + Deep learning for the life sciences: Applying deep learning to genomics, microscopy, drug discovery, and more + O’Reilly Media, Inc. + 2019 + + + + + + LakshmananValliappa + RobinsonSara + MunnMichael + + Machine learning design patterns: Solutions to common challenges in data preparation, model building, and MLOps + O’Reilly Media, Inc. + 2020 + + + + + + HuyenChip + + Designing machine learning systems: An iterative process for production-ready applications + O’Reilly Media, Inc. + 2022 + + + + + + WangAnthony Yu-Tung + MurdockRyan J. + KauweSteven K. + OliynykAnton O. + GurloAleksander + BrgochJakoah + PerssonKristin A. + SparksTaylor D. + + Machine learning for materials scientists: An introductory guide toward best practices + Chemistry of Materials + ACS Publications + 2020 + 32 + 12 + 10.1021/acs.chemmater.0c01907.s001 + 4954 + 4965 + + + + + + SpiekermannKevin A. + StuyverThijs + PattanaikLagnajit + GreenWilliam H. + + Comment on ‘physics-based representations for machine learning properties of chemical reactions’ + Machine Learning: Science & Technology + IOP Publishing + 2023 + 4 + 4 + 048001 + + + + + + + RamakrishnanRaghunathan + DralPavlo O. + RuppMatthias + LilienfeldO. 
Anatole von + + Quantum Chemistry Structures and Properties of 134 Kilo Molecules + Scientific Data + Nature Publishing Group + 2014 + 1 + 1 + 10.1038/sdata.2014.22 + 1 + 7 + + + + + + RuddigkeitLars + Van DeursenRuud + BlumLorenz C. + ReymondJean-Louis + + Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17 + Journal of Chemical Information and Modeling + ACS Publications + 2012 + 52 + 11 + 10.1021/ci300415d + 2864 + 2875 + + + + + + SpiekermannKevin A. + PattanaikLagnajit + GreenWilliam H. + + High Accuracy Barrier Heights, Enthalpies, and Rate Coefficients for Chemical Reactions + Scientific Data + Nature Publishing Group + 2022 + 9 + 1 + 10.1038/s41597-022-01529-6 + 1 + 12 + + + + + + SpiekermannKevin A. + PattanaikLagnajit + GreenWilliam H. + + High accuracy barrier heights, enthalpies, and rate coefficients for chemical reactions + Zenodo + 202204 + https://zenodo.org/record/6618262#.YyXlICHMI0Q + 10.5281/zenodo.6618262 + + + + + + SpiekermannKevin A. + PattanaikLagnajit + GreenWilliam H. + + Fast predictions of reaction barrier heights: Toward coupled-cluster accuracy + The Journal of Physical Chemistry A + ACS Publications + 2022 + 126 + 25 + 10.1021/acs.jpca.2c02614 + 3976 + 3986 + + + + + + SpiekermannKevin A. + PattanaikLagnajit + GreenWilliam H. 
+ YangKevin + SwansonKyle + JinWengong + ColeyConnor + EidenPhilipp + GaoHua + Guzman-PerezAngel + HopperTimothy + KelleyBrian + MatheaMiriam + others + + 202302 + https://github.com/kspieks/chemprop/tree/barrier_prediction + + + + + + YangXin + WangYifei + ByrneRyan + SchneiderGisbert + YangShengyong + + Concepts of artificial intelligence for computer-assisted drug discovery + Chemical Reviews + ACS Publications + 2019 + 119 + 18 + 10520 + 10594 + + + + + + BanniganPauric + AldeghiMatteo + BaoZeqing + HäseFlorian + Aspuru-GuzikAlan + AllenChristine + + Machine learning directed drug formulation development + Advanced Drug Delivery Reviews + Elsevier + 2021 + 175 + 113806 + + + + + + + JhaSwarn + YenMatthew + SalinasYazmin + PalmerEvan + VillafuerteJohn + LiangHong + + Learning-assisted materials development and device management in batteries and supercapacitors: Performance comparison and challenges + Journal of Materials Chemistry A + Royal Society of Chemistry + 2023 + 11 + 3904 + 3936 + + + + + + KompEvan + JanulaitisNida + ValleauStéphanie + + Progress Towards Machine Learning Reaction Rate Constants + Physical Chemistry Chemical Physics + Royal Society of Chemistry + 2022 + 24 + 10.1039/d1cp04422b + 2692 + 2705 + + + + + + WeiJing + ChuXuan + SunXiang-Yu + XuKun + DengHui-Xiong + ChenJigen + WeiZhongming + LeiMing + + Machine learning in materials science + InfoMat + Wiley Online Library + 2019 + 1 + 3 + 338 + 358 + + + + + + MeredigBryce + AntonoErin + ChurchCarena + HutchinsonMaxwell + LingJulia + ParadisoSean + BlaiszikBen + FosterIan + GibbonsBrenna + Hattrick-SimpersJason + MehtaApurva + WardLogan + + Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery + Molecular Systems Design & Engineering + Royal Society of Chemistry + 2018 + 3 + 5 + 10.1039/d1cp04422b + 819 + 825 + + + + + + DurdySamantha + GaultoisMichael W. + GusevVladimir V. + BollegalaDanushka + RosseinskyMatthew J. 
+ + Random projections and kernelised leave one cluster out cross validation: Universal baselines and evaluation tools for supervised machine learning of material properties + Digital Discovery + Royal Society of Chemistry + 2022 + 1 + 10.1039/d2dd00039c + 763 + 778 + + + + + + TricaricoGiovanni A. + HofmansJohan + LenselinkEelke B. + RamosMiriam López + DréanicMarie-Pierre + StoutenPieter FW + + Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets + 10.26434/chemrxiv-2022-m8l33 + 2022 + 10.26434/chemrxiv-2022-m8l33-v2 + + + + + + TerronesGianmarco G. + DuanChenru + NandyAditya + KulikHeather J. + + Low-cost machine learning prediction of excited state properties of iridium-centered phosphors + Chemical Science + Royal Society of Chemistry + 2023 + 14 + 10.1039/d2sc06150c + 1419 + 1433 + + + + + + StuyverThijs + ColeyConnor W. + + Quantum Chemistry-Augmented Neural Networks for Reactivity Prediction: Performance, Generalizability, and Explainability + The Journal of Chemical Physics + AIP Publishing LLC + 2022 + 156 + 8 + 10.1063/5.0079574 + 084104 + + + + + + + HeinenStefan + RudorffGuido Falk von + LilienfeldO. Anatole von + + Toward the Design of Chemical Reactions: Machine Learning Barriers of Competing Mechanisms in Reactant Space + J. Chem. Phys. + AIP Publishing LLC + 2021 + 155 + 6 + 10.1063/5.0059742 + 064105 + + + + + + + BilodeauCamille + KazakovAndrei + MukhopadhyaySukrit + EmersonJillian + KalantarTom + MuznyChris + JensenKlavs + + Machine learning for predicting the viscosity of binary liquid mixtures + Chem. Eng. J. + Elsevier + 2023 + 10.2139/ssrn.4289793 + 142454 + + + + + + + JornerKjell + BrinckTore + NorrbyPer-Ola + ButtarDavid + + Machine Learning Meets Mechanistic Modelling for Accurate Prediction of Experimental Activation Energies + Chem. Sci. 
+ Royal Society of Chemistry + 2021 + 12 + 3 + 10.26434/chemrxiv.12758498 + 1163 + 1175 + + + + + + LandrumGreg + others + + RDKit: Open-Source Cheminformatics + 2006 + https://www.rdkit.org + + + + + + BemisGuy W. + MurckoMark A. + + The Properties of Known Drugs. 1. Molecular Frameworks + Journal of Medicinal Chemistry + ACS Publications + 1996 + 39 + 15 + 10.1021/jm9602928 + 2887 + 2893 + + + + + + MorganHarry L. + + The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service + Journal of Chemical Documentation + ACS Publications + 1965 + 5 + 2 + 10.1021/c160017a018 + 107 + 113 + + + + + + RogersDavid + HahnMathew + + Extended-connectivity fingerprints + Journal of Chemical Information and Modeling + ACS Publications + 2010 + 50 + 5 + 10.1021/ci100050t + 742 + 754 + + + + + + YangKevin + SwansonKyle + JinWengong + ColeyConnor + EidenPhilipp + GaoHua + Guzman-PerezAngel + HopperTimothy + KelleyBrian + MatheaMiriam + others + + Analyzing Learned Molecular Representations for Property Prediction + Journal of Chemical Information and Modeling + ACS Publications + 2019 + 59 + 8 + 10.1021/acs.jcim.9b00237.s001 + 3370 + 3388 + + + + + + BhattacharjeeHimaghna + BurnsJackson + VlachosDionisios G. + + AIMSim: An accessible cheminformatics platform for similarity operations on chemicals datasets + Computer Physics Communications + 2023 + 283 + 0010-4655 + https://www.sciencedirect.com/science/article/pii/S0010465522002983 + 10.1016/j.cpc.2022.108579 + 108579 + + + + + +