diff --git a/joss.05996/10.21105.joss.05996.crossref.xml b/joss.05996/10.21105.joss.05996.crossref.xml
new file mode 100644
index 0000000000..936065b007
--- /dev/null
+++ b/joss.05996/10.21105.joss.05996.crossref.xml
@@ -0,0 +1,564 @@
+
+
+
+ 20231105T180902-9e167b10c302914e5b3bc5fa805b3356fc3dd273
+ 20231105180902
+
+ JOSS Admin
+ admin@theoj.org
+
+ The Open Journal
+
+
+
+
+ Journal of Open Source Software
+ JOSS
+ 2475-9066
+
+ 10.21105/joss
+ https://joss.theoj.org
+
+
+
+
+ 11
+ 2023
+
+
+ 8
+
+ 91
+
+
+
+ Machine Learning Validation via Rational Dataset
+Sampling with astartes
+
+
+
+ Jackson W.
+ Burns
+ https://orcid.org/0000-0002-0657-9426
+
+
+ Kevin A.
+ Spiekermann
+ https://orcid.org/0000-0002-9484-9253
+
+
+ Himaghna
+ Bhattacharjee
+ https://orcid.org/0000-0002-6598-3939
+
+
+ Dionisios G.
+ Vlachos
+ https://orcid.org/0000-0002-6795-8403
+
+
+ William H.
+ Green
+ https://orcid.org/0000-0003-2603-9694
+
+
+
+ 11
+ 05
+ 2023
+
+
+ 5996
+
+
+ 10.21105/joss.05996
+
+
+ http://creativecommons.org/licenses/by/4.0/
+ http://creativecommons.org/licenses/by/4.0/
+ http://creativecommons.org/licenses/by/4.0/
+
+
+
+ Software archive
+ 10.5281/zenodo.8147205
+
+
+ GitHub review issue
+ https://github.com/openjournals/joss-reviews/issues/5996
+
+
+
+ 10.21105/joss.05996
+ https://joss.theoj.org/papers/10.21105/joss.05996
+
+
+ https://joss.theoj.org/papers/10.21105/joss.05996.pdf
+
+
+
+
+
+ Scikit-learn: Machine learning in
+Python
+ Pedregosa
+ Journal of Machine Learning
+Research
+ 12
+ 2011
+ Pedregosa, F., Varoquaux, G.,
+Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
+Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
+Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011).
+Scikit-learn: Machine learning in Python. Journal of Machine Learning
+Research, 12, 2825–2830.
+
+
+ Hands-On Machine Learning with Scikit-Learn,
+Keras, and TensorFlow: Concepts, Tools, and Techniques to Build
+Intelligent Systems
+ Géron
+ 2019
+ Géron, A. (2019). Hands-On Machine
+Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and
+Techniques to Build Intelligent Systems. O’Reilly Media,
+Inc.
+
+
+ Deep learning for the life sciences: Applying
+deep learning to genomics, microscopy, drug discovery, and
+more
+ Ramsundar
+ 2019
+ Ramsundar, B., Eastman, P., Walters,
+P., & Pande, V. (2019). Deep learning for the life sciences:
+Applying deep learning to genomics, microscopy, drug discovery, and
+more. O’Reilly Media, Inc.
+
+
+ Machine learning design patterns: Solutions to
+common challenges in data preparation, model building, and
+MLOps
+ Lakshmanan
+ 2020
+ Lakshmanan, V., Robinson, S., &
+Munn, M. (2020). Machine learning design patterns: Solutions to common
+challenges in data preparation, model building, and MLOps. O’Reilly
+Media, Inc.
+
+
+ Designing machine learning systems: An
+iterative process for production-ready applications
+ Huyen
+ 2022
+ Huyen, C. (2022). Designing machine
+learning systems: An iterative process for production-ready
+applications. O’Reilly Media, Inc.
+
+
+ Machine learning for materials scientists: An
+introductory guide toward best practices
+ Wang
+ Chemistry of Materials
+ 12
+ 32
+ 10.1021/acs.chemmater.0c01907.s001
+ 2020
+ Wang, A. Y.-T., Murdock, R. J.,
+Kauwe, S. K., Oliynyk, A. O., Gurlo, A., Brgoch, J., Persson, K. A.,
+& Sparks, T. D. (2020). Machine learning for materials scientists:
+An introductory guide toward best practices. Chemistry of Materials,
+32(12), 4954–4965.
+https://doi.org/10.1021/acs.chemmater.0c01907.s001
+
+
+ Comment on ‘physics-based representations for
+machine learning properties of chemical reactions’
+ Spiekermann
+ Machine Learning: Science &
+Technology
+ 4
+ 4
+ 2023
+ Spiekermann, K. A., Stuyver, T.,
+Pattanaik, L., & Green, W. H. (2023). Comment on “physics-based
+representations for machine learning properties of chemical reactions.”
+Machine Learning: Science & Technology, 4(4),
+048001.
+
+
+ Quantum Chemistry Structures and Properties
+of 134 Kilo Molecules
+ Ramakrishnan
+ Scientific Data
+ 1
+ 1
+ 10.1038/sdata.2014.22
+ 2014
+ Ramakrishnan, R., Dral, P. O., Rupp,
+M., & Lilienfeld, O. A. von. (2014). Quantum Chemistry Structures
+and Properties of 134 Kilo Molecules. Scientific Data, 1(1), 1–7.
+https://doi.org/10.1038/sdata.2014.22
+
+
+ Enumeration of 166 Billion Organic Small
+Molecules in the Chemical Universe Database GDB-17
+ Ruddigkeit
+ Journal of Chemical Information and
+Modeling
+ 11
+ 52
+ 10.1021/ci300415d
+ 2012
+ Ruddigkeit, L., Van Deursen, R.,
+Blum, L. C., & Reymond, J.-L. (2012). Enumeration of 166 Billion
+Organic Small Molecules in the Chemical Universe Database GDB-17.
+Journal of Chemical Information and Modeling, 52(11), 2864–2875.
+https://doi.org/10.1021/ci300415d
+
+
+ High Accuracy Barrier Heights, Enthalpies,
+and Rate Coefficients for Chemical Reactions
+ Spiekermann
+ Scientific Data
+ 1
+ 9
+ 10.1038/s41597-022-01529-6
+ 2022
+ Spiekermann, K. A., Pattanaik, L.,
+& Green, W. H. (2022). High Accuracy Barrier Heights, Enthalpies,
+and Rate Coefficients for Chemical Reactions. Scientific Data, 9(1),
+1–12. https://doi.org/10.1038/s41597-022-01529-6
+
+
+ High accuracy barrier heights, enthalpies,
+and rate coefficients for chemical reactions
+ Spiekermann
+ 10.5281/zenodo.6618262
+ 2022
+ Spiekermann, K. A., Pattanaik, L.,
+& Green, W. H. (2022). High accuracy barrier heights, enthalpies,
+and rate coefficients for chemical reactions (Version 1.0.1). Zenodo.
+https://doi.org/10.5281/zenodo.6618262
+
+
+ Fast predictions of reaction barrier heights:
+Toward coupled-cluster accuracy
+ Spiekermann
+ The Journal of Physical Chemistry
+A
+ 25
+ 126
+ 10.1021/acs.jpca.2c02614
+ 2022
+ Spiekermann, K. A., Pattanaik, L.,
+& Green, W. H. (2022). Fast predictions of reaction barrier heights:
+Toward coupled-cluster accuracy. The Journal of Physical Chemistry A,
+126(25), 3976–3986.
+https://doi.org/10.1021/acs.jpca.2c02614
+
+
+ Spiekermann
+ 2023
+ Spiekermann, K. A., Pattanaik, L.,
+Green, W. H., Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao,
+H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M., & others.
+(2023).
+https://github.com/kspieks/chemprop/tree/barrier_prediction
+
+
+ Concepts of artificial intelligence for
+computer-assisted drug discovery
+ Yang
+ Chemical Reviews
+ 18
+ 119
+ 2019
+ Yang, X., Wang, Y., Byrne, R.,
+Schneider, G., & Yang, S. (2019). Concepts of artificial
+intelligence for computer-assisted drug discovery. Chemical Reviews,
+119(18), 10520–10594.
+
+
+ Machine learning directed drug formulation
+development
+ Bannigan
+ Advanced Drug Delivery
+Reviews
+ 175
+ 2021
+ Bannigan, P., Aldeghi, M., Bao, Z.,
+Häse, F., Aspuru-Guzik, A., & Allen, C. (2021). Machine learning
+directed drug formulation development. Advanced Drug Delivery Reviews,
+175, 113806.
+
+
+ Learning-assisted materials development and
+device management in batteries and supercapacitors: Performance
+comparison and challenges
+ Jha
+ Journal of Materials Chemistry
+A
+ 11
+ 2023
+ Jha, S., Yen, M., Salinas, Y.,
+Palmer, E., Villafuerte, J., & Liang, H. (2023). Learning-assisted
+materials development and device management in batteries and
+supercapacitors: Performance comparison and challenges. Journal of
+Materials Chemistry A, 11, 3904–3936.
+
+
+ Progress Towards Machine Learning Reaction
+Rate Constants
+ Komp
+ Physical Chemistry Chemical
+Physics
+ 24
+ 10.1039/d1cp04422b
+ 2022
+ Komp, E., Janulaitis, N., &
+Valleau, S. (2022). Progress Towards Machine Learning Reaction Rate
+Constants. Physical Chemistry Chemical Physics, 24, 2692–2705.
+https://doi.org/10.1039/d1cp04422b
+
+
+ Machine learning in materials
+science
+ Wei
+ InfoMat
+ 3
+ 1
+ 2019
+ Wei, J., Chu, X., Sun, X.-Y., Xu, K.,
+Deng, H.-X., Chen, J., Wei, Z., & Lei, M. (2019). Machine learning
+in materials science. InfoMat, 1(3), 338–358.
+
+
+ Can machine learning identify the next
+high-temperature superconductor? Examining extrapolation performance for
+materials discovery
+ Meredig
+ Molecular Systems Design &
+Engineering
+ 5
+ 3
+ 10.1039/c8me00012c
+ 2018
+ Meredig, B., Antono, E., Church, C.,
+Hutchinson, M., Ling, J., Paradiso, S., Blaiszik, B., Foster, I.,
+Gibbons, B., Hattrick-Simpers, J., Mehta, A., & Ward, L. (2018). Can
+machine learning identify the next high-temperature superconductor?
+Examining extrapolation performance for materials discovery. Molecular
+Systems Design & Engineering, 3(5), 819–825.
+https://doi.org/10.1039/c8me00012c
+
+
+ Random projections and kernelised leave one
+cluster out cross validation: Universal baselines and evaluation tools
+for supervised machine learning of material properties
+ Durdy
+ Digital Discovery
+ 1
+ 10.1039/d2dd00039c
+ 2022
+ Durdy, S., Gaultois, M. W., Gusev, V.
+V., Bollegala, D., & Rosseinsky, M. J. (2022). Random projections
+and kernelised leave one cluster out cross validation: Universal
+baselines and evaluation tools for supervised machine learning of
+material properties. Digital Discovery, 1, 763–778.
+https://doi.org/10.1039/d2dd00039c
+
+
+ Construction of balanced, chemically
+dissimilar training, validation and test sets for machine learning on
+molecular datasets
+ Tricarico
+ 10.26434/chemrxiv-2022-m8l33
+ 10.26434/chemrxiv-2022-m8l33-v2
+ 2022
+ Tricarico, G. A., Hofmans, J.,
+Lenselink, E. B., Ramos, M. L., Dréanic, M.-P., & Stouten, P. F.
+(2022). Construction of balanced, chemically dissimilar training,
+validation and test sets for machine learning on molecular datasets.
+10.26434/Chemrxiv-2022-M8l33.
+https://doi.org/10.26434/chemrxiv-2022-m8l33-v2
+
+
+ Low-cost machine learning prediction of
+excited state properties of iridium-centered phosphors
+ Terrones
+ Chemical Science
+ 14
+ 10.1039/d2sc06150c
+ 2023
+ Terrones, G. G., Duan, C., Nandy, A.,
+& Kulik, H. J. (2023). Low-cost machine learning prediction of
+excited state properties of iridium-centered phosphors. Chemical
+Science, 14, 1419–1433.
+https://doi.org/10.1039/d2sc06150c
+
+
+ Quantum Chemistry-Augmented Neural Networks
+for Reactivity Prediction: Performance, Generalizability, and
+Explainability
+ Stuyver
+ The Journal of Chemical
+Physics
+ 8
+ 156
+ 10.1063/5.0079574
+ 2022
+ Stuyver, T., & Coley, C. W.
+(2022). Quantum Chemistry-Augmented Neural Networks for Reactivity
+Prediction: Performance, Generalizability, and Explainability. The
+Journal of Chemical Physics, 156(8), 084104.
+https://doi.org/10.1063/5.0079574
+
+
+ Toward the Design of Chemical Reactions:
+Machine Learning Barriers of Competing Mechanisms in Reactant
+Space
+ Heinen
+ J. Chem. Phys.
+ 6
+ 155
+ 10.1063/5.0059742
+ 2021
+ Heinen, S., Rudorff, G. F. von, &
+Lilienfeld, O. A. von. (2021). Toward the Design of Chemical Reactions:
+Machine Learning Barriers of Competing Mechanisms in Reactant Space. J.
+Chem. Phys., 155(6), 064105.
+https://doi.org/10.1063/5.0059742
+
+
+ Machine learning for predicting the viscosity
+of binary liquid mixtures
+ Bilodeau
+ Chem. Eng. J.
+ 10.2139/ssrn.4289793
+ 2023
+ Bilodeau, C., Kazakov, A.,
+Mukhopadhyay, S., Emerson, J., Kalantar, T., Muzny, C., & Jensen, K.
+(2023). Machine learning for predicting the viscosity of binary liquid
+mixtures. Chem. Eng. J., 142454.
+https://doi.org/10.2139/ssrn.4289793
+
+
+ Machine Learning Meets Mechanistic Modelling
+for Accurate Prediction of Experimental Activation
+Energies
+ Jorner
+ Chem. Sci.
+ 3
+ 12
+ 10.26434/chemrxiv.12758498
+ 2021
+ Jorner, K., Brinck, T., Norrby,
+P.-O., & Buttar, D. (2021). Machine Learning Meets Mechanistic
+Modelling for Accurate Prediction of Experimental Activation Energies.
+Chem. Sci., 12(3), 1163–1175.
+https://doi.org/10.26434/chemrxiv.12758498
+
+
+ RDKit: Open-Source
+Cheminformatics
+ Landrum
+ 2006
+ Landrum, G., & others. (2006).
+RDKit: Open-Source Cheminformatics.
+https://www.rdkit.org
+
+
+ The Properties of Known Drugs. 1. Molecular
+Frameworks
+ Bemis
+ Journal of Medicinal
+Chemistry
+ 15
+ 39
+ 10.1021/jm9602928
+ 1996
+ Bemis, G. W., & Murcko, M. A.
+(1996). The Properties of Known Drugs. 1. Molecular Frameworks. Journal
+of Medicinal Chemistry, 39(15), 2887–2893.
+https://doi.org/10.1021/jm9602928
+
+
+ The generation of a unique machine
+description for chemical structures-a technique developed at chemical
+abstracts service
+ Morgan
+ Journal of Chemical
+Documentation
+ 2
+ 5
+ 10.1021/c160017a018
+ 1965
+ Morgan, H. L. (1965). The generation
+of a unique machine description for chemical structures-a technique
+developed at chemical abstracts service. Journal of Chemical
+Documentation, 5(2), 107–113.
+https://doi.org/10.1021/c160017a018
+
+
+ Extended-connectivity
+fingerprints
+ Rogers
+ Journal of Chemical Information and
+Modeling
+ 5
+ 50
+ 10.1021/ci100050t
+ 2010
+ Rogers, D., & Hahn, M. (2010).
+Extended-connectivity fingerprints. Journal of Chemical Information and
+Modeling, 50(5), 742–754.
+https://doi.org/10.1021/ci100050t
+
+
+ Analyzing Learned Molecular Representations
+for Property Prediction
+ Yang
+ Journal of Chemical Information and
+Modeling
+ 8
+ 59
+ 10.1021/acs.jcim.9b00237.s001
+ 2019
+ Yang, K., Swanson, K., Jin, W.,
+Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B.,
+Mathea, M., & others. (2019). Analyzing Learned Molecular
+Representations for Property Prediction. Journal of Chemical Information
+and Modeling, 59(8), 3370–3388.
+https://doi.org/10.1021/acs.jcim.9b00237.s001
+
+
+ AIMSim: An accessible cheminformatics
+platform for similarity operations on chemicals datasets
+ Bhattacharjee
+ Computer Physics
+Communications
+ 283
+ 10.1016/j.cpc.2022.108579
+ 0010-4655
+ 2023
+ Bhattacharjee, H., Burns, J., &
+Vlachos, D. G. (2023). AIMSim: An accessible cheminformatics platform
+for similarity operations on chemicals datasets. Computer Physics
+Communications, 283, 108579.
+https://doi.org/10.1016/j.cpc.2022.108579
+
+
+
+
+
+
diff --git a/joss.05996/10.21105.joss.05996.jats b/joss.05996/10.21105.joss.05996.jats
new file mode 100644
index 0000000000..4fa056f5d7
--- /dev/null
+++ b/joss.05996/10.21105.joss.05996.jats
@@ -0,0 +1,1088 @@
+
+
+
+
+
+
+
+Journal of Open Source Software
+JOSS
+
+2475-9066
+
+Open Journals
+
+
+
+5996
+10.21105/joss.05996
+
+Machine Learning Validation via Rational Dataset Sampling
+with astartes
+
+
+
+https://orcid.org/0000-0002-0657-9426
+
+Burns
+Jackson W.
+
+
+
+*
+
+
+https://orcid.org/0000-0002-9484-9253
+
+Spiekermann
+Kevin A.
+
+
+
+
+https://orcid.org/0000-0002-6598-3939
+
+Bhattacharjee
+Himaghna
+
+
+
+
+https://orcid.org/0000-0002-6795-8403
+
+Vlachos
+Dionisios G.
+
+
+
+
+https://orcid.org/0000-0003-2603-9694
+
+Green
+William H.
+
+
+
+
+
+Center for Computational Science and Engineering,
+Massachusetts Institute of Technology
+
+
+
+
+Department of Chemical Engineering, Massachusetts Institute
+of Technology, United States
+
+
+
+
+Department of Chemical and Biomolecular Engineering,
+University of Delaware, United States
+
+
+
+
+* E-mail:
+
+
+3
+4
+2023
+
+8
+91
+5996
+
+Authors of papers retain copyright and release the
+work under a Creative Commons Attribution 4.0 International License (CC
+BY 4.0)
+2022
+The article authors
+
+Authors of papers retain copyright and release the work under
+a Creative Commons Attribution 4.0 International License (CC BY
+4.0)
+
+
+
+Python
+machine learning
+sampling
+interpolation
+extrapolation
+data splits
+cheminformatics
+
+
+
+
+
+ Summary
+
Machine Learning (ML) has become an increasingly popular tool to
+ accelerate traditional workflows. Critical to the use of ML is the
+ process of splitting datasets into training, validation, and testing
+ subsets that are used to develop and evaluate models. Common practice
+ in the literature is to assign these subsets randomly. Although this
+ approach is fast and efficient, it only measures a model’s capacity to
+ interpolate. Testing errors from random splits may be overly
+ optimistic when the model is later applied to new data that is
+ dissimilar to the training set; thus, there is a growing need to easily measure
+ performance for extrapolation tasks. To address this issue, we report
+ astartes, an open-source Python package that
+ implements many similarity- and distance-based algorithms to partition
+ data into more challenging splits. Separate from
+ astartes, users can then use these splits to
+ better assess out-of-sample performance with any ML model of choice.
+ This publication focuses on use-cases within cheminformatics. However,
+ astartes operates on arbitrary vector inputs,
+ so its principles and workflow are generalizable to other ML domains
+ as well. astartes is available via the Python
+ package managers pip and
+ conda and is publicly hosted on GitHub
+ (github.com/JacksonBurns/astartes).
+
+
+ Statement of Need
+
Machine learning has sparked an explosion of progress in chemical
+ kinetics
+ (Komp
+ et al., 2022;
+ Spiekermann
+ et al., 2022a), drug discovery
+ (Bannigan
+ et al., 2021;
+ X.
+ Yang et al., 2019), materials science
+ (Wei
+ et al., 2019), and energy storage
+ (Jha
+ et al., 2023) as researchers use data-driven methods to
+ accelerate steps in traditional workflows within some acceptable error
+ tolerance. To facilitate adoption of these models, researchers must
+ critically think about several topics, such as comparing model
+ performance to relevant baselines, operating on user-friendly inputs,
+ and reporting performance on both interpolative and extrapolative
+ tasks (Spiekermann, Stuyver, et al.,
+ 2023).
+ astartes aims to make it straightforward for
+ machine learning scientists and researchers to focus on two important
+ points: rigorous hyperparameter optimization and accurate performance
+ evaluation.
+
First, astartes’ key function
+ train_val_test_split returns splits for
+ training, validation, and testing sets using an
+ sklearn-like interface. These splits can then
+ separately be used with any chosen ML model. This partitioning is
+ crucial since best practices in data science dictate that, in order to
+ minimize the risk of hyperparameter overfitting, one must only
+ optimize hyperparameters with a validation set and use a held-out test
+ set to accurately measure performance on unseen data
+ (Géron,
+ 2019;
+ Huyen,
+ 2022;
+ Lakshmanan
+ et al., 2020;
+ Ramsundar
+ et al., 2019;
+ Wang
+ et al., 2020). Unfortunately, many published papers only
+ mention training and testing sets but do not mention validation sets,
+ implying that they optimize the hyperparameters to the test set, which
+ would be blatant data leakage that leads to overly optimistic results.
+ For researchers interested in quickly obtaining preliminary results
+ without using a validation set to optimize hyperparameters,
+ astartes also implements an
+ sklearn-compatible
+ train_test_split function.
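As a rough illustration of why the validation set matters, the following sketch selects a hyperparameter using validation data only and evaluates on the test set exactly once. It is a minimal sketch in plain Python, not astartes code: the helper names (three_way_split, knn_predict, mae), the toy data, the candidate values of k, and the 80/10/10 fractions are all assumptions made for this example.

```python
import random


def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Randomly partition indices 0..n_samples-1 into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def knn_predict(x, train_x, train_y, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mae(xs, ys, train_x, train_y, k):
    """Mean absolute error of the k-NN model on the points (xs, ys)."""
    errors = [abs(knn_predict(x, train_x, train_y, k) - y) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical 1-D regression data: y = x^2 plus a deterministic perturbation.
X = [i / 10 for i in range(100)]
y = [x * x + 0.5 * ((i * 37) % 7 - 3) for i, x in enumerate(X)]

train, val, test = three_way_split(len(X), seed=42)
tx, ty = [X[i] for i in train], [y[i] for i in train]

# Choose the hyperparameter k using the validation set only.
candidates = [1, 3, 5, 9]
val_errors = {
    k: mae([X[i] for i in val], [y[i] for i in val], tx, ty, k) for k in candidates
}
best_k = min(val_errors, key=val_errors.get)

# Touch the test set once, after k has been fixed.
test_mae = mae([X[i] for i in test], [y[i] for i in test], tx, ty, best_k)
print(f"best k = {best_k}, test MAE = {test_mae:.3f}")
```

Had k been chosen by minimizing the test error instead, the reported test MAE would no longer be an unbiased estimate of performance on unseen data.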
+
Second, it is crucial to evaluate model performance in both
+ interpolation and extrapolation settings so future users are informed
+ of any potential limitations. Although random splits are frequently
+ used in the cheminformatics literature, they measure only
+ interpolation performance. However, given the vastness of chemical
+ space
+ (Ruddigkeit
+ et al., 2012) and its often unsmooth nature (e.g. activity
+ cliffs), it seems unlikely that users will want to be restricted to
+ exclusively operate in an interpolation regime. Thus, to encourage
+ adoption of these models, it is crucial to measure performance on more
+ challenging splits as well. The general workflow is:
+ 1. Convert each molecule into a vector representation.
+ 2. Cluster the molecules based on similarity.
+ 3. Train the model on some clusters and then evaluate performance on
+ unseen clusters that should be dissimilar to the clusters used for
+ training.
+ Although measuring performance on
+ chemically dissimilar compounds/clusters is not a new concept
+ (Bilodeau
+ et al., 2023;
+ Durdy
+ et al., 2022;
+ Heinen
+ et al., 2021;
+ Jorner
+ et al., 2021;
+ Meredig
+ et al., 2018;
+ Stuyver
+ & Coley, 2022;
+ Terrones
+ et al., 2023;
+ Tricarico
+ et al., 2022), there are a myriad of choices for the first two
+ steps; our software incorporates many popular representations and
+ similarity metrics to give users freedom to easily explore which
+ combination is suitable for their needs.
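The third step of the workflow above, holding out entire clusters, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the astartes implementation: the function name split_by_cluster, the greedy fill rule, and the example labels are hypothetical, and the cluster labels are assumed to come from whatever representation and clustering method were chosen in the first two steps.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_labels, test_frac=0.2, seed=0):
    """Assign whole clusters to train or test so that no cluster spans both.

    Clusters are visited in a shuffled order and moved to the test set
    until it holds at least test_frac of the samples; the rest go to train.
    """
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    target = test_frac * len(cluster_labels)
    train, test = [], []
    for cid in cluster_ids:
        if len(test) < target:
            test.extend(members[cid])
        else:
            train.extend(members[cid])
    return sorted(train), sorted(test)


# Hypothetical cluster labels from step 2 (e.g. scaffold or k-means IDs).
labels = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
train_idx, test_idx = split_by_cluster(labels, test_frac=0.3, seed=1)

train_clusters = {labels[i] for i in train_idx}
test_clusters = {labels[i] for i in test_idx}
print("train clusters:", sorted(train_clusters))
print("test clusters:", sorted(test_clusters))
```

Because clusters are assigned whole, the realized test fraction can exceed the requested one, which is the usual price of keeping similar samples on the same side of the split.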
+
+
+ Example Use-Case in Cheminformatics
+
To demonstrate the difference in performance between interpolation
+ and extrapolation, astartes is used to generate
+ interpolative and extrapolative data splits for two relevant
+ cheminformatics datasets. The impact of these data splits on model
+ performance could be analyzed with any ML model. Here, we train a
+ modified version of Chemprop
+ (K.
+ Yang et al., 2019), a deep message passing neural network, to
+ predict the regression targets of interest. We use the hyperparameters
+ reported by Spiekermann et al.
+ (2022a)
+ as implemented in the barrier_prediction
+ branch, which is publicly available on
+ GitHub
+ (Spiekermann,
+ Pattanaik, et al., 2023). First is property prediction with QM9
+ (Ramakrishnan
+ et al., 2014), a dataset containing approximately 133,000 small
+ organic molecules, each with 12 relevant chemical properties
+ calculated at B3LYP/6-31G(2df,p). We train a multi-task model to
+ predict all properties, with the arithmetic mean of all predictions
+ tabulated below. Second is a single-task model to predict a reaction’s
+ barrier height using the RDB7 dataset
+ (Spiekermann
+ et al., 2022b,
+ 2022c).
+ This reaction database contains a diverse set of 12,000 organic
+ reactions calculated at CCSD(T)-F12 that is relevant to the field of
+ chemical kinetics.
+
For each dataset, a typical interpolative split is generated using
+ random sampling. We also create two extrapolative splits for
+ comparison. The first uses the cheminformatics-specific Bemis-Murcko
+ scaffold
+ (Bemis
+ & Murcko, 1996) as calculated by RDKit
+ (Landrum
+ & others, 2006). The second uses the more general-purpose
+ K-means clustering based on the Euclidean distance between Morgan (ECFP4)
+ fingerprints using 2048-bit hashing and a radius of 2
+ (Morgan,
+ 1965;
+ Rogers
+ & Hahn, 2010). The QM9 and RDB7 datasets were
+ organized into 100 and 20 clusters, respectively. For each split, we
+ create 5 different folds (by changing the random seed) and report the
+ mean ± one standard deviation of the mean absolute error (MAE) and
+ root-mean-squared error (RMSE).
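The reporting convention described above (mean ± one standard deviation of each metric across folds) might be computed as in the sketch below. The fold data are hypothetical placeholders, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not state which estimator was used.

```python
import math
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def summarize(values):
    """Return (mean, sample standard deviation) of a metric across folds."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical (y_true, y_pred) pairs, one per fold (i.e. per random seed).
folds = [
    ([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]),
    ([1.0, 2.0, 3.0], [1.0, 2.5, 3.5]),
    ([1.0, 2.0, 3.0], [0.0, 2.0, 3.0]),
]

fold_mae = [mae(t, p) for t, p in folds]
fold_rmse = [rmse(t, p) for t, p in folds]

mae_mean, mae_std = summarize(fold_mae)
rmse_mean, rmse_std = summarize(fold_rmse)
print(f"MAE  = {mae_mean:.2f} ± {mae_std:.2f}")
print(f"RMSE = {rmse_mean:.2f} ± {rmse_std:.2f}")
```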
+
+ Table 1: Average testing errors for predicting the 12
+ regression targets from QM9
+ (Ramakrishnan
+ et al., 2014).
+
+ Split    | MAE         | RMSE
+ Random   | 2.02 ± 0.06 | 3.63 ± 0.21
+ Scaffold | 2.20 ± 0.27 | 3.46 ± 0.49
+ K-means  | 2.48 ± 0.33 | 4.47 ± 0.81
+
+ Table 2: Testing errors in kcal/mol for predicting a
+ reaction’s barrier height from RDB7
+ (Spiekermann
+ et al., 2022b).
+
+ Split    | MAE (kcal/mol) | RMSE (kcal/mol)
+ Random   | 3.87 ± 0.05    | 6.81 ± 0.28
+ Scaffold | 6.28 ± 0.43    | 9.49 ± 0.50
+ K-means  | 5.47 ± 1.14    | 8.77 ± 1.85
+
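Table 1 and Table 2 show the expected trend in which the average
+ testing errors are generally higher for the extrapolation tasks than
+ they are for the interpolation task; the one exception, the
+ scaffold-split RMSE on QM9, lies within one standard deviation of the
+ random-split value. The results from random splitting are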
+ informative if the model is primarily used in interpolation
+ settings. However, these errors are likely unrealistically low if
+ the model is intended to make predictions on new molecules that are
+ chemically dissimilar to those in the training set. Performance is
+ worse on the extrapolative data splits, which present a more
+ challenging task, but these errors should be more representative of
+ evaluating a new sample that is out-of-scope. Together, these tables
+ demonstrate the utility of astartes in
+ allowing users to better understand the likely performance of their
+ model in different settings.
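To make the size of the gap concrete, the short calculation below expresses each extrapolative MAE as a fractional increase over the random-split MAE. The numbers are the mean values copied from Table 1 and Table 2; the helper name relative_increase is introduced only for this illustration, and the calculation ignores the reported standard deviations.

```python
# Mean MAE values copied from Table 1 (QM9) and Table 2 (RDB7, kcal/mol).
mean_mae = {
    "QM9": {"Random": 2.02, "Scaffold": 2.20, "K-means": 2.48},
    "RDB7": {"Random": 3.87, "Scaffold": 6.28, "K-means": 5.47},
}


def relative_increase(dataset, split):
    """Fractional increase in mean MAE of `split` over the random split."""
    baseline = mean_mae[dataset]["Random"]
    return (mean_mae[dataset][split] - baseline) / baseline


for dataset in mean_mae:
    for split in ("Scaffold", "K-means"):
        change = relative_increase(dataset, split)
        print(f"{dataset} {split}: {change:+.0%}")
```

By this measure the mean MAE rises by roughly 9% (scaffold) and 23% (K-means) on QM9, and by roughly 62% (scaffold) and 41% (K-means) on RDB7.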
+
Several approaches could be taken to further reduce the errors
+ presented here. One could pre-train on additional data or fine-tune
+ with experimental values. Ensembling is another established method
+ to improve model predictions.
+
+
+
+ Related Software and Code Availability
+
In the machine learning space, astartes
+ functions as a drop-in replacement for the ubiquitous
+ train_test_split from scikit-learn
+ (Pedregosa
+ et al., 2011). Transitioning existing code to use this new
+ methodology is as simple as running
+ pip install astartes, modifying an
+ import statement at the top of the file, and
+ then specifying an additional keyword parameter.
+ astartes has been designed to allow
+ for maximum interoperability with other packages by using few
+ dependencies, supporting all platforms, and offering validated support for
+ Python 3.7 through 3.11. Specific tutorials on this transition are
+ provided in the online documentation for
+ astartes, which is available on
+ GitHub.
+
Here is an example workflow using
+ train_test_split taken from the
+ scikit-learn documentation
+ (Pedregosa
+ et al., 2011):
+import numpy as np
+from sklearn.model_selection import train_test_split
+
+X, y = np.arange(10).reshape((5, 2)), range(5)
+
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.33, random_state=42)
+
To switch to using astartes,
+ from sklearn.model_selection import train_test_split
+ becomes from astartes import train_test_split
+ and the call to split the data is nearly identical, differing only in
+ the additional sampler keyword argument:
+import numpy as np
+from astartes import train_test_split
+
+X, y = np.arange(10).reshape((5, 2)), range(5)
+
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.33, sampler="kmeans", random_state=42)
+
With this small change, an extrapolative sampler based on k-means
+ clustering will be used.
+
Inside cheminformatics, astartes makes use
+ of all molecular featurization options implemented in
+ AIMSim
+ (Bhattacharjee
+ et al., 2023), which includes those from virtually all popular
+ descriptor generation tools used in the cheminformatics field.
+
The codebase itself has a clearly defined contribution guideline
+ and thorough, easily accessible documentation.
+ astartes uses GitHub Actions for Continuous
+ Integration testing, including unit tests, functional tests, and
+ regression tests. To emphasize the reliability and reproducibility of
+ astartes, the data splits used to generate
+ Table 1 and Table 2 are included in the regression tests. Test
+ coverage currently sits at >99%, and all proposed changes are
+ subjected to a coverage check and merged only if they cover all
+ existing and new lines added as well as satisfy the regression
+ tests.
+
+
+ Acknowledgements
+
The authors thank all users who participated in beta testing and
+ release candidate testing throughout the development of
+ astartes. Authors Kevin Spiekermann and William
+ Green gratefully acknowledge financial support from BASF under award
+ number 88803720. Authors Jackson Burns and William Green gratefully
+ acknowledge financial support from the U.S. Department of Energy,
+ Office of Science, Office of Advanced Scientific Computing Research,
+ Department of Energy Computational Science Graduate Fellowship under
+ Award Number DE-SC0023112. The contribution of authors Himaghna
+ Bhattacharjee and Dionisios Vlachos was primarily supported by the National
+ Science Foundation under Grant No. 2134471.
+
+
+ Disclaimer
+
This report was prepared as an account of work sponsored by an
+ agency of the United States Government. Neither the United States
+ Government nor any agency thereof, nor any of their employees, makes
+ any warranty, express or implied, or assumes any legal liability or
+ responsibility for the accuracy, completeness, or usefulness of any
+ information, apparatus, product, or process disclosed, or represents
+ that its use would not infringe privately owned rights. Reference
+ herein to any specific commercial product, process, or service by
+ trade name, trademark, manufacturer, or otherwise does not necessarily
+ constitute or imply its endorsement, recommendation, or favoring by
+ the United States Government or any agency thereof. The views and
+ opinions of authors expressed herein do not necessarily state or
+ reflect those of the United States Government or any agency
+ thereof.
+
+
+
+
+
+
+
+ Pedregosa, F.
+ Varoquaux, G.
+ Gramfort, A.
+ Michel, V.
+ Thirion, B.
+ Grisel, O.
+ Blondel, M.
+ Prettenhofer, P.
+ Weiss, R.
+ Dubourg, V.
+ Vanderplas, J.
+ Passos, A.
+ Cournapeau, D.
+ Brucher, M.
+ Perrot, M.
+ Duchesnay, E.
+
+ Scikit-learn: Machine learning in Python
+
+ 2011
+ 12
+ 2825
+ 2830
+
+
+
+
+
+ Géron, Aurélien
+
+
+ O’Reilly Media, Inc.
+ 2019
+
+
+
+
+
+ Ramsundar, Bharath
+ Eastman, Peter
+ Walters, Patrick
+ Pande, Vijay
+
+
+ O’Reilly Media, Inc.
+ 2019
+
+
+
+
+
+ Lakshmanan, Valliappa
+ Robinson, Sara
+ Munn, Michael
+
+
+ O’Reilly Media, Inc.
+ 2020
+
+
+
+
+
+ Huyen, Chip
+
+
+ O’Reilly Media, Inc.
+ 2022
+
+
+
+
+
+ Wang, Anthony Yu-Tung
+ Murdock, Ryan J.
+ Kauwe, Steven K.
+ Oliynyk, Anton O.
+ Gurlo, Aleksander
+ Brgoch, Jakoah
+ Persson, Kristin A.
+ Sparks, Taylor D.
+
+ Machine learning for materials scientists: An introductory guide toward best practices
+
+ ACS Publications
+ 2020
+ 32
+ 12
+ 10.1021/acs.chemmater.0c01907.s001
+ 4954
+ 4965
+
+
+
+
+
+ Spiekermann, Kevin A.
+ Stuyver, Thijs
+ Pattanaik, Lagnajit
+ Green, William H.
+
+ Comment on ‘physics-based representations for machine learning properties of chemical reactions’
+
+ IOP Publishing
+ 2023
+ 4
+ 4
+ 048001
+
+
+
+
+
+
+ Ramakrishnan, Raghunathan
+ Dral, Pavlo O.
+ Rupp, Matthias
+ Lilienfeld, O. Anatole von
+
+ Quantum Chemistry Structures and Properties of 134 Kilo Molecules
+
+ Nature Publishing Group
+ 2014
+ 1
+ 1
+ 10.1038/sdata.2014.22
+ 1
+ 7
+
+
+
+
+
+ Ruddigkeit, Lars
+ Van Deursen, Ruud
+ Blum, Lorenz C.
+ Reymond, Jean-Louis
+
+ Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17
+
+ ACS Publications
+ 2012
+ 52
+ 11
+ 10.1021/ci300415d
+ 2864
+ 2875
+
+
+
+
+
+ Spiekermann, Kevin A.
+ Pattanaik, Lagnajit
+ Green, William H.
+
+ High Accuracy Barrier Heights, Enthalpies, and Rate Coefficients for Chemical Reactions
+
+ Nature Publishing Group
+ 2022
+ 9
+ 1
+ 10.1038/s41597-022-01529-6
+ 1
+ 12
+
+
+
+
+
+ Spiekermann, Kevin A.
+ Pattanaik, Lagnajit
+ Green, William H.
+
+ High accuracy barrier heights, enthalpies, and rate coefficients for chemical reactions
+ Zenodo
+ 2022-04
+ https://zenodo.org/record/6618262#.YyXlICHMI0Q
+ 10.5281/zenodo.6618262
+
+
+
+
+
+ Spiekermann, Kevin A.
+ Pattanaik, Lagnajit
+ Green, William H.
+
+ Fast predictions of reaction barrier heights: Toward coupled-cluster accuracy
+
+ ACS Publications
+ 2022
+ 126
+ 25
+ 10.1021/acs.jpca.2c02614
+ 3976
+ 3986
+
+
+
+
+
+ Spiekermann, Kevin A.
+ Pattanaik, Lagnajit
+ Green, William H.
+ Yang, Kevin
+ Swanson, Kyle
+ Jin, Wengong
+ Coley, Connor
+ Eiden, Philipp
+ Gao, Hua
+ Guzman-Perez, Angel
+ Hopper, Timothy
+ Kelley, Brian
+ Mathea, Miriam
+ others
+
+ 2023-02
+ https://github.com/kspieks/chemprop/tree/barrier_prediction
+
+
+
+
+
+ Yang, Xin
+ Wang, Yifei
+ Byrne, Ryan
+ Schneider, Gisbert
+ Yang, Shengyong
+
+ Concepts of artificial intelligence for computer-assisted drug discovery
+
+ ACS Publications
+ 2019
+ 119
+ 18
+ 10520
+ 10594
+
+
+
+
+
+ Bannigan, Pauric
+ Aldeghi, Matteo
+ Bao, Zeqing
+ Häse, Florian
+ Aspuru-Guzik, Alan
+ Allen, Christine
+
+ Machine learning directed drug formulation development
+
+ Elsevier
+ 2021
+ 175
+ 113806
+
+
+
+
+
+
+ Jha, Swarn
+ Yen, Matthew
+ Salinas, Yazmin
+ Palmer, Evan
+ Villafuerte, John
+ Liang, Hong
+
+ Learning-assisted materials development and device management in batteries and supercapacitors: Performance comparison and challenges
+
+ Royal Society of Chemistry
+ 2023
+ 11
+ 3904
+ 3936
+
+
+
+
+
+ Komp, Evan
+ Janulaitis, Nida
+ Valleau, Stéphanie
+
+ Progress Towards Machine Learning Reaction Rate Constants
+
+ Royal Society of Chemistry
+ 2022
+ 24
+ 10.1039/d1cp04422b
+ 2692
+ 2705
+
+
+
+
+
+ Wei, Jing
+ Chu, Xuan
+ Sun, Xiang-Yu
+ Xu, Kun
+ Deng, Hui-Xiong
+ Chen, Jigen
+ Wei, Zhongming
+ Lei, Ming
+
+ Machine learning in materials science
+
+ Wiley Online Library
+ 2019
+ 1
+ 3
+ 338
+ 358
+
+
+
+
+
+ Meredig, Bryce
+ Antono, Erin
+ Church, Carena
+ Hutchinson, Maxwell
+ Ling, Julia
+ Paradiso, Sean
+ Blaiszik, Ben
+ Foster, Ian
+ Gibbons, Brenna
+ Hattrick-Simpers, Jason
+ Mehta, Apurva
+ Ward, Logan
+
+ Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery
+
+ Royal Society of Chemistry
+ 2018
+ 3
+ 5
+ 10.1039/c8me00012c
+ 819
+ 825
+
+
+
+
+
+ Durdy, Samantha
+ Gaultois, Michael W.
+ Gusev, Vladimir V.
+ Bollegala, Danushka
+ Rosseinsky, Matthew J.
+
+ Random projections and kernelised leave one cluster out cross validation: Universal baselines and evaluation tools for supervised machine learning of material properties
+
+ Royal Society of Chemistry
+ 2022
+ 1
+ 10.1039/d2dd00039c
+ 763
+ 778
+
+
+
+
+
+ Tricarico, Giovanni A.
+ Hofmans, Johan
+ Lenselink, Eelke B.
+ Ramos, Miriam López
+ Dréanic, Marie-Pierre
+ Stouten, Pieter FW
+
+ Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets
+
+ 2022
+ 10.26434/chemrxiv-2022-m8l33-v2
+
+
+
+
+
+ Terrones, Gianmarco G.
+ Duan, Chenru
+ Nandy, Aditya
+ Kulik, Heather J.
+
+ Low-cost machine learning prediction of excited state properties of iridium-centered phosphors
+
+ Royal Society of Chemistry
+ 2023
+ 14
+ 10.1039/d2sc06150c
+ 1419
+ 1433
+
+
+
+
+
+ Stuyver, Thijs
+ Coley, Connor W.
+
+ Quantum Chemistry-Augmented Neural Networks for Reactivity Prediction: Performance, Generalizability, and Explainability
+
+ AIP Publishing LLC
+ 2022
+ 156
+ 8
+ 10.1063/5.0079574
+ 084104
+
+
+
+
+
+
+ Heinen, Stefan
+ Rudorff, Guido Falk von
+ Lilienfeld, O. Anatole von
+
+ Toward the Design of Chemical Reactions: Machine Learning Barriers of Competing Mechanisms in Reactant Space
+
+ AIP Publishing LLC
+ 2021
+ 155
+ 6
+ 10.1063/5.0059742
+ 064105
+
+
+
+
+
+
+ Bilodeau, Camille
+ Kazakov, Andrei
+ Mukhopadhyay, Sukrit
+ Emerson, Jillian
+ Kalantar, Tom
+ Muzny, Chris
+ Jensen, Klavs
+
+ Machine learning for predicting the viscosity of binary liquid mixtures
+
+ Elsevier
+ 2023
+ 10.2139/ssrn.4289793
+ 142454
+
+
+
+
+
+
+ Jorner, Kjell
+ Brinck, Tore
+ Norrby, Per-Ola
+ Buttar, David
+
+ Machine Learning Meets Mechanistic Modelling for Accurate Prediction of Experimental Activation Energies
+
+ Royal Society of Chemistry
+ 2021
+ 12
+ 3
+ 10.26434/chemrxiv.12758498
+ 1163
+ 1175
+
+
+
+
+
+ Landrum, Greg
+ others
+
+ RDKit: Open-Source Cheminformatics
+ 2006
+ https://www.rdkit.org
+
+
+
+
+
+ Bemis, Guy W.
+ Murcko, Mark A.
+
+ The Properties of Known Drugs. 1. Molecular Frameworks
+
+ ACS Publications
+ 1996
+ 39
+ 15
+ 10.1021/jm9602928
+ 2887
+ 2893
+
+
+
+
+
+ Morgan, Harry L.
+
+ The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service
+
+ ACS Publications
+ 1965
+ 5
+ 2
+ 10.1021/c160017a018
+ 107
+ 113
+
+
+
+
+
+ Rogers, David
+ Hahn, Mathew
+
+ Extended-connectivity fingerprints
+
+ ACS Publications
+ 2010
+ 50
+ 5
+ 10.1021/ci100050t
+ 742
+ 754
+
+
+
+
+
+ Yang, Kevin
+ Swanson, Kyle
+ Jin, Wengong
+ Coley, Connor
+ Eiden, Philipp
+ Gao, Hua
+ Guzman-Perez, Angel
+ Hopper, Timothy
+ Kelley, Brian
+ Mathea, Miriam
+ others
+
+ Analyzing Learned Molecular Representations for Property Prediction
+
+ ACS Publications
+ 2019
+ 59
+ 8
+ 10.1021/acs.jcim.9b00237.s001
+ 3370
+ 3388
+
+
+
+
+
+ Bhattacharjee, Himaghna
+ Burns, Jackson
+ Vlachos, Dionisios G.
+
+ AIMSim: An accessible cheminformatics platform for similarity operations on chemicals datasets
+
+ 2023
+ 283
+ 0010-4655
+ https://www.sciencedirect.com/science/article/pii/S0010465522002983
+ 10.1016/j.cpc.2022.108579
+ 108579
+
+
+
+
+
+
diff --git a/joss.05996/10.21105.joss.05996.pdf b/joss.05996/10.21105.joss.05996.pdf
new file mode 100644
index 0000000000..7e77e6ca45
Binary files /dev/null and b/joss.05996/10.21105.joss.05996.pdf differ