From 28fc8c204dea89fde49270c05a08dbdb677f7d07 Mon Sep 17 00:00:00 2001
From: The Open Journals editorial robot <89919391+editorialbot@users.noreply.github.com>
Date: Thu, 9 Nov 2023 13:52:19 +0000
Subject: [PATCH] Creating 10.21105.joss.05763.jats

---
 joss.05763/10.21105.joss.05763.jats | 708 ++++++++++++++++++++++++++++
 1 file changed, 708 insertions(+)
 create mode 100644 joss.05763/10.21105.joss.05763.jats

diff --git a/joss.05763/10.21105.joss.05763.jats b/joss.05763/10.21105.joss.05763.jats
new file mode 100644
index 0000000000..faf311ccf7
--- /dev/null
+++ b/joss.05763/10.21105.joss.05763.jats
@@ -0,0 +1,708 @@
Journal of Open Source Software (JOSS)
ISSN: 2475-9066
Publisher: Open Journals
Article: 5763
DOI: 10.21105/joss.05763

OpenFEPOPS: A Python implementation of the FEPOPS molecular similarity technique

Authors:
Yan-Kai Chen (https://orcid.org/0000-0001-7161-9503)
Douglas R. Houston (https://orcid.org/0000-0002-3469-1546)
Manfred Auer (https://orcid.org/0000-0001-8920-3522)
Steven Shave* (https://orcid.org/0000-0001-6996-3663)

Affiliations:
School of Biological Sciences, University of Edinburgh, The King’s Buildings, Max Born Crescent, CH Waddington Building, Edinburgh, EH9 3BF, United Kingdom.
Xenobe Research Institute, P. O. Box 3052, San Diego, California, 92163, United States.

* E-mail:

Date: 6 8 2023
Volume 8, Issue 91, Article 5763

Copyright 2022 The article authors. Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Keywords: Python, molecular similarity, virtual screening, pharmacophores, feature points

Summary

OpenFEPOPS is an open-source Python implementation of the FEature POint PharmacophoreS (FEPOPS) molecular similarity technique (Jenkins et al., 2004; Jenkins, 2013; Nettles et al., 2007), enabling descriptor generation, comparison, and ranking of molecules in virtual screening campaigns. Ligand-based virtual screening (Ripphausen et al., 2011) is a fundamental approach used to expand hit series or perform scaffold hopping, whereby new chemistries and synthetic routes are made available in efforts to remove undesirable molecular properties and discover better starting points in the early stages of drug discovery (Hughes et al., 2011). Typically, these techniques query hit molecules against proprietary, in-house, or publicly available repositories of small molecules in the hope of finding close matches that display similar activities to the query. This rests on the molecular similarity principle, which states that similar molecules should have similar properties and make similar interactions (Cortés-Ciriano et al., 2020). Often, batteries of these similarity measures are used in parallel, helping to score molecules from many different subjective viewpoints and measures of similarity (Baber et al., 2006). The central idea behind FEPOPS is to reduce the complexity of molecules by merging local atomic environments and atom properties into ‘feature points’. This compressed feature point representation has been used to great effect, as noted in the literature, helping researchers identify active and potentially therapeutically valuable small molecules. By default, OpenFEPOPS uses literature-reported parameters which show good performance in retrieval of active lead- and drug-like small molecules within virtual screening campaigns, with feature points capturing charge, lipophilicity, and hydrogen bond acceptor and donor status.
When run with default parameters, OpenFEPOPS represents each molecule using seven sets of four feature points, with each set encoded into 22 numeric values, resulting in a compact representation of 616 bytes per molecule. By extension, this allows the indexing of a compound archive containing 1 million small molecules using 587.5 MiB (616 MB) of data. Whilst more compact representations are readily available, the FEPOPS technique strives to capture tautomer and conformer information, first through enumeration and then through diversity-driven selection of representative FEPOPS descriptors to capture the diverse states that a molecule may adopt.
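The storage figures quoted above follow from simple arithmetic, which the short sketch below reproduces. It assumes, as described in the descriptor generation steps of this paper, that each set of four feature points stores four properties per point plus six pairwise distances as 32-bit floats. The constant and function names are illustrative only and are not part of the OpenFEPOPS API.

```python
import struct

# Defaults described in the paper; the names here are illustrative only.
FEPOPS_PER_MOLECULE = 7    # representative FEPOPS kept per molecule
POINTS_PER_FEPOP = 4       # feature points per FEPOP
PROPERTIES_PER_POINT = 4   # charge, logP, H-bond donor, H-bond acceptor
BYTES_PER_VALUE = struct.calcsize("<f")  # size of a 32-bit float (4 bytes)


def values_per_fepop(points=POINTS_PER_FEPOP, properties=PROPERTIES_PER_POINT):
    """Per-point properties plus one distance for every pair of points."""
    pairwise_distances = points * (points - 1) // 2
    return points * properties + pairwise_distances


def bytes_per_molecule():
    """Size of the full default descriptor for one molecule."""
    return FEPOPS_PER_MOLECULE * values_per_fepop() * BYTES_PER_VALUE


per_molecule = bytes_per_molecule()
archive_bytes = 1_000_000 * per_molecule
archive_mib = archive_bytes / 2**20  # mebibytes (1 MiB = 1,048,576 bytes)

print(values_per_fepop(), per_molecule, round(archive_mib, 1))
```

The 587.5 figure corresponds to mebibytes; in decimal units the same archive is 616 MB.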

+
+ + Statement of need +

At the time of writing, OpenFEPOPS is the only publicly available implementation of the FEPOPS molecular similarity technique. Whilst used within industry and referenced extensively in the literature, the technique has until now been unavailable to researchers as an open-source tool. We welcome contributions and collaborative efforts to enhance and expand OpenFEPOPS through the associated GitHub repository. It is hoped that this will allow the technique to be used not only for traditional small molecule similarity, but also in emerging fields such as protein design and featurisation of small- and macro-molecules for both predictive and generative tasks.

+
+ + Brief software description +

Whilst OpenFEPOPS includes functionality for descriptor caching and profiling of libraries, the core functionality of the package is descriptor generation and scoring.

+ + <italic>Descriptor generation:</italic> +

The OpenFEPOPS descriptor generation process, as outlined in [fig:descriptor_generation], is as follows:

+ + +

Tautomer enumeration

+ + +

For a given small molecule, OpenFEPOPS uses RDKit (Landrum, 2013) to iterate over molecular tautomers. By default, there is no limit to the number of tautomers recovered, but a limit may be imposed. This may be necessary if adapting the OpenFEPOPS code to large macromolecules rather than small molecules.

+
+
+
+ +

Conformer enumeration

+ + +

For each tautomer, up to 1024 conformers are sampled, either by complete enumeration of rotatable bond states (at the literature-reported optimum increment of 90 degrees, giving four states per bond and so at most 4^5 = 1024 conformers) if there are five or fewer rotatable bonds, or by random sampling of 1024 possible states if there are more than five rotatable bonds.

+
+
+
+ +

Defining feature points

+ + +

The KMeans algorithm (Arthur & Vassilvitskii, 2007) is applied to each conformer of each tautomer to identify four (by default) representative or central points, into which the atomic information of neighbouring atoms is collapsed. As standard, the atomic properties of charge, logP, hydrogen bond donor status, and hydrogen bond acceptor status are collapsed into four feature points per unique tautomer conformation. The RDKit package is used to calculate these properties, with the iterative Gasteiger charges algorithm (Gasteiger & Marsili, 1980) applied to assign atomic charges, the Crippen method (Wildman & Crippen, 1999) used to assign atomic logP contributions, and hydrogen bond acceptors and donors identified with appropriate SMARTS substructure queries (Gillet et al., 1998). These feature points are encoded to 22 numeric values (a FEPOP) comprising four points, each with four properties, and six pairwise distances between these points. With many FEPOPS descriptors collected from a single molecule through tautomer and conformer enumeration, this set of representative FEPOPS should capture every possible state of the original molecule.

+
+
+
+ +

Selection of diverse FEPOPS

+ + +

From the collection of FEPOPS derived from every tautomer conformation of a molecule, the K-Medoid algorithm (Park & Jun, 2009) is applied to identify seven (by default) diverse FEPOPS which are thought to best capture a fuzzy representation of the molecule. These seven FEPOPS each comprise 22 descriptors, totaling 154 32-bit floating point numbers or 616 bytes.

+
+
+
+
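The conformer sampling rule in the second step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the OpenFEPOPS implementation: it only generates tuples of dihedral angles, whereas the real code must apply them to an RDKit molecule, and it assumes random sampling is done with replacement. The function and constant names are hypothetical.

```python
import itertools
import random

ANGLE_INCREMENT = 90                # degrees, the literature-reported default
MAX_CONFORMERS = 1024               # cap on conformers per tautomer
MAX_BONDS_FOR_FULL_ENUMERATION = 5  # 4 states per bond, 4**5 == 1024


def dihedral_states(n_rotatable_bonds, seed=0):
    """Return one tuple of dihedral angles (in degrees) per conformer."""
    angles = range(0, 360, ANGLE_INCREMENT)  # 0, 90, 180, 270
    if n_rotatable_bonds <= MAX_BONDS_FOR_FULL_ENUMERATION:
        # Complete enumeration: every combination of per-bond angles.
        return list(itertools.product(angles, repeat=n_rotatable_bonds))
    # Too many combinations: draw a fixed number of random states instead.
    # Sampling with replacement is an assumption; duplicates are possible.
    rng = random.Random(seed)
    return [
        tuple(rng.choice(angles) for _ in range(n_rotatable_bonds))
        for _ in range(MAX_CONFORMERS)
    ]


for bonds in (0, 3, 5, 8):
    print(bonds, len(dihedral_states(bonds)))
```

Five rotatable bonds is the largest case that complete enumeration can cover within the 1024-conformer budget, which is why the switch to random sampling happens above that number.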
+ +

OpenFEPOPS descriptor generation showing the capture of tautomer and conformer information from a single input molecule.

+ +
+
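To make the 22-value encoding concrete, the sketch below packs four feature points into a single flat vector of 16 properties followed by six pairwise distances. The `FeaturePoint` type, the `encode_fepop` function, the ordering of values within the vector, and the example numbers are all assumptions made for illustration; the layout used internally by OpenFEPOPS may differ.

```python
import itertools
import math
from typing import NamedTuple, Tuple


class FeaturePoint(NamedTuple):
    """Hypothetical container for one feature point (not the package's type)."""
    xyz: Tuple[float, float, float]  # position of the cluster centre
    charge: float                    # summed partial charge
    logp: float                      # summed logP contribution
    donor: float                     # hydrogen bond donor flag
    acceptor: float                  # hydrogen bond acceptor flag


def encode_fepop(points):
    """Flatten feature points into properties followed by pairwise distances."""
    properties = [
        value
        for p in points
        for value in (p.charge, p.logp, p.donor, p.acceptor)
    ]
    distances = [
        math.dist(a.xyz, b.xyz) for a, b in itertools.combinations(points, 2)
    ]
    return properties + distances


# Made-up feature points, chosen so that some distances are easy to check.
example_points = [
    FeaturePoint((0.0, 0.0, 0.0), -0.4, 0.2, 0.0, 1.0),
    FeaturePoint((3.0, 0.0, 0.0), 0.1, 1.1, 1.0, 0.0),
    FeaturePoint((0.0, 4.0, 0.0), 0.3, -0.5, 0.0, 0.0),
    FeaturePoint((0.0, 0.0, 12.0), 0.0, 0.7, 0.0, 1.0),
]
fepop = encode_fepop(example_points)
print(len(fepop))
```

With four points there are 4 × 4 = 16 property values and 4 × 3 / 2 = 6 distances, giving the 22 values described in the text.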

Descriptor generation with OpenFEPOPS is a compute-intensive task and, as noted in the literature, is designed to be run in situations where large compound archives have had their descriptors pre-generated and are queried against relatively small numbers of new molecules for which descriptors are not known and are generated ad hoc. To enable use in this manner, OpenFEPOPS provides functionality to cache descriptors through specification of database files, in either the SQLite or JSON formats.
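The general caching pattern described here, computing a descriptor once and reusing it on later queries, can be sketched with Python's built-in `sqlite3` module. The table layout, the function names, and the placeholder `compute` callable below are hypothetical and do not reflect the schema or API that OpenFEPOPS actually uses.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a descriptor cache; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptors "
        "(smiles TEXT PRIMARY KEY, fepops TEXT NOT NULL)"
    )
    return conn


def get_or_compute(conn, smiles, compute):
    """Return a cached descriptor, computing and storing it on a miss."""
    row = conn.execute(
        "SELECT fepops FROM descriptors WHERE smiles = ?", (smiles,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    descriptor = compute(smiles)  # the expensive step, run once per molecule
    conn.execute(
        "INSERT INTO descriptors (smiles, fepops) VALUES (?, ?)",
        (smiles, json.dumps(descriptor)),
    )
    conn.commit()
    return descriptor


calls = []


def fake_descriptor(smiles):
    """Stand-in for real descriptor generation; records each invocation."""
    calls.append(smiles)
    return [float(len(smiles))] * 22


cache = open_cache()
first = get_or_compute(cache, "CCO", fake_descriptor)
second = get_or_compute(cache, "CCO", fake_descriptor)
print(len(calls), first == second)
```

Passing a file path instead of `":memory:"` would persist the cache between runs, which is the situation the paragraph above describes for large pre-generated archives.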

+
+ + Scoring and comparison of molecules based on their molecular + descriptors + + +

Sorting

+ + +

With seven (by default) diverse FEPOPS representing a small molecule, the FEPOPS are sorted by ascending charge.

+
+
+
+ +

Scaling

+ + +

Due to the different scales and distributions of features comprising FEPOPS descriptors, each FEPOP is centered and scaled according to the observed means and standard deviations of the same features within a larger pool of molecules. By default, these means and standard deviations have been derived from the DUDE (Mysinger et al., 2012) diversity set, which captures known actives and decoys for a diverse set of therapeutic targets (see the Jupyter notebook ‘Explore_DUDE_diversity_set.ipynb’ in the source repository for further methods).

+
+
+
+ +

Scoring

+ + +

The Pearson correlation coefficient is calculated between the scaled descriptors of the first molecule and the scaled descriptors of the second.

+
+
+
+
+
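The scaling and scoring steps above can be illustrated with plain Python. This is a sketch under stated assumptions: descriptors are treated as flat lists of numbers, the means and standard deviations are made-up values rather than those derived from the DUDE diversity set, and the way OpenFEPOPS combines the seven FEPOPS of each molecule into one final score is not shown. The function names are hypothetical.

```python
import math


def scale(descriptor, means, stds):
    """Centre and scale each feature using reference means and deviations."""
    return [(v - m) / s for v, m, s in zip(descriptor, means, stds)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_x * spread_y)


# Made-up reference statistics and descriptors, truncated to four features.
means = [0.0, 1.0, 0.5, 4.0]
stds = [0.5, 2.0, 0.5, 1.5]
molecule_a = [-0.4, 2.1, 1.0, 5.2]
molecule_b = [-0.3, 1.8, 1.0, 4.9]

score = pearson(scale(molecule_a, means, stds), scale(molecule_b, means, stds))
print(round(score, 3))
```

Scaling before correlating stops features with large numeric ranges, such as distances, from dominating the score.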

The literature highlights that the choice of the Pearson correlation coefficient leads to high background scores, as it is highly unlikely to see little correlation between any two molecules due to fundamental limitations of chemistry and geometry. Therefore, unrelated molecules are likely to have FEPOPS similarity scores higher than those encountered with more traditional techniques such as bitstring fingerprints and Tanimoto or Dice similarity measures.

+

The predictive performance of OpenFEPOPS was evaluated using the DUDE (Mysinger et al., 2012) diversity set. This dataset comprises eight protein targets accompanied by decoy ligands and known active ligands. For each target, actives were used as queries to retrieve all other actives. Retrieval rankings were assessed using the AUROC (Area Under Receiver Operating Characteristic) metric (Fawcett, 2006), and the scores for each active were averaged within targets to assign a final average AUROC score for each target. Table 1 shows the average AUROC scores for DUDE diversity set targets along with scores obtained using the popular Morgan 2, MACCS, and RDKit fingerprints as implemented in RDKit and scored using the Tanimoto similarity metric. See the Jupyter notebook ‘Explore_DUDE_diversity_set.ipynb’ in the source repository for further methods and data availability using the FigShare service. All evaluated similarity techniques perform comparably, with average AUROC scores of 0.723, 0.692, 0.687, and 0.701 for Morgan 2, MACCS, RDKit, and OpenFEPOPS respectively. OpenFEPOPS achieves performance comparable to the other techniques while using 3D representations of molecules across a range of tautomer states, which is in stark contrast to the approaches taken by the other connectivity and fingerprint-based methods. Diversity in similarity techniques allows potentially interesting actives undiscoverable with one technique to be flagged and ranked highly by another, offering new routes to novelty, new chemistries, and efficacious leads from early-stage drug discovery efforts.
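The evaluation procedure described above, one AUROC per query active followed by an average within each target, can be sketched as follows. The pairwise formulation used here is a standard way of computing AUROC but is not necessarily how the accompanying notebook computes it, and the scores and labels are made-up values. The function names are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the chance that a random active outscores a random decoy.

    Ties between an active and a decoy count as half a win.
    """
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = sum(
        1.0 if a > d else 0.5 if a == d else 0.0
        for a in actives
        for d in decoys
    )
    return wins / (len(actives) * len(decoys))


def mean_auroc(per_query_results):
    """Average the AUROC obtained from each query active within a target."""
    values = [auroc(scores, labels) for scores, labels in per_query_results]
    return sum(values) / len(values)


# Two made-up queries against the same four library molecules.
labels = [True, True, False, False]
query_1 = [0.9, 0.8, 0.3, 0.1]  # both actives rank above both decoys
query_2 = [0.9, 0.3, 0.8, 0.1]  # one active is outranked by a decoy

print(auroc(query_1, labels), auroc(query_2, labels))
print(mean_auroc([(query_1, labels), (query_2, labels)]))
```

The pairwise loop is quadratic in library size, so a rank-based implementation would be preferable for datasets as large as the one evaluated here.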

Target   Morgan 2   MACCS   RDKit   OpenFEPOPS
akt1     0.836      0.741   0.833   0.831
ampc     0.784      0.673   0.660   0.639
cp3a4    0.603      0.582   0.613   0.647
cxcr4    0.697      0.854   0.592   0.899
gcr      0.670      0.666   0.708   0.616
hivpr    0.780      0.681   0.759   0.678
hivrt    0.651      0.670   0.660   0.582
kif11    0.763      0.668   0.672   0.713

Table 1: Averaged AUROC scores by target and molecular similarity technique for the DUDE diversity set. Across all datasets, 19 small molecules out of 112,796 were excluded from analysis, mainly due to issues in parsing to valid structures using RDKit.
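As a consistency check, the overall averages quoted in the text (0.723, 0.692, 0.687, and 0.701) can be recomputed from the per-target values in Table 1. The snippet below transcribes the table and also reports which technique scores highest on each target; the variable names are illustrative.

```python
METHODS = ("Morgan 2", "MACCS", "RDKit", "OpenFEPOPS")

# Per-target average AUROC values transcribed from Table 1.
TABLE_1 = {
    "akt1": (0.836, 0.741, 0.833, 0.831),
    "ampc": (0.784, 0.673, 0.660, 0.639),
    "cp3a4": (0.603, 0.582, 0.613, 0.647),
    "cxcr4": (0.697, 0.854, 0.592, 0.899),
    "gcr": (0.670, 0.666, 0.708, 0.616),
    "hivpr": (0.780, 0.681, 0.759, 0.678),
    "hivrt": (0.651, 0.670, 0.660, 0.582),
    "kif11": (0.763, 0.668, 0.672, 0.713),
}

# Mean over the eight targets for each technique, rounded as in the text.
method_means = {
    method: round(sum(row[i] for row in TABLE_1.values()) / len(TABLE_1), 3)
    for i, method in enumerate(METHODS)
}

# Technique with the highest AUROC for each target.
best_by_target = {
    target: METHODS[row.index(max(row))] for target, row in TABLE_1.items()
}

print(method_means)
print(best_by_target)
```

On these numbers OpenFEPOPS is the top-scoring technique for cp3a4 and cxcr4, which illustrates the point that different similarity techniques can favour different targets.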

+
+ + Availability, usage and documentation +

OpenFEPOPS is available from the Python Package Index (PyPI) under the name ‘fepops’ and is installable with the pip package manager using the command pip install fepops. With the package installed, entry points expose commonly used OpenFEPOPS tasks such as descriptor generation and calculation of molecular similarity, enabling simple command line access without the need to explicitly invoke a Python interpreter. Whilst OpenFEPOPS may be used solely via the command line interface, a robust API is available and may be used within other programs or integrated into existing pipelines to enable more complex workflows. Extensive API documentation is available at https://justinykc.github.io/FEPOPS, along with a concise user guide at https://justinykc.github.io/FEPOPS/readme.html

+
+
Nettles, J. H., Jenkins, J. L., Williams, C., Clark, A. M., Bender, A., Deng, Z., Davies, J. W., & Glick, M. (2007). Flexible 3D pharmacophores as descriptors of dynamic biological space. Journal of Molecular Graphics and Modelling, 26(3), 622-633. https://doi.org/10.1016/j.jmgm.2007.02.005

Jenkins, J. L., Glick, M., & Davies, J. W. (2004). A 3D similarity method for scaffold hopping from known drugs or natural ligands to new chemotypes. Journal of Medicinal Chemistry, 47(25), 6144-6159. https://doi.org/10.1021/jm049654z

Jenkins, J. L. (2013). Feature point pharmacophores (FEPOPS). In Scaffold hopping in medicinal chemistry (pp. 155-174). Wiley Online Library. https://doi.org/10.1002/9783527665143.ch10

Ripphausen, P., Nisius, B., & Bajorath, J. (2011). State-of-the-art in ligand-based virtual screening. Drug Discovery Today, 16(9-10), 372-376. https://doi.org/10.1016/j.drudis.2011.02.011

Hughes, J. P., Rees, S., Kalindjian, S. B., & Philpott, K. L. (2011). Principles of early drug discovery. British Journal of Pharmacology, 162(6), 1239-1249. https://doi.org/10.1111/j.1476-5381.2010.01127.x

Cortés-Ciriano, I., Škuta, C., Bender, A., & Svozil, D. (2020). QSAR-derived affinity fingerprints (part 2): Modeling performance for potency prediction. Journal of Cheminformatics, 12(1), 41. https://doi.org/10.1186/s13321-020-00444-5

Baber, J. C., Shirley, W. A., Gao, Y., & Feher, M. (2006). The use of consensus scoring in ligand-based virtual screening. Journal of Chemical Information and Modeling, 46(1), 277-288. https://doi.org/10.1021/ci050296y

Mysinger, M. M., Carchia, M., Irwin, J. J., & Shoichet, B. K. (2012). Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking. Journal of Medicinal Chemistry, 55(14), 6582-6594. https://doi.org/10.1021/jm300687e

Landrum, G. (2013). RDKit: Open-source cheminformatics.

Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027-1035. https://dl.acm.org/doi/abs/10.5555/1283383.1283494

Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2), 3336-3341. https://doi.org/10.1016/j.eswa.2008.01.039

Gillet, V. J., Willett, P., & Bradshaw, J. (1998). Identification of biological activity profiles using substructural analysis and genetic algorithms. Journal of Chemical Information and Computer Sciences, 38(2), 165-179. https://doi.org/10.1021/ci970431+

Wildman, S. A., & Crippen, G. M. (1999). Prediction of physicochemical parameters by atomic contributions. Journal of Chemical Information and Computer Sciences, 39(5), 868-873. https://doi.org/10.1021/ci990307l

Gasteiger, J., & Marsili, M. (1980). Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges. Tetrahedron, 36(22), 3219-3228. https://doi.org/10.1016/0040-4020(80)80168-2

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010