Wikipedia: Cheminformatics is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used, for example, in pharmaceutical companies and academic settings in the process of drug discovery. These methods can also be used in chemical and allied industries in various other forms.
Datagrok provides first-class support for small molecules, as well as the most popular building blocks for cheminformatics. It understands several popular notations for representing chemical (sub)structures, such as SMILES and SMARTS. Molecules can be rendered in either 2D or 3D with different visualization options, and they can be sketched as well. Chemical properties, descriptors, and fingerprints can be extracted. Predictive models that accept molecules as input can be easily trained, assessed, executed, deployed, reused by other scientists, and used in pipelines or in info panels.
Several toxicity and drug-likeness prediction models are supported. Substructure and similarity search works out of the box for imported data, and can be used efficiently for querying databases via a Postgres chemical cartridge. To further explore collections of molecules, use advanced tools such as diversity search and similarity search.
Simply import the dataset as you normally would - by opening a file, querying a database, connecting to a web service, or by any other method. The platform automatically recognizes chemical structures.
Sketch a molecule using the built-in editor, or retrieve one by entering a compound identifier. The following compound identifiers are natively understood since they have a prefix that uniquely identifies the source system: SMILES, InChI, InChIKey, CHEMBL, MCULE, comptox, and zinc. The rest of the 30+ identifier systems can be referenced by prefixing the identifier with the source name followed by a colon, e.g. 'pubchem:11122'.
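The `source:identifier` convention can be illustrated with a few lines of Python. This is only a sketch, not Datagrok's parser: the helper name `split_identifier` is hypothetical, and it handles only the explicit prefixed form. It does not recognize the natively understood identifiers, and it would mis-split a SMILES string that itself contains a colon.

```python
from typing import Optional, Tuple


def split_identifier(text: str) -> Tuple[Optional[str], str]:
    """Split 'source:id' into (source, id); return (None, text) if no prefix.

    Hypothetical helper for illustration only. It looks at the first colon
    and does nothing smarter than that.
    """
    cleaned = text.strip()
    source, sep, identifier = cleaned.partition(":")
    if not sep or not source or not identifier:
        return None, cleaned
    return source.lower(), identifier


print(split_identifier("pubchem:11122"))  # ('pubchem', '11122')
print(split_identifier("CHEMBL25"))       # (None, 'CHEMBL25')
```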
Many viewers, such as grid, scatter plot, network diagram, tile viewer, bar chart, form viewer, and trellis plot will recognize and render chemical structures.
Chemical intelligence tools are natively integrated into the platform, so in most cases the appropriate functionality is automatically presented based on the user's actions and context. For instance, when a user clicks on a molecule, it becomes the current object, and its properties are shown in the property panel. To see chemically-related actions applicable to a specific column, right-click on the column, and look under Current column | Chem and Current column | Extract. Alternatively, click on the column of interest, and expand the 'Actions' section in the property panel.
Check out 'Tools | Chemistry' to see additional functionality.
As always, it is a good idea to search for functionality using the smart search (Alt+Q), or by opening the registry of available functions under Help | Functions.
Use the 'Extract' popup menu to calculate the following properties: formula, drug likeness, acceptor count, donor count, logP, logD, polar surface area, rotatable bond count, stereo center count.
Chemical descriptors are numerical features extracted from chemical structures for molecular data mining, compound diversity analysis and compound activity prediction. In addition to properties, the platform also makes it easy to compute different sets of molecular descriptors. Supported descriptor sets are: Lipinski, Crippen, EState, EState VSA, Fragments, Graph, MolSurf, QED.
Fingerprints are a very abstract representation of certain structural features of a molecule. Similarity measures, calculations that quantify the similarity of two molecules, and screening, a way of rapidly eliminating molecules as candidates in a substructure search, are both processes that use fingerprints. Datagrok supports the following fingerprints: RDKFingerprint, MACCSKeys, AtomPair, TopologicalTorsion, Morgan/Circular.
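As a rough illustration of a similarity measure over fingerprints, the sketch below computes the Tanimoto coefficient, one common choice. It assumes a fingerprint is represented as a set of "on" bit indices. That representation, the function name, and the handling of two empty fingerprints are choices made for this example, not a description of the platform's internals.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by all distinct 'on' bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


a = {1, 4, 7, 9}
b = {1, 4, 8}
print(tanimoto(a, b))  # 2 shared bits out of 5 distinct bits -> 0.4
```

A value of 1.0 means the two fingerprints have identical bits set; 0.0 means they share none.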
We have implemented two tools that help scientists analyze a collection of molecules in terms of molecular similarity. Both tools are based on applying different distance metrics (such as Tanimoto) to fingerprints.
- Similarity Search - finds structures similar to the specified one
- Diversity Search - finds the 10 most distinct molecules
These tools can be used together as a collection browser. The 'Diverse structures' window shows different classes of compounds present in the dataset; when you click on a molecule representing a class, similar molecules are shown in the 'Similar structures' window.
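One way to pick a small set of mutually distinct molecules is a greedy MaxMin selection over fingerprint distances. The sketch below is an assumption about how such a search could work, not the platform's actual algorithm; fingerprints are modeled as sets of "on" bit indices, and all names and data are made up for the example.

```python
from typing import Dict, List, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """1 - Tanimoto similarity; two empty fingerprints are treated as identical."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def pick_diverse(fps: Dict[str, Set[int]], k: int) -> List[str]:
    """Greedy MaxMin: repeatedly add the item farthest from everything picked so far."""
    names = list(fps)
    if not names or k <= 0:
        return []
    picked = [names[0]]  # arbitrary seed: the first item in the collection
    while len(picked) < min(k, len(names)):
        best = max(
            (n for n in names if n not in picked),
            key=lambda n: min(tanimoto_distance(fps[n], fps[p]) for p in picked),
        )
        picked.append(best)
    return picked


fps = {"m1": {1, 2, 3}, "m2": {1, 2, 4}, "m3": {7, 8, 9}, "m4": {1, 2, 3, 4}}
print(pick_diverse(fps, 2))  # ['m1', 'm3']
```

The result depends on the seed molecule, so a different starting point can yield a different but comparably diverse set.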
- Toxicity - predicts the following toxicity properties: mutagenicity, tumorigenicity, irritating effects, reproductive effects.
- Drug likeness - a score that shows how likely this molecule is to be a drug. The score comes with an interpretation of how different sub-structure fragments contribute to the score.
In contrast to the physical predictive models, machine learning predictive models do not have any intrinsic knowledge about the physical and biological processes. Instead, they use techniques such as random forests or deep learning to discern mathematical relationships between empirical observations of small molecules and extrapolate them to predict chemical, biological and physical properties of novel compounds.
Datagrok enables machine learning predictive models by using chemical properties, descriptors, and fingerprints as features, and the observed properties as results when building predictive models. This lets scientists build predictive models that can be trained, assessed, executed, reused by other scientists, and used in pipelines.
See Cheminformatics predictive modeling for more details and a demo of building and applying a model.
To ensure the quality of subsequent analysis and predictive model development, Datagrok provides convenient tools for chemical dataset curation. Curation assumes that one might modify the dataset for specific purposes and handle cases where the same chemical entity has different representations. Such inconsistencies distort how a molecule is represented in descriptor space and may lead to inconsistent analysis results and non-robust models.
Curation tools include, but are not limited to:
- kekulization
- normalization
- neutralization
- tautomerization
- selection of the main component
See Chemical dataset curation for more details, and a demo with curation examples.
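To make one of the curation steps above concrete, here is a minimal sketch of main-component selection: a multi-fragment SMILES is split on `.` and the fragment with the most atoms is kept. The atom counting is a crude regular expression rather than a real SMILES parser, and the function name is hypothetical; a production workflow would rely on a cheminformatics toolkit instead.

```python
import re

# Crude atom matcher: bracket atoms, two-letter halogens, then the one-letter
# organic subset (aliphatic and aromatic). Each bracket atom counts as one atom,
# including an explicit [H]. This is not a full SMILES parser.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def main_component(smiles: str) -> str:
    """Return the dot-separated fragment with the most atoms (first wins on ties)."""
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        return smiles
    return max(fragments, key=lambda f: len(_ATOM.findall(f)))


print(main_component("CC(=O)[O-].[Na+]"))  # CC(=O)[O-]
```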
Grok lets users easily and efficiently convert molecule identifiers between different source systems, including proprietary company identifiers.
Supported sources are: chembl, pdb, drugbank, pubchem_dotf, gtopdb, ibm, kegg_ligand, zinc, nih_ncc, emolecules, atlas, chebi, fdasrs, surechembl, pubchem_tpharma, pubchem, recon, molport, bindingdb, nikkaji, comptox, lipidmaps, carotenoiddb, metabolights, brenda, pharmgkb, hmdb, nmrshiftdb2, lincs, chemicalbook, selleck, mcule, actor, drugcentral, rhea
To map a whole column of identifiers, use the #{x.ChemMapIdentifiers} function.
The IUPAC name is located in the "Properties" panel.
To retrieve a single structure by an identifier, it might be handy to use the Sketcher.
Click on a molecule to select it as the current object. This brings up the molecule's properties in the property panel. The following panels are part of the 'chem' plugin:
- Structure - 2D structure
- Properties - all above-mentioned properties
- SDF - molfile
- 3D - interactive 3D rendering
- Toxicity - results of the toxicity prediction
- Drug likeness - a score that shows how likely this molecule is to be a drug. The score comes with an interpretation of how different sub-structure fragments contribute to the score.
- Identifiers - all known identifiers for the specified structure (UniChem)
- Patents - patents associated with that structure (SureChEMBL)
- ChEMBL similar structures
- ChEMBL substructure search
- Gasteiger Partial Charges visualization
- Structural alerts
In addition to these pre-defined info panels, users can develop their own using any scripting language supported by the Grok platform. For example, #{x.demo:pythonscripts:GasteigerPartialCharges}.
To search for molecules within the table that contain a specified substructure, click on the molecule column, and press Ctrl+F. To add a substructure filter to the column filters, click on the '☰' icon on top of the filters, and select the molecular column under the 'Add column filter' submenu.
The maximum common substructure (MCS) problem is of great importance in multiple aspects of cheminformatics. It has diverse applications ranging from lead prediction to automated reaction mapping and visual alignment of similar compounds.
To find the MCS for a column with molecules, run the Chem | Find MCS command from the column's context menu. To execute it from the console, use the chem:findMCS(tableName, columnName) command.
R-Group Analysis is a common function in chemistry. Typically, it involves R-group decomposition, followed by the visual analysis of the obtained R-groups. Grok's chemically-aware Trellis Plot is a natural fit for such an analysis.
The scaffold concept is widely applied in medicinal chemistry. Scaffolds are mostly used to represent core structures of bioactive compounds. Although the scaffold concept has limitations and is often viewed differently from a chemical and computational perspective, it has provided a basis for systematic investigations of molecular cores and building blocks, going far beyond the consideration of individual compound series.
Applies a specified reaction to two columns containing molecules. The output table contains a row for each product produced by applying the reaction to the inputs. Each row contains the product molecule, index information, and the reactant molecules that were used.
'Do Matrix Expansion': If checked, each reactant 1 will be combined with each reactant 2 yielding the combinatorial expansion of the reactants. If not checked, reactants 1 and 2 will be combined sequentially, with the longer list determining the number of output rows.
Corresponding function: #{x.demo:pythonscripts:TwoComponentReaction}
See details here.
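The effect of the 'Do Matrix Expansion' option on how reactants are paired can be sketched as follows. This illustrates the pairing logic only and performs no chemistry. The function name is hypothetical, and one detail is an assumption: the text does not say how the shorter list is handled in sequential mode, so this sketch cycles through it.

```python
from itertools import cycle, islice, product
from typing import List, Tuple


def pair_reactants(r1: List[str], r2: List[str], matrix_expansion: bool) -> List[Tuple[str, str]]:
    """Return the (reactant 1, reactant 2) pairs that would be fed to the reaction."""
    if not r1 or not r2:
        return []
    if matrix_expansion:
        # Every reactant 1 combined with every reactant 2.
        return list(product(r1, r2))
    # Sequential pairing: the longer list determines the number of rows.
    # Assumption for this sketch: the shorter list is reused cyclically.
    n = max(len(r1), len(r2))
    return list(zip(islice(cycle(r1), n), islice(cycle(r2), n)))


acids = ["A1", "A2", "A3"]
amines = ["B1", "B2"]
print(len(pair_reactants(acids, amines, True)))  # 6
print(pair_reactants(acids, amines, False))      # [('A1', 'B1'), ('A2', 'B2'), ('A3', 'B1')]
```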
The following cheminformatics-related functions are exposed:
- #{x.ChemSubstructureSearch}
- #{x.ChemFindMCS}
- #{x.ChemDescriptors}
- #{x.ChemGetRGroups}
- #{x.ChemFingerprints}
- #{x.ChemSimilaritySPE}
- #{x.ChemSmilesToInchi}
- #{x.ChemSmilesToCanonical}
- #{x.ChemMapIdentifiers}
A lot of chemical analysis is implemented using scripting functionality:
- #{x.ChemScripts:ButinaMoleculesClustering}
- #{x.ChemScripts:FilterByCatalogs}
- #{x.ChemScripts:GasteigerPartialCharges}
- #{x.ChemScripts:MurckoScaffolds}
- #{x.ChemScripts:SaltStripper}
- #{x.ChemScripts:SimilarityMapsUsingFingerprints}
- #{x.ChemScripts:ChemicalSpaceUsingtSNE}
- #{x.ChemScripts:TwoComponentReaction}
- #{x.ChemScripts:ChemicalSpaceUsingUMAP}
- #{x.ChemScripts:USRCAT}
Function | Molecules | Execution time, s |
---|---|---|
ChemSubstructureSearch | 1M | 70 |
ChemFindMCS | 100k | 43 |
ChemDescriptors (201 descriptors) | 1k | 81 |
ChemDescriptors (Lipinski) | 1M | 164 |
ChemGetRGroups | 1M | 233 |
ChemFingerprints (TopologicalTorsion) | 1M | 782 |
ChemFingerprints (MACCSKeys) | 1M | 770 |
ChemFingerprints (Morgan/Circular) | 1M | 737 |
ChemFingerprints (RDKFingerprint) | 1M | 2421 |
ChemFingerprints (AtomPair) | 1M | 1574 |
ChemSmilesToInChI | 1M | 946 |
ChemSmilesToInChIKey | 1M | 389 |
ChemSmilesToCanonical | 1M | 331 |
Efficient substructure and similarity searching in a database of molecules is a key requirement for any chemical information management system. This is typically done by installing a so-called chemical cartridge on top of a database server. The cartridge extends the server's functionality with molecule-specific operations, made efficient by chemically-aware indexes that are often based on molecular fingerprints. Typically, these operations are functions that can be used as part of a SQL query.
Datagrok provides mechanisms for the automated translation of queries into SQL statements for several commonly used chemical cartridges. We support the following ones:
See DB Substructure and similarity search for details.
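To show what a cartridge-backed query can look like, the sketch below builds a parameterized substructure query. It assumes the RDKit PostgreSQL cartridge, which the text above does not name, and uses that cartridge's `@>` substructure operator and `mol` type. The table and column names are hypothetical, and this is not the SQL that Datagrok itself generates.

```python
def substructure_query(table: str, mol_column: str, limit: int = 100) -> str:
    """Build a substructure query with a %s placeholder for the query SMILES.

    Assumes an RDKit-style cartridge where 'column @> query' means
    'the molecule in column contains query as a substructure'.
    """
    for name in (table, mol_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f"SELECT * FROM {table} WHERE {mol_column} @> %s::mol LIMIT {int(limit)}"


# Hypothetical table 'molecules' with a molecule column 'm'.
sql = substructure_query("molecules", "m")
print(sql)  # SELECT * FROM molecules WHERE m @> %s::mol LIMIT 100
```

The query structure (for example, `c1ccccc1` for a benzene ring) would be passed separately as a parameter by the database driver rather than concatenated into the string.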
- ChEMBL (Postgres)
- UniChem (Postgres)
- TODO: cheminformatics training/demo datasets