The Japanese version of this article is available here: 分子記述子:構造と特性をつなぐ架け橋.
Introduction
In cheminformatics and materials informatics, the need to evaluate physical or chemical properties without extensive laboratory testing has led to the development of in silico models. One widely used approach, Quantitative Structure-Property Relationship (QSPR) modeling, establishes mathematical relationships between molecular structures and their properties. At the heart of QSPR lie molecular descriptors, which serve as numerical representations of molecular structures.
The Role of Molecular Descriptors
As defined by Todeschini and Consonni [1]:
"A molecular descriptor is the end result of a logical and mathematical procedure that converts the chemical information encoded in the symbolic representation of a molecule into a useful number or the result of some standardized experiment."
The primary motivation for developing descriptors is to improve the accuracy of predicting physical properties, especially those that are expensive or time-consuming to measure experimentally. Therefore, molecular descriptors play a fundamental role in QSPR and other in silico models.
From Human-Readable to Machine-Readable
Traditionally, chemical structures have been represented using "structural formulas" and "chemical formulas." While these notations excel at communicating molecular architecture to trained chemists, they pose significant challenges for computational analysis and database operations. To bridge this gap, descriptors have been developed to convert chemical information into strings or numbers that computers can process. This digital representation of chemical structures has become fundamental to modern cheminformatics research, enabling everything from virtual screening of drug candidates to predicting the properties of novel materials.
Classification of Molecular Descriptors
Molecular descriptors can be classified into several categories based on the level of structural information they encode:
0D Descriptors: Consist of structural (constitutional) and quantity (count) descriptors
Examples: molecular weight, number of specific atoms (H, O, halogens), number of rings, number of double bonds
1D Descriptors: Composed of structural fragments or molecular fingerprints
Examples: functional groups like hydroxyl, carboxyl, ketone, azide, amino
2D Descriptors: Primarily topological descriptors, which treat chemical structures as molecular graphs and are derived using graph theory
Examples: vertex degree, adjacency matrix, distance matrix
3D Descriptors: Describe properties based on three-dimensional structures
Examples: results from first-principles calculations (band gap, HOMO-LUMO gap), van der Waals volume, molecular volume, surface area, radius, electrostatic descriptors
Figure: Chemical descriptor classification and examples
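As a rough illustration of the 0D and 2D categories, the following sketch computes a few descriptors for ethanol from a hand-written, hydrogen-suppressed molecular graph, using only the Python standard library. The graph encoding (`atoms`, `bonds`, `implicit_h`) and the function names are assumptions made for this example and do not come from any particular toolkit. The Wiener index is included as one well-known descriptor derived from the distance matrix.

```python
from collections import Counter, deque

# Standard atomic masses (g/mol), rounded; only the elements used here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

# Ethanol (CH3-CH2-OH) as a hydrogen-suppressed molecular graph:
# heavy atoms are vertices, bonds are edges between vertex indices.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
implicit_h = [3, 2, 1]  # hydrogens attached to each heavy atom


def constitutional_descriptors(atoms, implicit_h):
    """0D descriptors: atom counts and molecular weight."""
    counts = Counter(atoms)
    counts["H"] += sum(implicit_h)
    weight = sum(ATOMIC_MASS[element] * n for element, n in counts.items())
    return dict(counts), round(weight, 3)


def adjacency_matrix(n_atoms, bonds):
    """2D representation: entry (i, j) is 1 if atoms i and j are bonded."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj


def distance_matrix(adj):
    """Topological distance = number of bonds on the shortest path (BFS)."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


counts, mol_weight = constitutional_descriptors(atoms, implicit_h)
adj = adjacency_matrix(len(atoms), bonds)
degrees = [sum(row) for row in adj]  # vertex degree of each heavy atom
dist = distance_matrix(adj)
# Wiener index: sum of topological distances over all atom pairs.
wiener_index = sum(
    dist[i][j] for i in range(len(atoms)) for j in range(i + 1, len(atoms))
)
```

For ethanol this gives a molecular weight of about 46.07 g/mol, vertex degrees of [1, 2, 1], and a Wiener index of 4. In practice, a toolkit such as RDKit or Mordred would build the graph from a structure file or line notation and compute such descriptors automatically.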
From Molecules to Descriptors to Properties
It is helpful to think of cheminformatics model building as a two-step process: encoding and modeling. The first step, encoding, converts a molecular structure into a set of molecular descriptors, which may include 2D or 3D structural information. The second step, modeling, uses machine learning techniques to establish a QSPR between the descriptors and the target properties. This two-step approach enables rapid screening of large chemical libraries, allowing researchers to quickly identify promising candidates for further investigation.
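The two steps can be sketched with a deliberately tiny example. Here the "encoding" is a single toy descriptor (the number of carbon atoms, read from the SMILES string of a linear alkane) and the "model" is an ordinary least-squares line. The boiling points are approximate literature values for the first six n-alkanes; the function names are hypothetical and chosen only for this sketch.

```python
# Step 1 (encoding): a toy descriptor, the number of carbon atoms.
# Counting "C" characters only works for the simple linear alkanes used here.
def carbon_count(smiles: str) -> int:
    return smiles.count("C")


# Approximate normal boiling points (K) of methane through n-hexane.
training_data = {
    "C": 111.7,
    "CC": 184.6,
    "CCC": 231.1,
    "CCCC": 272.7,
    "CCCCC": 309.2,
    "CCCCCC": 341.9,
}


# Step 2 (modeling): fit property = slope * descriptor + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


descriptors = [carbon_count(smiles) for smiles in training_data]
boiling_points = list(training_data.values())
slope, intercept = fit_line(descriptors, boiling_points)


def predict_boiling_point(smiles: str) -> float:
    """Apply the fitted QSPR to a new molecule (encode, then predict)."""
    return slope * carbon_count(smiles) + intercept
```

A straight line is a crude model: boiling points of n-alkanes do not rise linearly with chain length, so extrapolated predictions should be treated with caution. Real QSPR workflows combine many descriptors with more flexible machine learning models, but the encode-then-model structure is the same.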
Our Implementations
1. Molecular Descriptors
We at MI-6 have implemented numerous molecular descriptor calculation methods, providing over 5,000 different descriptors to address various material properties. These methods include Mordred[2], RDKit[3], ISIDA[4], CircuS[5], WHALES[6], CDK[7], and CATS[8], among others.
2. Molecular Fingerprints
To capture diverse structural features of molecules, we have implemented several molecular fingerprints, including Morgan fingerprints, RDKit fingerprints, MACCS keys, Extended Reduced Graph (ErG)[9], and MinHashed atom-pair fingerprints[10], among others.
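To give a feel for how a circular (Morgan-style) fingerprint encodes atom environments, here is a simplified sketch in plain Python. It is not the RDKit implementation: it ignores bond orders, charges, and duplicate-environment removal, and the graph encoding, the function names, and the 64-bit length are assumptions made for illustration. The Tanimoto coefficient is shown as one common way to compare two fingerprints.

```python
import zlib


def stable_hash(obj) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted for strings)."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Return the set of on-bit positions of a folded, Morgan-style fingerprint."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is identified by its element and its degree.
    ids = [stable_hash((atoms[i], len(neighbors[i]))) for i in range(len(atoms))]
    features = set(ids)

    # Each iteration merges an atom's identifier with those of its neighbors,
    # so identifiers describe progressively larger circular environments.
    for _ in range(radius):
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the 32-bit identifiers into a fixed-length bit vector.
    return {feature % n_bits for feature in features}


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0


# Hydrogen-suppressed graphs of ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
similarity = tanimoto(ethanol, propanol)
```

Because neighbor identifiers are sorted before hashing and the result is a set, the fingerprint does not depend on the order in which atoms are listed. Folding into a small number of bits can cause collisions, which is why production fingerprints typically use 1,024 or 2,048 bits.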
3. Embedding Techniques
Embeddings translate high-dimensional vectors into relatively low-dimensional spaces. We have implemented commonly used pretrained deep-learning embedding models, such as Mol2vec[11], recurrent neural networks (RNNs), and BERT (Bidirectional Encoder Representations from Transformers)[12], among others.
4. Group-Contribution Methods
Group-contribution methods[13] are simple yet powerful tools for estimating properties of arbitrary organic compounds using only structural information. We have developed a novel group-contribution-based predictive model that provides approximately 50 estimated properties, covering basic physical chemistry, thermodynamics, safety, and environmental fields.
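The idea behind group-contribution methods can be illustrated with the classic Joback method for the normal boiling point, which adds a fixed contribution for each functional group to a base value. This is a textbook example and not the MI-6 model described above. The three group values below are the ones commonly tabulated for the Joback method; they are included for illustration and should be checked against the original table before serious use.

```python
# Joback group contributions to the normal boiling point (K).
# Only three groups are listed; the full method defines many more.
JOBACK_TB = {
    "-CH3": 23.58,
    "-CH2-": 22.88,
    "-OH (alcohol)": 92.88,
}
TB_BASE = 198.2  # base value of the Joback boiling-point correlation (K)


def estimate_boiling_point(group_counts):
    """Tb = base + sum over groups of (count * group contribution)."""
    return TB_BASE + sum(JOBACK_TB[group] * n for group, n in group_counts.items())


# Ethanol, CH3-CH2-OH, decomposed into its groups.
ethanol_groups = {"-CH3": 1, "-CH2-": 1, "-OH (alcohol)": 1}
ethanol_tb = estimate_boiling_point(ethanol_groups)
```

For ethanol the estimate is 198.2 + 23.58 + 22.88 + 92.88 = 337.54 K, compared with an experimental boiling point of roughly 351 K. The gap is typical of simple additive schemes, which trade some accuracy for the ability to estimate properties from structure alone.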
5. DFT and MD Calculations
Density Functional Theory (DFT) calculations provide electronic properties, including the HOMO (Highest Occupied Molecular Orbital) and LUMO (Lowest Unoccupied Molecular Orbital) energies, atomic charges, and dipole polarizability.
We also perform molecular dynamics (MD) simulations: we check whether a system has reached equilibrium, run non-equilibrium MD (NEMD) simulations to calculate thermal conductivity, and compute various other physical property values.
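As a sketch of how thermal conductivity is commonly extracted from an NEMD run, the snippet below applies Fourier's law, kappa = -J / (dT/dz), to a steady-state temperature profile. The profile and heat flux are made-up numbers for illustration and do not come from any MI-6 simulation; the variable and function names are likewise hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical steady-state NEMD output: time-averaged temperature of slabs
# along z, between a hot region (low z) and a cold region (high z).
z_nm = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
temperature_k = [316.0, 312.0, 308.0, 304.0, 300.0, 296.0, 292.0, 288.0, 284.0]
heat_flux = 1.0e9  # W/m^2, flowing along +z from hot to cold (assumed value)

# Temperature gradient from a linear fit, converted from K/nm to K/m.
gradient_k_per_m = least_squares_slope(z_nm, temperature_k) / 1.0e-9

# Fourier's law: heat flows down the temperature gradient.
thermal_conductivity = -heat_flux / gradient_k_per_m  # W/(m*K)
```

With these illustrative inputs the gradient is -2 K/nm and the resulting thermal conductivity is 0.5 W/(m·K). In a real analysis, the slabs next to the heat source and sink are usually excluded from the fit because the temperature profile is nonlinear there.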
6. Customized Descriptors
The cost-benefit analysis of using molecular descriptors is case-dependent and requires careful evaluation. MI-6 offers:
- Feasible descriptor designs for accurate prediction of complex material properties
- Interpretable descriptors based on domain knowledge
Our customized approach ensures that the descriptors are tailored to the specific needs of each project, balancing accuracy, interpretability, and computational efficiency.
Conclusion
Molecular descriptors are essential tools in cheminformatics, enabling the conversion of complex chemical information into numerical values that can be processed by computers. This capability facilitates the development of predictive models, reducing the need for expensive and time-consuming laboratory experiments. As the field advances, the development and refinement of molecular descriptors will continue to play a crucial role in enhancing our understanding of chemical structures and their properties.
References
[1] Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Wiley-VCH: Weinheim, 2009.
[2] Moriwaki, H.; Tian, Y.-S.; Kawashita, N.; Takagi, T. Mordred: A Molecular Descriptor Calculator. Journal of Cheminformatics 2018, 10 (1). https://doi.org/10.1186/s13321-018-0258-y.
[3] RDKit. https://www.rdkit.org/
[4] Ruggiu, F.; Marcou, G.; Varnek, A.; Horvath, D. ISIDA Property‐Labelled Fragment Descriptors. Molecular Informatics 2010, 29 (12), 855–868. https://doi.org/10.1002/minf.201000099.
[5] Byadi, S.; Gantzer, P.; Gimadiev, T.; Sidorov, P. DOPtools: A Python Platform for Descriptor Calculation and Model Optimization. Overview and Usage Guide. 2024. https://doi.org/10.26434/chemrxiv-2024-23v3c.
[6] Grisoni, F.; Merk, D.; Consonni, V.; Hiss, J. A.; Tagliabue, S. G.; Todeschini, R.; Schneider, G. Scaffold Hopping from Natural Products to Synthetic Mimetics by Holistic Molecular Similarity. Communications Chemistry 2018, 1 (1). https://doi.org/10.1038/s42004-018-0043-x.
[7] Willighagen, E.; Mayfield, J. E.; Alvarsson, J.; Berg, A.; Carlsson, L.; Jeliazkova, N.; Kuhn, S.; Pluskal, T.; Rojas-Chertó, M.; Spjuth, O.; Torrance, G.; Evelo, C. T.; Guha, R.; Steinbeck, C. The Chemistry Development Kit (CDK) V2.0: Atom Typing, Depiction, Molecular Formulas, and Substructure Searching. Journal of Cheminformatics 2017, 9 (1). https://doi.org/10.1186/s13321-017-0220-4.
[8] Reutlinger, M.; Koch, C. P.; Reker, D.; Todoroff, N.; Schneider, P.; Rodrigues, T.; Schneider, G. Chemically Advanced Template Search (CATS) for Scaffold-Hopping and Prospective Target Prediction for “Orphan” Molecules. Molecular Informatics 2013, 32 (2), 133–138. https://doi.org/10.1002/minf.201200141.
[9] Stiefl, N.; Watson, I.; Baumann, K.; Zaliani, A. ErG: 2D Pharmacophore Descriptions for Scaffold Hopping. Journal of Chemical Information and Modeling 2005, 46 (1), 208–220. https://doi.org/10.1021/ci050457y.
[10] Riniker, S.; Landrum, G. A. Open-Source Platform to Benchmark Fingerprints for Ligand-Based Virtual Screening. Journal of Cheminformatics 2013, 5 (1). https://doi.org/10.1186/1758-2946-5-26.
[11] Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. Journal of Chemical Information and Modeling 2018, 58 (1), 27–35. https://doi.org/10.1021/acs.jcim.7b00616.
[12] Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv:2010.09885 [physics, q-bio] 2020.
[13] Gani, R. Group Contribution-Based Property Estimation Methods: Advances and Perspectives. Current Opinion in Chemical Engineering 2019, 23, 184–196. https://doi.org/10.1016/j.coche.2019.04.007.
[14] Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim; New York, 2000.
[15] Mitchell, J. B. O. Machine Learning Methods in Chemoinformatics. Wiley Interdisciplinary Reviews: Computational Molecular Science 2014, 4 (5), 468–481. https://doi.org/10.1002/wcms.1183.
[16] Mauri, A.; Consonni, V.; Todeschini, R. Molecular Descriptors. Handbook of Computational Chemistry 2017, 2065–2093. https://doi.org/10.1007/978-3-319-27282-5_51.
[17] Grisoni, F.; Ballabio, D.; Todeschini, R.; Consonni, V. Molecular Descriptors for Structure-Activity Applications: A Hands-on Approach. Methods in Molecular Biology (Clifton, N.J.) 2018, 1800, 3–53. https://doi.org/10.1007/978-1-4939-7899-1_1.