2024.12.17

Introduction of Interpretable Molecular Descriptors via Group-Contribution Methods

Master's in chemoinformatics (Funatsu Laboratory, The University of Tokyo). Research themes cover concentration prediction and anomaly detection using soft sensors. Currently working on Hands-on MI®︎ projects.

1. Introduction
2. Implementation and features
3. Conclusions and prospects
References

Group-contribution methods (GC methods) are a simple and powerful tool for estimating the properties of organic compounds from their structural information alone. In this article, we briefly introduce these methods. We also present MI-6’s GC-based predictive model and discuss its key features and future prospects.

1. Introduction

In today’s chemical and materials industries, millions of organic compounds have been identified, and novel compounds are continuously synthesized worldwide. How can we determine the properties of such a vast number of compounds? Because of cost and time constraints, it is impractical to measure all of their properties experimentally. However, advances in computational technology have made it possible to estimate their properties in silico.

Various in silico methods have been proposed in the field of chemistry, including quantum chemical calculations, quantitative structure–activity/property relationships (QSAR/QSPR), and group-contribution methods (GC methods). Although GC methods were introduced more than half a century ago, they remain widely used due to their simplicity and robustness. Figure 1 illustrates the position of GC methods among in silico and conventional approaches. In general, there is a trade-off between simplicity and predictive power when evaluating methods; however, GC methods are known for achieving a balance between the two.

Fig.1 Scheme for multi-level property estimation

The basic idea of GC methods is that a molecular structure can be broken into fragments. For organic compounds, functional groups are the most frequently adopted fragments, but GC methods do not limit fragments to functional groups. Fragments can be defined in any way that gives each of them a meaningful value. Each fragment is assigned a partial value called a contribution. A property of a compound is then obtained by summing the contributions of the fragments present in its molecular structure. For this reason, GC methods are often described as additive methods or building block methods.

Fig.2 Schematic representation of GC methods
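
The additive idea can be sketched in a few lines of Python. The sketch assumes the molecule has already been split into fragments by hand. The fragment names, the contribution values, and the `estimate_property` helper are all hypothetical; they are not taken from MI-6's model or from any published parameter table.

```python
from collections import Counter

# Hypothetical contribution table: placeholder numbers for illustration only,
# not fitted values from any published GC model.
CONTRIBUTIONS = {"CH3": 1.5, "CH2": 1.0, "OH": 4.0}


def estimate_property(fragment_counts, contributions, intercept=0.0):
    """Return intercept + sum(count * contribution) over all fragments."""
    total = intercept
    for fragment, count in fragment_counts.items():
        total += count * contributions[fragment]
    return total


# Ethanol (SMILES: CCO), split by hand into one CH3, one CH2 and one OH.
ethanol = Counter({"CH3": 1, "CH2": 1, "OH": 1})
# 1-Propanol (SMILES: CCCO) has one more CH2 fragment than ethanol.
propanol = Counter({"CH3": 1, "CH2": 2, "OH": 1})

print(estimate_property(ethanol, CONTRIBUTIONS))   # 1.5 + 1.0 + 4.0
print(estimate_property(propanol, CONTRIBUTIONS))  # 1.5 + 2 * 1.0 + 4.0
```

Many published GC models apply the sum to a transformed property rather than to the property itself, so the plain sum here should be read as the simplest case.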

2. Implementation and features

Before presenting the implementation of our model, we discuss intrinsic features of GC methods. While simple and powerful, GC methods have some limitations:

If a fragment is missing from a model's fragment table, its contribution is unknown. GC-based models are therefore limited to specific families of compounds.
Proximity and isomer effects cannot be predicted reliably. This is also a limitation of SMILES (Simplified Molecular Input Line Entry System) notation.
Accuracy decreases for large multifunctional molecules.
GC methods are applicable only to organic systems.
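
The first two limitations can be illustrated with a minimal Python sketch. It assumes fragments are counted by hand and uses a hypothetical contribution table; the group labels (`aCH`, `aC-CH3`, `aC-NO2`), the values, and the `estimate_or_report` helper are illustrative assumptions, not part of any real model.

```python
# Hypothetical contribution table covering only two aromatic fragments.
# The values are placeholders, not fitted parameters.
TABLE = {"aCH": 0.5, "aC-CH3": 2.0}


def estimate_or_report(fragment_counts, contributions):
    """Return (estimate, missing_fragments).

    The estimate is None when any fragment has no contribution in the
    table, because the sum cannot be completed in that case.
    """
    missing = sorted(f for f in fragment_counts if f not in contributions)
    if missing:
        return None, missing
    estimate = sum(n * contributions[f] for f, n in fragment_counts.items())
    return estimate, []


# o-Xylene and p-xylene contain the same fragments in the same numbers,
# so a purely additive model cannot tell the two isomers apart.
o_xylene = {"aCH": 4, "aC-CH3": 2}
p_xylene = {"aCH": 4, "aC-CH3": 2}

# Nitrobenzene needs a fragment that this table does not cover.
nitrobenzene = {"aCH": 5, "aC-NO2": 1}

print(estimate_or_report(o_xylene, TABLE))
print(estimate_or_report(p_xylene, TABLE))
print(estimate_or_report(nitrobenzene, TABLE))
```

Returning the list of missing fragments, rather than silently skipping them, makes it explicit when a compound falls outside the families the table was built for.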

Despite those limitations, the biggest advantage of GC methods is that they require only the structural information of a chemical compound. By preparing its SMILES notation, we can estimate its properties without conducting costly and time-consuming experiments. Moreover, GC methods can be used to estimate the properties of not only pure compounds but also mixtures.

Considering the intrinsic features above, we developed a novel GC-based predictive model that fully utilizes the advantages of GC methods. Our model provides approximately 50 estimated properties, covering basic physical chemistry, chemical engineering, and safety- and environment-related fields. The two tables below show examples of these estimated properties.

Table1. Properties estimated by our GC-based predictive model-1

Table2. Properties estimated by our GC-based predictive model-2

In addition to the basic chemical properties, safety- and environment-related properties are a key feature of our predictive model. Chemistry is no exception to the growing emphasis on sustainable development, and the necessity for so-called green chemistry has been widely recognized. GlaxoSmithKline plc (GSK plc) proposed the GSK solvent selection guide. USEtox developed a model to assess the human and ecological toxicity of chemicals. In this context, GC methods have been studied for estimating safety- and environment-related properties. We integrated the findings of those studies into our model. Additionally, our model visualizes molecules. Figure 3 shows an example for biodegradability, in which the contribution of each fragment is visualized. Compared to conventional models, which provide only basic chemical properties, our predictive model extends the possibilities of GC methods.

Fig.3 Fragment-Based Visualization of Biodegradability

3. Conclusions and prospects

So far, we have introduced GC methods. Although GC methods are long established, they are still applied because they offer simple, fast, and robust predictions. In developing our predictive model, we took full advantage of these strengths. In this section, we discuss the future prospects of GC methods.

As mentioned in Section 1, defining fragments is the core idea of GC methods. Gani defined primary and secondary groups (up to 424 groups) based on SMILES notation. Redefining these groups may improve the performance of GC-based models. Another suggestion is to explore structural notations other than SMILES. To address the limitations of SMILES, other notations have been proposed, such as BigSMILES and SMARTS (SMILES Arbitrary Target Specification).
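
One way to picture the role of group levels is the Python sketch below. It assumes that secondary groups act as additive corrections on top of the primary-group sum, which is how multi-level GC models are commonly described. The group labels, the values, and the `two_level_estimate` helper are hypothetical and do not reproduce any published group definitions.

```python
# Hypothetical primary (basic building block) contributions.
PRIMARY = {"aCH": 0.5, "aC-CH3": 2.0}
# Hypothetical secondary contribution describing two methyl groups on
# adjacent ring atoms; the label and value are illustrative only.
SECONDARY = {"ortho-CH3/CH3": 0.25}


def two_level_estimate(primary_counts, secondary_counts,
                       primary_table, secondary_table):
    """Sum primary contributions, then add secondary corrections."""
    primary = sum(n * primary_table[g] for g, n in primary_counts.items())
    secondary = sum(n * secondary_table[g] for g, n in secondary_counts.items())
    return primary + secondary


# Both xylene isomers share the same primary groups ...
primary_groups = {"aCH": 4, "aC-CH3": 2}
# ... but only o-xylene matches the hypothetical secondary group.
o_xylene = two_level_estimate(primary_groups, {"ortho-CH3/CH3": 1},
                              PRIMARY, SECONDARY)
p_xylene = two_level_estimate(primary_groups, {}, PRIMARY, SECONDARY)

print(o_xylene, p_xylene)
```

In this picture, redefining groups means changing the keys of the two tables and refitting their values; the summation itself stays the same.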

Fig.4 Application possibilities of the properties estimated by GC methods

In addition, the estimated properties can be combined with other methods. Over the past decade, artificial intelligence and data science (AI/DS) have emerged as a major trend in chemistry R&D as well. However, one of the obstacles to applying AI/DS is the shortage of data, particularly in this field. When training AI/DS models, insufficient datasets often lead to overfitting, which reduces predictive performance. Here, our GC-based model can offer a solution. As depicted in Figure 4, the estimated properties have the potential to be used as descriptors for training more sophisticated AI/DS models. The possibilities of GC methods align with the evolving trends of the Fourth Industrial Revolution.
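
As a rough illustration of using estimated properties as descriptors, the Python sketch below arranges per-compound estimates into a feature matrix and standardizes each column before it would be passed to a downstream model. The compound names, property names, numbers, and both helper functions are hypothetical placeholders; the sketch does not use the actual output format of our model.

```python
from statistics import mean, pstdev

# Placeholder estimates for two compounds; the numbers are made up.
ESTIMATES = {
    "compound_A": {"property_1": 1.0, "property_2": 10.0},
    "compound_B": {"property_1": 3.0, "property_2": 10.0},
}
PROPERTY_NAMES = ["property_1", "property_2"]


def build_descriptor_matrix(estimates, property_names):
    """Rows are compounds, columns are properties in a fixed order."""
    return [[estimates[c][p] for p in property_names] for c in estimates]


def standardize(matrix):
    """Scale each column to zero mean and unit standard deviation.

    A constant column has zero standard deviation, so it is divided by
    1.0 instead and ends up as all zeros.
    """
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in matrix]


X = standardize(build_descriptor_matrix(ESTIMATES, PROPERTY_NAMES))
print(X)
```

Keeping the property order fixed in `PROPERTY_NAMES` ensures that every compound yields a descriptor vector with the same column meaning.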

References

Gani, R. Group Contribution-Based Property Estimation Methods: Advances and Perspectives. Current Opinion in Chemical Engineering 2019, 23, 184–196. https://doi.org/10.1016/j.coche.2019.04.007.
Byrne, F. P.; Jin, S.; Paggiola, G.; Petchey, T. H. M.; Clark, J. H.; Farmer, T. J.; Hunt, A. J.; Robert McElroy, C.; Sherwood, J. Tools and Techniques for Solvent Selection: Green Solvent Selection Guides. Sustainable Chemical Processes 2016, 4 (1). https://doi.org/10.1186/s40508-016-0051-z.
USEtox® | Developed by the USEtox® Team. Usetox.org. https://usetox.org/ (accessed 2024-12-02).
Hukkerikar, A. S.; Kalakul, S.; Sarup, B.; Young, D. M.; Sin, G.; Gani, R. Estimation of Environment-Related Properties of Chemicals for Design of Sustainable Processes: Development of Group-Contribution+ (GC+) Property Models and Uncertainty Analysis. Journal of Chemical Information and Modeling 2012, 52 (11), 2823–2839. https://doi.org/10.1021/ci300350r.
Simon, R. H. M. Estimation of Critical Properties of Organic Compounds by the Method of Group Contributions. A. L. Lydersen. Engineering Experiment Station Report 3. College of Engineering, University of Wisconsin, Madison, Wisconsin (1955). 22 Pages. AIChE Journal 1956, 2 (3), 12S–12S. https://doi.org/10.1002/aic.690020328.