2025.01.09

Organic Semiconductor Molecular Design with Hierarchical AI Models

Ph.D. in Engineering specializing in Chemoinformatics (University of Tokyo, Funatsu Laboratory). Previously worked at Kao Corporation focusing on Materials Informatics (MI). Expert in organic small molecules and polymers, with extensive experience in condition/formulation optimization. Proven track record in applying MI analysis across diverse research themes.

Overview
The limitations of current AI-assisted molecular design for OSC
Our proposed solution
Results of Organic Molecular Generation
Conclusion
References

Overview

Organic semiconductors (OSCs) represent a crucial class of materials that combine electronic properties with the versatile characteristics of organic compounds. Their potential in biomedical applications stems from key advantages including solution-processability and mechanical flexibility.

The field of molecular design has been transformed by the recent emergence of generative models, which have proven particularly valuable in creating novel and selective molecules for specific applications. These models have found significant success in drug design where most implementations use string-based SMILES notation. Notable examples include Recurrent Neural Networks (RNN), BERT (Bidirectional Encoder Representations from Transformers), and Variational Autoencoders (VAE).

The application of these generative models to organic semiconductor design presents an exciting opportunity to discover new OSCs with enhanced properties that better align with desired applications. This computational approach could significantly accelerate the development of next-generation organic semiconductors while potentially uncovering novel molecular architectures that might be overlooked through traditional design methods.

The limitations of current AI-assisted molecular design for OSC

OSCs pose unique challenges due to their complex molecular architectures, which are typically rich in fused and linked rings. String-based representations such as SMILES require matching ring-closure digits to close each ring, which increases the likelihood of generating invalid molecules. In addition, changing a small part of a ring often necessitates rewriting much of the string. OSCs also tend to have long molecular chains, and the resulting lengthy SMILES strings are difficult for sequence models to handle.

An alternative is to adopt graph-based generative models. Several algorithms have been proposed for this purpose, including JT-VAE. With a graph-based decoder, a generated molecule can be expanded incrementally while maintaining chemical validity at every step.
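To make the ring-closure problem concrete, here is a minimal sketch in plain Python. The helper `unclosed_ring_labels` is hypothetical (it is not part of any library) and checks only one validity condition, namely that every ring-closure label appears in pairs. A real pipeline would rely on a full SMILES parser from a chemoinformatics toolkit.

```python
def unclosed_ring_labels(smiles: str) -> set:
    """Return ring-closure labels that are opened but never closed.

    Handles single-digit labels and the two-digit %nn form. Digits inside
    bracket atoms (isotopes, charges, H counts) are skipped. Raises
    ValueError if a bracket atom is never closed.
    """
    open_labels = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom, e.g. [nH] or [13C].
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            label, i = smiles[i + 1 : i + 3], i + 3
        elif ch.isdigit():
            label, i = ch, i + 1
        else:
            i += 1
            continue
        # First occurrence opens a ring, second occurrence closes it.
        open_labels ^= {label}
    return open_labels


bithiophene = "c1csc(-c2cccs2)c1"
broken = "c1csc(-c2cccs)c1"  # one ring-closure digit dropped

print(unclosed_ring_labels(bithiophene))  # empty set: all rings closed
print(unclosed_ring_labels(broken))       # ring label "2" is left open
```

Dropping a single character is enough to invalidate the whole string, and the longer the OSC backbone, the more ring labels a sequence model has to keep track of.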

Another challenge lies in the limited chemical space explored in traditional research, resulting in a dearth of experimentally validated structures. This scarcity of data hampers the development of accurate machine learning models for property prediction. Additionally, the complex structure-property relationships of OSCs make it difficult to accurately model their electronic and structural properties computationally. Furthermore, the synthesis and characterization of OSCs can be complex and time-consuming, slowing down the discovery and optimization process. Overcoming these challenges is crucial for realizing the full potential of OSCs.

Our proposed solution

To address the challenges of representing complex OSC structures and limited data availability, we combine graph-based molecular representations with scaffold-based data augmentation. The aim is to expand the accessible chemical space and improve the accuracy of property prediction models, covering both molecular generation and carrier mobility prediction. The workflow diagram below outlines the key steps of our method.

Fig.1 AI-Driven Organic Semiconductor Design Workflow

Data Augmentation Framework

We implemented two complementary augmentation strategies to enhance dataset diversity:

Mix-Key Scaffold Augmentation: Swaps edges (bonds) within molecular scaffolds while preserving functional groups.
Hetero Shuffling: Systematically replaces atoms in heterocyclic scaffolds using predefined atom sets, generating novel molecular combinations while maintaining structural validity.

Figure 2 illustrates the data augmentation process used to expand the chemical space for organic semiconductor molecular design. The workflow consists of two key strategies: (1) Mix-Key Scaffold Augmentation, which rearranges bonds within scaffolds while preserving functional groups, followed by ring size filtering to ensure chemical validity, and (2) Hetero Shuffling, which replaces atoms in heterocyclic scaffolds using predefined atom sets. These steps generate a diverse and chemically valid dataset, progressing from 670 initial scaffolds to a final augmented dataset of 10,670 unique structures. This enriched dataset enables the training of generative and predictive models, facilitating the discovery of novel high-performance candidates.

Fig.2 Data Augmentation Workflow for Organic Semiconductor Design
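The Hetero Shuffling idea can be sketched with nothing more than string templates. This is an illustrative toy, not the implementation used in this work: the atom set `HETERO_ATOMS`, the `*` slot marker, and the function `hetero_shuffle` are all assumptions made for the example, and the article does not list the actual atom sets.

```python
from itertools import product

# Hypothetical atom set for illustration only.
HETERO_ATOMS = ["s", "o", "[nH]"]


def hetero_shuffle(template: str, atoms=HETERO_ATOMS) -> list:
    """Fill every '*' slot in a SMILES-like template with each atom choice."""
    parts = template.split("*")
    n_slots = len(parts) - 1
    variants = []
    for combo in product(atoms, repeat=n_slots):
        pieces = [parts[0]]
        for atom, tail in zip(combo, parts[1:]):
            pieces.append(atom + tail)
        variants.append("".join(pieces))
    return variants


# Fused bicyclic core made of two five-membered rings, with two heteroatom slots.
core = "c1cc2*ccc2*1"
variants = hetero_shuffle(core)

for smi in variants:
    print(smi)
```

With three atom choices and two slots, the template yields nine strings. Some of them describe the same molecule because the core is symmetric, so a real pipeline would canonicalize, deduplicate, and validate the structures with a toolkit such as RDKit, as in the heterocycle enumeration notebooks listed in the references.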

Generative Model Architecture

We use HierVAE, a motif-based hierarchical variational autoencoder, as the generative model. It encodes and decodes molecules using larger building blocks at the motif level, which reduces the number of decoding steps required. HierVAE has been shown to achieve better reconstruction accuracy, training speed, and generation metrics than previous approaches.

The motifs are frequently occurring substructures extracted from the dataset and used as basic building blocks. Motif extraction consists of three steps: finding bridge bonds, detaching bridge bonds, and selecting as motifs the fragments that meet a minimum frequency criterion. Fragments that do not meet the criterion are broken down into smaller fragments. The autoencoder operates hierarchically in a coarse-to-fine manner across three layers: the motif layer, the attachment layer, and the atom layer.
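The three motif-extraction steps can be illustrated on a toy molecular graph. The sketch below is a simplification under stated assumptions: a "bridge bond" is taken here to be any non-ring bond with at least one endpoint in a ring, which may differ in detail from the criterion used by HierVAE itself. All function names, the fragment labels, and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict


def connected_components(n_atoms, bonds):
    """Group atom indices 0..n_atoms-1 into connected components."""
    neighbors = defaultdict(set)
    for u, v in bonds:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(neighbors[atom] - component)
        seen |= component
        components.append(frozenset(component))
    return components


def ring_bonds(n_atoms, bonds):
    """A bond lies in a ring if its endpoints stay connected without it."""
    in_ring = set()
    for bond in bonds:
        remaining = [b for b in bonds if b != bond]
        u, v = bond
        if any(u in c and v in c for c in connected_components(n_atoms, remaining)):
            in_ring.add(bond)
    return in_ring


def find_bridge_bonds(n_atoms, bonds):
    """Step 1 (simplified): non-ring bonds that touch at least one ring atom."""
    rings = ring_bonds(n_atoms, bonds)
    ring_atoms = {atom for bond in rings for atom in bond}
    return [b for b in bonds if b not in rings and (b[0] in ring_atoms or b[1] in ring_atoms)]


def fragment(n_atoms, bonds):
    """Step 2: detach the bridge bonds and return the resulting fragments."""
    bridges = set(find_bridge_bonds(n_atoms, bonds))
    kept = [b for b in bonds if b not in bridges]
    return connected_components(n_atoms, kept)


def select_motifs(fragment_labels, min_count):
    """Step 3: keep frequent fragments; rare ones need further decomposition."""
    counts = Counter(fragment_labels)
    frequent = {label for label, n in counts.items() if n >= min_count}
    return frequent, set(counts) - frequent


# Toy graph: two six-membered rings (atoms 0-5 and 6-11) joined by bond (5, 6).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5),
         (5, 6),
         (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (6, 11)]
fragments = fragment(12, bonds)

# Hypothetical fragment labels collected over a dataset (canonical SMILES in practice).
labels = ["thiophene"] * 3 + ["benzene"] * 2 + ["rare_fused_core"]
frequent, rare = select_motifs(labels, min_count=2)
```

On the toy graph, the single inter-ring bond is detached and the two rings become separate fragments. In the frequency step, the rare label is the one that would be broken down further.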

Property Prediction Framework

This framework covers three classification tasks: distinguishing N-type from P-type semiconductors, and classifying the mobility level within each of the two types. We use molecular fingerprints as input features and train several widely used machine learning algorithms, including Support Vector Machines, Random Forest, and K-Nearest Neighbors. Hyperparameters are optimized through grid search with cross-validation. Because the dataset is class-imbalanced, we evaluate the models with several metrics rather than a single score.
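The reason class imbalance calls for more than one metric can be shown with a small worked example. The article does not name the metrics it uses, so balanced accuracy and the Matthews correlation coefficient (MCC) below are our choice for illustration, and the labels are synthetic.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the recall on the positive class and on the negative class."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Synthetic labels: 10 high-mobility (1) and 90 low-mobility (0) molecules.
y_true = [1] * 10 + [0] * 90

# Model A always predicts the majority class.
pred_a = [0] * 100
# Model B finds 8 of the 10 positives at the cost of 9 false positives.
pred_b = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

for name, pred in [("A", pred_a), ("B", pred_b)]:
    counts = confusion_counts(y_true, pred)
    print(name, accuracy(*counts), balanced_accuracy(*counts), mcc(*counts))
```

Model A scores higher on plain accuracy even though it never finds a high-mobility molecule, while balanced accuracy and MCC both rank Model B higher. This is why a single accuracy figure can be misleading on imbalanced mobility data.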

Results of Organic Molecular Generation

Molecular generation of OSC

Our approach generates OSC candidates quickly, producing 10,000 novel molecules in 30 minutes on a standard personal computer. Scaffold-based data augmentation has proven crucial for expanding the structural diversity of the generated molecules: it significantly reduces the average scaffold similarity among generated compounds, leading to a broader exploration of chemical space and increasing the likelihood of discovering molecules with unique properties.

To evaluate the distribution and coverage of the generated molecules, we computed ECFP4 fingerprints with 2,048 bits and applied t-SNE dimensionality reduction for visualization. The analysis shows that data augmentation yields a more continuous distribution than the baseline methods, and that the generated molecules cover the chemical space more completely. This suggests that our method can reach previously unexplored regions of chemical space, potentially leading to novel OSC structures.

Fig.3 Comparison of Chemical Space Coverage

The purple circle highlights the region of chemical space where the HVAE-augmented molecules (red dots) show significantly higher density and diversity compared to the HVAE raw molecules (blue dots) and the original training set (green dots). This indicates that the augmentation strategy has effectively expanded the chemical space, allowing access to previously unexplored regions.
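The "average scaffold similarity" comparison mentioned above can be sketched as a mean pairwise similarity over fingerprints. The article does not state which similarity measure was used, so Tanimoto similarity is an assumption here. The fingerprints are hand-written sets of "on" bit indices standing in for real 2,048-bit ECFP4 vectors.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_similarity(fingerprints) -> float:
    """Average Tanimoto similarity over all unordered pairs; lower means more diverse."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints: a narrow set sharing most bits, and a more diverse set.
narrow = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 2, 3, 6})]
diverse = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8}), frozenset({1, 5, 9, 10})]

print(mean_pairwise_similarity(narrow))
print(mean_pairwise_similarity(diverse))
```

A lower mean for one set of generated scaffolds than for another is the kind of signal that would indicate broader coverage, under the assumptions stated above.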

Evaluation of High-Mobility P-Type Semiconductors

We assessed the generated OSC molecules using the top-performing model from each classification task, focusing on the identification of high-mobility P-type semiconductors. The molecules predicted to be high-mobility P-type are visualized in the following figure.

Fig.4 Generated Candidates for High-Mobility P-Type Organic Semiconductor Molecules

Conclusion

In this article, we introduced an AI-driven approach to accelerate the discovery and optimization of organic semiconductors (OSCs). By leveraging hierarchical AI models and scaffold-based data augmentation strategies, we addressed key challenges in representing complex OSC structures and expanding the chemical space. The ability of our approach to identify candidate high-mobility P-type semiconductors underscores its applicability to the design of next-generation materials for advanced electronic and biomedical devices. Future directions include extending property predictions to thermal and optical characteristics, as well as experimental validation of the generated molecules.

References

Lu, N.; Li, L.; Geng, D.; Liu, M. A Review for Polaron Dependent Charge Transport in Organic Semiconductor. Organic Electronics 2018, 61, 223–234. https://doi.org/10.1016/j.orgel.2018.05.053.
Zhang, X.; Wei, G.; Sheng, Y.; Bai, W.; Yang, J.; Zhang, W.; Ye, C. Polymer-Unit Fingerprint (PUFp): An Accessible Expression of Polymer Organic Semiconductors for Machine Learning. ACS Applied Materials & Interfaces 2023, 15 (17), 21537–21548. https://doi.org/10.1021/acsami.3c03298.
iwatobipen. Generate possible heteroaromatic cores from query molecule #RDKit #chemoinformatics. Is life worth living? https://iwatobipen.wordpress.com/2019/03/29/generate-possible-heteroaromatic-cores-from-query-molecule-rdkit-chemoinformatics (accessed 2024-12-10).
rdkit. UGM_2017/Notebooks/Cole-Enumerate-Heterocycles.ipynb at master · rdkit/UGM_2017. GitHub. https://github.com/rdkit/UGM_2017/blob/master/Notebooks/Cole-Enumerate-Heterocycles.ipynb (accessed 2024-12-10).
Jiang, T.; Wang, Z.; Yu, W.; Wang, J.; Yu, S.; Bao, X.; Wei, B.; Xuan, Q. Mix-Key: Graph Mixup with Key Structures for Molecular Property Prediction. Briefings in Bioinformatics 2024, 25 (3). https://doi.org/10.1093/bib/bbae165.
Jin, W.; Barzilay, R.; Jaakkola, T. Hierarchical Generation of Molecular Graphs Using Structural Motifs. PMLR 2020, 4839–4848.