2025.02.03

Advanced Mass Spectrometry Analysis: Machine Learning Applications in GC-MS and LC-MS Data Processing

The application of machine learning to characterization is a key technology in Materials Informatics (MI) and Lab Automation (LA) for rapidly and accurately evaluating the properties of materials and samples. In this article, we introduce data-driven analytical approaches that employ machine learning and deep learning to address challenges inherent in mass spectrometry analysis using GC-MS and LC-MS, such as the vast amount of data, variability in peak shapes, retention time shifts, and peak overlaps. These approaches enable automated and accelerated pre-processing, enhanced reproducibility, and the identification of unknown components, thereby significantly improving the efficiency and accuracy of analyses.

During my bachelor's and master's studies, I specialized in nanomaterial analysis using techniques such as XRD, Raman, XPS, and EDS, as well as sensor data processing. For my doctoral research at the University of Tokyo and Kyushu University, I applied machine learning to GC-MS spectra and gas sensor data in order to extract features for human breath analysis and artificial olfactory systems. I now work as a data scientist at MI-6, where I focus on developing an automated platform for extracting features from spectral data.

GC-MS and LC-MS Data analysis
Challenge in GC-MS/LC-MS analysis
Data-Driven Solutions to the Challenges
Application: Molecular Identification
Conclusion and Perspectives
Reference

GC-MS and LC-MS Data analysis

Mass spectrometry is an essential tool for the structural analysis and quantification of compounds. In particular, gas chromatography–mass spectrometry (GC-MS) and liquid chromatography–mass spectrometry (LC-MS) are widely utilized in diverse fields such as environmental analysis, pharmaceutical research, proteomics, and metabolomics, because they can separate and detect sample components with high precision.

GC-MS combines gas chromatography with mass spectrometry and excels in the analysis of volatile and thermally stable compounds.
LC-MS combines liquid chromatography with mass spectrometry, making it well-suited for analyzing large, highly polar, and thermally labile molecules.

In recent years, there has been a shift from traditional methods that rely on peak extraction and manual parameter adjustments to data-driven analytical approaches employing machine learning and deep learning. This evolution offers several advantages:

Automation and Speed: The entire process—from pre-processing to analysis and interpretation—is automated, enabling rapid handling of vast amounts of data.
Enhanced Reproducibility: Since the results are not dependent on manual adjustments, the consistency of the analytical outcomes is greatly improved.
Identification of Unknown Components: By statistically analyzing the complete dataset, these approaches can suggest candidate compounds that are not present in existing databases.

Challenges in two-dimensional GC-MS analysis include technical issues such as managing data volume, retention time shifts, and peak overlaps. Machine learning is effective in addressing these challenges, and advanced normalization and peak separation techniques contribute to improved analytical accuracy. This also supports advances in Materials Informatics and Lab Automation.

Fig.1 Challenge in GC-MS/LC-MS analysis

Challenge in GC-MS/LC-MS analysis

GC-MS and LC-MS data are two-dimensional: each run combines chromatographic separation over time with the mass spectral information obtained at each time point, and both dimensions must be considered simultaneously.
This structure gives rise to several analytical challenges:

Vast amounts of data
Variability in peak shapes
Retention time shifts
Peak overlaps

These issues are further complicated by factors such as instrument variability, contamination, the type of column used, and the need for stringent calibration and normalization procedures.
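To make the two-dimensional structure described above concrete, the following minimal sketch builds a synthetic run as a (time points x m/z bins) matrix and reads it along both axes. The array sizes, the retention time, and the fragment ions at m/z 91 and 120 are illustrative assumptions, not measured data.

```python
import numpy as np

# Synthetic run: 300 scans (time points) x 200 m/z bins. All values are illustrative.
rng = np.random.default_rng(0)
n_scans, n_mz = 300, 200
times = np.linspace(0.0, 10.0, n_scans)   # retention time in minutes
mz_axis = np.arange(50, 50 + n_mz)        # nominal m/z bins 50..249


def gaussian(x, center, width):
    """Gaussian peak shape with unit height."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)


# One hypothetical compound eluting near 4.0 min with ions at m/z 91 and 120.
spectrum = np.zeros(n_mz)
spectrum[91 - 50] = 1.0
spectrum[120 - 50] = 0.4
data = 1000.0 * np.outer(gaussian(times, 4.0, 0.1), spectrum)
data += rng.uniform(0.0, 1.0, size=data.shape)   # small positive noise

# Reading the matrix along each axis gives the familiar views of a run.
tic = data.sum(axis=1)            # total ion chromatogram (one value per scan)
eic_91 = data[:, 91 - 50]         # extracted ion chromatogram for m/z 91
apex_scan = int(np.argmax(tic))
apex_spectrum = data[apex_scan]   # mass spectrum recorded at the peak apex

print(f"TIC apex at {times[apex_scan]:.2f} min")
print(f"Base peak at m/z {mz_axis[np.argmax(apex_spectrum)]}")
```

Every challenge listed above shows up in this matrix: its size grows with both axes, peak shapes and retention times vary along the time axis, and co-eluting compounds add their rows together.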

Data-Driven Solutions to the Challenges

Challenge 1: Vast Amounts of Data

A single GC-MS/LC-MS measurement generates mass spectra at hundreds to thousands of time points, resulting in an enormous total volume of data. Traditional methods struggle with the computational burden of processing such large datasets, and they also depend on manual parameter adjustments and lose information during pre-processing.

To address this, automated segmentation techniques based on machine learning can be employed. For example, the chromatogram can be partitioned into appropriate segments, and each segment can be compressed through tensor decomposition so that only the essential information is retained.
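The segment-and-compress idea can be sketched as follows. This is a simplified illustration under two assumptions of ours: segments are fixed-length rather than chosen by a learned model, and each segment is compressed with a truncated SVD, which is the matrix (two-way) special case of a low-rank tensor decomposition. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def compress_segments(data, n_segments, rank):
    """Split a (scans x m/z) matrix along time; keep a rank-`rank` SVD per segment."""
    compressed = []
    for segment in np.array_split(data, n_segments, axis=0):
        u, s, vt = np.linalg.svd(segment, full_matrices=False)
        k = min(rank, len(s))
        # Store scaled time profiles and their matching spectral patterns.
        compressed.append((u[:, :k] * s[:k], vt[:k]))
    return compressed


def reconstruct(compressed):
    """Rebuild the full matrix from the stored low-rank factors."""
    return np.vstack([profiles @ spectra for profiles, spectra in compressed])


# Synthetic run: four compounds eluting at different times, plus noise.
rng = np.random.default_rng(1)
scans = np.arange(400)
elution = np.stack(
    [np.exp(-0.5 * ((scans - c) / 8.0) ** 2) for c in (60, 150, 240, 330)], axis=1
)
spectra = rng.uniform(0.0, 1.0, size=(4, 500))
data = elution @ spectra + 0.01 * rng.standard_normal((400, 500))

compressed = compress_segments(data, n_segments=4, rank=2)
stored = sum(p.size + s.size for p, s in compressed)
error = np.linalg.norm(data - reconstruct(compressed)) / np.linalg.norm(data)
print(f"stored values: {stored} of {data.size}, relative error: {error:.3f}")
```

When several runs are stacked into a three-way array (sample x time x m/z), the same idea extends to genuine tensor decompositions such as PARAFAC.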

Challenge 2: Variability in Peak Shapes

Even for the same compound, peak shapes can vary significantly due to differences in instrumentation or sample conditions. Traditional methods, which rely on fixed thresholds or standard peak extraction algorithms, do not sufficiently capture variations in peak width, height, or shape. As a result, they often fail to identify or quantify such peaks accurately, and critical information is lost.

By employing deep learning models such as convolutional neural networks (CNNs), which use the entire raw data as input, it is possible to learn and accurately detect complex peak shapes and subtle variations. This precise extraction of even minute signals enhances the accuracy of subsequent analyses.
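As a rough illustration of the building block involved, the sketch below applies a single 1D convolution followed by a ReLU to a raw trace. This is not a trained CNN: the kernel weights and the bias are hand-set assumptions standing in for values a network would learn from labeled chromatograms, and the signal is synthetic.

```python
import numpy as np


def conv1d_relu(signal, kernel, bias):
    """Valid-mode 1D cross-correlation (the 'convolution' of CNN layers) plus ReLU."""
    return np.maximum(np.correlate(signal, kernel, mode="valid") + bias, 0.0)


# Hand-set, zero-sum Gaussian kernel: a stand-in for weights a CNN would learn.
half = 12
taps = np.arange(-half, half + 1)
kernel = np.exp(-0.5 * (taps / 4.0) ** 2)
kernel -= kernel.mean()   # zero sum, so a flat baseline produces no response

# Synthetic trace: constant baseline, one narrow and one broad peak, and noise.
rng = np.random.default_rng(2)
x = np.arange(400)
signal = (
    5.0
    + 3.0 * np.exp(-0.5 * ((x - 100) / 3.0) ** 2)   # narrow peak
    + 3.0 * np.exp(-0.5 * ((x - 250) / 6.0) ** 2)   # broad peak
    + 0.05 * rng.standard_normal(x.size)
)

activation = conv1d_relu(signal, kernel, bias=-2.0)

# Group consecutive non-zero activations into runs and report each run's maximum.
active = (activation > 0).astype(int)
edges = np.flatnonzero(np.diff(np.concatenate(([0], active, [0]))))
centers = [
    int(s + np.argmax(activation[s:e])) + half   # shift back to signal coordinates
    for s, e in zip(edges[::2], edges[1::2])
]
print("detected peak centers (scan index):", centers)
```

The same filter responds to both the narrow and the broad peak while ignoring the baseline offset. A CNN stacks many such filters and learns their weights, which is what allows it to adapt to peak shapes that fixed thresholds miss.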

Challenge 3: Retention Time Shifts

Slight shifts in retention time between samples hinder both inter-sample comparison and peak identification. Traditionally, linear corrections or manual parameter adjustments were relied upon; however, these approaches have limitations, because even minimal residual shifts can reduce reproducibility and cause significant errors in subsequent analyses.

By utilizing deep learning models, the variation patterns in retention time caused by instrument or experimental conditions can be automatically learned and corrected in real time. As a result, comparisons between samples become much more accurate.
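For comparison, the simplest form of retention time correction can be sketched without any learning: estimate one global shift by cross-correlating a sample chromatogram against a reference. This classical baseline is our illustrative assumption and not the deep learning approach described above, which would instead predict non-linear, condition-dependent shifts. The function name and the synthetic traces are hypothetical.

```python
import numpy as np


def estimate_shift(reference, sample, max_shift):
    """Return the lag (in scans) by which `sample` is delayed relative to `reference`."""
    lags = np.arange(-max_shift, max_shift + 1)
    ref = reference - reference.mean()
    smp = sample - sample.mean()
    # np.roll wraps around; acceptable here because both ends are flat baseline.
    scores = [float(np.dot(ref, np.roll(smp, -lag))) for lag in lags]
    return int(lags[int(np.argmax(scores))])


x = np.arange(500)


def chromatogram(centers):
    """Sum of Gaussian peaks (width 5 scans) at the given scan positions."""
    return sum(np.exp(-0.5 * ((x - c) / 5.0) ** 2) for c in centers)


rng = np.random.default_rng(3)
reference = chromatogram([100, 220, 380])
# The sample elutes 7 scans later than the reference, with a little noise.
sample = chromatogram([107, 227, 387]) + 0.02 * rng.standard_normal(x.size)

shift = estimate_shift(reference, sample, max_shift=20)
aligned = np.roll(sample, -shift)
print(f"estimated shift: {shift} scans")
```

A single global lag cannot correct shifts that differ from peak to peak, which is the gap that warping methods and learned models aim to close.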

Challenge 4: Peak Overlaps

In complex mixtures, peaks from multiple components often overlap, making it challenging to extract them as individual peaks. The inability to separate overlapping peaks can lead to significant errors in quantification and identification.

To address this issue, advanced algorithms such as tensor decomposition, multivariate analysis, and even Transformer or graph neural network (GNN) approaches have been developed to effectively separate and identify individual components from overlapping peaks.
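One of the multivariate approaches mentioned above can be sketched with non-negative matrix factorization (NMF), which factors the (time x m/z) matrix into non-negative elution profiles and spectra. The multiplicative-update implementation below is a minimal sketch of our own, applied to a noise-free synthetic mixture of two hypothetical co-eluting compounds; it is not taken from any specific tool.

```python
import numpy as np


def nmf(data, n_components, n_iter=1000, seed=0):
    """Factor data ~= profiles @ spectra using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_scans, n_mz = data.shape
    profiles = rng.uniform(0.1, 1.0, size=(n_scans, n_components))
    spectra = rng.uniform(0.1, 1.0, size=(n_components, n_mz))
    eps = 1e-12   # avoids division by zero
    for _ in range(n_iter):
        spectra *= (profiles.T @ data) / (profiles.T @ profiles @ spectra + eps)
        profiles *= (data @ spectra.T) / (profiles @ spectra @ spectra.T + eps)
    return profiles, spectra


# Two hypothetical compounds whose peaks overlap in time (apexes at scans 45 and 55).
x = np.arange(100)
true_profiles = np.stack(
    [np.exp(-0.5 * ((x - 45) / 6.0) ** 2), 0.7 * np.exp(-0.5 * ((x - 55) / 6.0) ** 2)],
    axis=1,
)
true_spectra = np.zeros((2, 30))
true_spectra[0, [3, 10, 17]] = [1.0, 0.6, 0.3]    # compound A ions (bin indices)
true_spectra[1, [10, 22, 27]] = [0.4, 1.0, 0.5]   # compound B shares bin 10 with A
data = true_profiles @ true_spectra               # noise-free for clarity

profiles, spectra = nmf(data, n_components=2)
error = np.linalg.norm(data - profiles @ spectra) / np.linalg.norm(data)
print(f"relative reconstruction error: {error:.4f}")
print("apex scan of each recovered profile:", np.argmax(profiles, axis=0))
```

NMF solutions are not unique in general, so in practice additional constraints such as unimodal elution profiles, or several runs analyzed jointly with a tensor decomposition, are used to stabilize the result.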

Application: Molecular Identification

In the context of molecular identification, traditional methods have primarily relied on spectral databases to identify compounds. However, these databases often face limitations when dealing with unknown compounds due to incomplete reference data and discrepancies in experimental conditions. Machine learning offers several solutions to overcome these challenges.
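For context, the database lookup that traditional identification relies on is often a spectral similarity search. The sketch below ranks library entries by cosine similarity to a query spectrum. The compound names and intensity vectors are hypothetical placeholders, and real library search tools typically add intensity weighting and m/z tolerance handling that are omitted here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors on a shared m/z grid."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_library(query, library):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = {name: cosine_similarity(query, spec) for name, spec in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference spectra binned onto the same six m/z channels.
library = {
    "compound_A": np.array([100.0, 20.0, 0.0, 5.0, 0.0, 0.0]),
    "compound_B": np.array([10.0, 0.0, 100.0, 40.0, 0.0, 5.0]),
    "compound_C": np.array([0.0, 50.0, 50.0, 0.0, 100.0, 0.0]),
}
query = np.array([12.0, 0.0, 95.0, 45.0, 0.0, 3.0])   # measured spectrum (made up)

ranking = rank_library(query, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The limitation discussed above follows directly from this design: a compound that is missing from `library` can never be returned, however good the measurement is.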

Application of machine learning in GC-MS/LC-MS analysis. Molecular identification is performed using PCA and CNN, while LSTM extracts elemental information from spectra with high accuracy. This facilitates accelerated pre-processing and the identification of unknown components, thereby enhancing both the efficiency and accuracy of analyses in Materials Informatics and Lab Automation.

Fig.2 Machine Learning Applications in Mass Spectrometry

Elemental Composition Analysis

Deep learning models such as RNNs and LSTMs can automatically analyze high-resolution mass spectral data to determine the number and types of elements present in a compound. This capability allows for accurate molecular formula predictions even for compounds not found in existing databases.
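To show what a molecular formula prediction has to resolve, the sketch below uses the classical, non-learning baseline: enumerating CHNO formulas whose monoisotopic mass matches a measured neutral mass within a ppm tolerance. This is our illustrative stand-in and not the RNN/LSTM approach described above. The element limits and the 5 ppm tolerance are assumptions, and no chemical plausibility rules are applied.

```python
# Monoisotopic masses (in u) of the most abundant isotopes.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def format_formula(counts):
    """Turn (C, H, N, O) counts into a string such as 'C8H10N4O2'."""
    return "".join(
        f"{element}{n if n > 1 else ''}" for element, n in zip("CHNO", counts) if n > 0
    )


def candidate_formulas(target_mass, ppm=5.0, max_counts=(20, 40, 6, 6)):
    """List CHNO formulas whose mass is within `ppm` of `target_mass` (neutral)."""
    tolerance = target_mass * ppm * 1e-6
    max_c, max_h, max_n, max_o = max_counts
    m = MONOISOTOPIC_MASS
    hits = []
    for c in range(1, max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = target_mass - (c * m["C"] + n * m["N"] + o * m["O"])
                if rest < -tolerance:
                    continue
                h = round(rest / m["H"])   # hydrogen count that best fills the rest
                if 0 <= h <= max_h and abs(rest - h * m["H"]) <= tolerance:
                    hits.append(format_formula((c, h, n, o)))
    return hits


# Neutral monoisotopic mass of caffeine (C8H10N4O2), used as a worked example.
candidates = candidate_formulas(194.08038)
print(candidates)
```

Enumeration alone can return several formulas for one mass, and the list grows quickly for heavier molecules. Additional evidence such as isotope patterns, fragment ions, or a learned model is needed to rank the candidates.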

Structural Identification

Advanced algorithms like Transformer networks and graph neural networks (GNNs) can directly predict molecular structures from mass spectra. Transformers effectively capture the interrelationships within different parts of the spectrum, while GNNs excel at analyzing molecular structural patterns, thereby providing significant value in identifying unknown compounds. Furthermore, these approaches extend the capabilities of existing databases, enabling the identification of compounds not previously catalogued and assisting in the validation of experimental results.

Conclusion and Perspectives

In this article, we have discussed the application of machine learning and deep learning techniques to address challenges inherent in GC-MS and LC-MS data analysis, such as the vast data volume, variability in peak shapes, retention time shifts, and peak overlaps.

Additionally, we have introduced advanced approaches for molecular identification, including elemental composition analysis and structural determination, that extend the capabilities of conventional database-dependent methods. These technologies offer numerous advantages, including automation, accelerated processing, enhanced reproducibility, and the identification of unknown compounds. However, challenges remain, such as the need for extensive labeled datasets, high computational resource demands, and limited model interpretability. By advancing solutions to these issues and further integrating them with Materials Informatics (MI) and Lab Automation (LA), we can achieve more efficient and precise mass spectrometry analyses. If you are interested in these developments or have any inquiries regarding your own data analysis, please do not hesitate to contact us at MI-6.

Reference

S. Moldoveanu and V. David, "Derivatization Methods in GC and GC/MS," 2018, DOI: 10.5772/intechopen.81954.
H. Guo and J. A. MacKay, "Chapter 8 - A pharmacokinetics primer for preclinical nanomedicine research," in Nanoparticles for Biomedical Applications, E. J. Chung, L. Leon, and C. Rinaldi, Eds., Micro and Nano Technologies, Elsevier, 2020, pp. 109–128, DOI: 10.1016/B978-0-12-816662-8.00008-4.
D. K. Pinkerton, K. M. Pierce, and R. E. Synovec, "Chapter 10 - Chemometric Resolution of Complex Higher Order Chromatographic Data with Spectral Detection," in Resolving Spectral Mixtures, C. Ruckebusch, Ed., Data Handling in Science and Technology, vol. 30, Elsevier, 2016, pp. 333–352, DOI: 10.1016/B978-0-444-63638-6.00010-3.
C. A. Smith, E. J. Want, G. O'Maille, R. Abagyan, and G. Siuzdak, "XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification," Analytical Chemistry, vol. 78, no. 3, pp. 779–787, 2006, DOI: 10.1021/ac051437y.
M. Katajamaa, J. Miettinen, and M. Orešič, "MZmine: Toolbox for Processing and Visualization of Mass Spectrometry Based Molecular Profile Data," Bioinformatics, vol. 22, no. 5, pp. 634–636, March 2006, DOI: 10.1093/bioinformatics/btk039.
H. Tsugawa et al., "MS-DIAL: Data-Independent MS/MS Deconvolution for Comprehensive Metabolome Analysis," Nature Methods, vol. 12, no. 6, pp. 523–526, 2015, DOI: 10.1038/nmeth.3393.
A. Aldama-Campino, K. Döös, J. Kjellsson, and B. Jönsson, "TRACMASS: Formal Release of Version 7.0," Zenodo, version 7.0-beta, December 2020, DOI: 10.5281/zenodo.4337926.
J.-R. Delorme et al., "The Keck Planet Imager and Characterizer: A Dedicated Single-Mode Fiber Injection Unit for High-Resolution Exoplanet Spectroscopy," arXiv, 2021, DOI: 10.48550/arXiv.2107.12556.
Y. Jiang et al., "GC-MS Fingerprinting Combined with Chemical Pattern-Recognition Analysis Reveals Novel Chemical Markers of the Medicinal Seahorse," Molecules, vol. 28, no. 23, article 7824, 2023, DOI: 10.3390/molecules28237824.
A. Skarysz et al., "Convolutional Neural Networks for Automated Targeted Analysis of Raw Gas Chromatography-Mass Spectrometry Data," 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 2018, pp. 1–8, DOI: 10.1109/IJCNN.2018.8489539.
C.-C. Chen et al., "Logistic Regression Analysis of LC-MS/MS Data of Monomers Eluted from Aged Dental Composites: A Supervised Machine-Learning Approach," Analytical Chemistry, vol. 95, no. 12, pp. 5205–5213, 2023, DOI: 10.1021/acs.analchem.2c04362.
J. Liu, J. Zhang, Y. Luo, S. Yang, J. Wang, and Q. Fu, "Mass Spectral Substance Detections Using Long Short-Term Memory Networks," IEEE Access, vol. PP, pp. 1–1, 2019, DOI: 10.1109/ACCESS.2019.2891548.