The use of data-driven experimental design and surrogate modeling is becoming increasingly important in chemical process engineering, particularly when physical experimentation is time-consuming or costly. 

This article presents a chemical process optimization scheme that applies the concept of Materials Informatics to sequentially improve process conditions through Bayesian optimization, while constructing surrogate models based on simulation data.

Fig.1 The framework for chemical process optimization presented in this article

This article presents a practical case study applying a data-driven Design of Experiments (DoE) framework, combined with Bayesian optimization, to the process design of butanol dehydration using H-ZSM5 as a catalyst.

Our goal is to examine how the initial sample size and the number of optimization iterations affect surrogate model performance and candidate identification. Unlike traditional grid-based DoE, our approach integrates simulation-based design and adaptive learning via Gaussian Process (GP) modeling. The case provides insight into how practitioners might tune their DoE and Bayesian optimization (BO) pipelines for improved model accuracy and optimization efficiency.

This article is structured in two parts:

  1. Process and Parameter Definition
  2. Modeling and Optimization Results

In addition to presenting the core methodology and results, we also reflect critically on the limitations of our current approach, including aspects of reproducibility, uncertainty quantification, and generalizability. These are discussed transparently as part of our ongoing effort to improve data-driven methods for process development.

Process and Parameter Definition

The selected case study is the dehydration of 1-butanol into 1-butene and dibutyl ether via parallel pathways, catalyzed by H-ZSM5 zeolite. This system is drawn from the study by Abbasi et al. (2022), which addresses both catalyst and process design in a multiscale framework. While their work includes material selection, our focus here is exclusively on process configuration and parameter tuning through simulation and optimization.

The relevance of this system stems from its role in renewable chemical manufacturing. 1-Butene is an industrially important olefin (e.g. in elastomer production), and the prospect of deriving it from bio-butanol opens up sustainable process pathways.

Step 1: Process Definition

We define a simplified process configuration assuming complete butanol conversion. The flowsheet includes the following key steps:

Heating → Reaction → Liquefaction → Flash Separation

This configuration was implemented using DWSIM (version 8.7.1), an open-source chemical process simulator suitable for early-stage design and surrogate data generation. Figure 2 illustrates the flowsheet model replicating the full-conversion assumption.

The flowsheet comprises four unit operations (heating, reaction, liquefaction, and flash separation) and serves as the platform for generating the output data required for Bayesian optimization.

Fig.2 Butanol dehydration process flowsheet

We note that alternative process configurations are possible (as explored in the reference article), but are outside the scope of this study.

Step 2: Input and Output Definition

Our design objective is to maximize conversion and selectivity toward 1-butene under a constrained set of operational variables. These are defined as follows, with bounds adopted from the referenced literature:

Inputs                Location   Range
1-butanol flow        Butanol    6–15 t/h
Reactor temp.         H-001      200–260 °C
Reactor volume        R-001      0.5–1.5 m³
Liquefaction temp.    H-002      5–15 °C
Vessel pressure       V-001      1 atm

Outputs               Location   Target
1-butanol conversion  R-001      >99%
1-butene purity       Butene     >99%

Unit operation settings and thermodynamic models were aligned with those in Abbasi et al. to ensure consistency in simulation response.
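As an illustration only, the input bounds and output targets listed above could be captured in a small data structure before any sampling is done. The variable names, the `scale` helper, and the treatment of vessel pressure as a fixed value are assumptions made for this sketch; they are not taken from the original workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputVar:
    name: str
    location: str  # flowsheet object the variable belongs to
    low: float
    high: float
    unit: str

    def scale(self, value: float) -> float:
        """Map a physical value onto [0, 1] within the variable's bounds."""
        return (value - self.low) / (self.high - self.low)


# Bounds as listed in the input table above (names are hypothetical)
INPUTS = [
    InputVar("butanol_flow", "Butanol", 6.0, 15.0, "t/h"),
    InputVar("reactor_temp", "H-001", 200.0, 260.0, "degC"),
    InputVar("reactor_volume", "R-001", 0.5, 1.5, "m3"),
    InputVar("liquefaction_temp", "H-002", 5.0, 15.0, "degC"),
]

# Vessel pressure is listed as a single value, so it is treated as fixed here
FIXED = {"vessel_pressure_atm": 1.0}

# Both outputs must exceed 99 %, expressed as fractions
TARGETS = {"butanol_conversion": 0.99, "butene_purity": 0.99}


def meets_targets(outputs: dict) -> bool:
    """True only if every target threshold is exceeded."""
    return all(outputs[name] > limit for name, limit in TARGETS.items())
```

Scaling every input to the unit interval is a common preparation step for Gaussian Process models, because a single kernel length scale then has a comparable meaning across variables with very different physical ranges.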

Step 3: Initial Sampling

Initial data points (X) were generated using an optimal DoE strategy across the defined input ranges. Sample sizes were chosen as 50, 100, 250, and 500 points. We fixed the number of discrete experimental levels to 5 after finding that 3 levels yielded poor initial model performance during preliminary testing.
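The sketch below shows one way a five-level candidate set could be built and sampled. The article does not name the optimality criterion behind its DoE strategy, so a seeded random subset of the full five-level grid is used here purely as a stand-in; it is not the authors' design method. The input names are hypothetical, and the sketch assumes the five levels apply to each of the four variable inputs.

```python
import itertools
import random

# Bounds from the input table; vessel pressure is fixed and therefore omitted
BOUNDS = {
    "butanol_flow": (6.0, 15.0),        # t/h
    "reactor_temp": (200.0, 260.0),     # degC
    "reactor_volume": (0.5, 1.5),       # m3
    "liquefaction_temp": (5.0, 15.0),   # degC
}
N_LEVELS = 5


def levels(low, high, n=N_LEVELS):
    """Evenly spaced discrete levels, including both bounds."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def candidate_grid(bounds):
    """Full factorial grid over all inputs (5**4 = 625 combinations here)."""
    names = list(bounds)
    axes = [levels(*bounds[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


def draw_initial_design(bounds, n_points, seed=0):
    """Stand-in for an optimal design: random subset without replacement."""
    rng = random.Random(seed)
    return rng.sample(candidate_grid(bounds), n_points)
```

Under these assumptions the grid holds 625 distinct combinations, so the largest design of 500 points would already cover most of it. An optimal-design algorithm would replace `draw_initial_design` while reusing the same candidate grid.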

Step 4: Simulation of Initial Points

Each DoE point was simulated using DWSIM to generate output data (Y). Points where all target thresholds were trivially satisfied were removed to preserve optimization value and test convergence behaviour. Similarly, physically invalid points were excluded through postprocessing.

The resulting dataset sizes (after filtering) were:

  • 12 points (from initial 50)
  • 26 points (from 100)
  • 63 points (from 250)
  • 125 points (from 500)

This data reduction highlights the importance of data screening in simulation-based DoE workflows, which should be carefully documented in practical deployments.
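A minimal sketch of the screening step is given below, together with the retention ratios implied by the reported counts. The `valid` flag and the row layout are hypothetical; the article does not describe how invalid points were detected.

```python
TARGETS = {"butanol_conversion": 0.99, "butene_purity": 0.99}


def screen(results, targets=TARGETS):
    """Drop invalid rows and rows that already satisfy every target."""
    kept = []
    for row in results:
        if not row["valid"]:  # hypothetical flag set during postprocessing
            continue
        if all(row[name] > limit for name, limit in targets.items()):
            continue  # trivially satisfied, no optimization value
        kept.append(row)
    return kept


# Dataset sizes reported above: initial points -> points kept after filtering
REPORTED = {50: 12, 100: 26, 250: 63, 500: 125}
retention = {n: kept / n for n, kept in REPORTED.items()}

for n, ratio in retention.items():
    print(f"{n:>3} initial points: {ratio:.1%} retained")
```

The ratios fall between 24% and 26% for all four designs, so roughly three quarters of the simulated points were screened out regardless of design size. Recording this ratio alongside the filter rules is one simple way to document the screening step.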

Step 5: Bayesian Optimization

Bayesian optimization was applied using Gaussian Process surrogate models to iteratively propose promising input candidates. 
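The article does not specify the kernel, hyperparameters, or acquisition function, so the following is only a sketch of how a GP surrogate can rank candidates. It assumes a squared-exponential kernel with a fixed length scale, inputs scaled to [0, 1], a single output, and an acquisition defined as the probability that the output exceeds its target. All function names are hypothetical.

```python
import math

import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel (unit variance) between rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_tq.T @ alpha
    v = np.linalg.solve(chol, k_tq)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def prob_exceeds(mean, std, target):
    """P(y > target) under a normal predictive distribution."""
    z = (mean - target) / np.maximum(std, 1e-12)
    return np.array([0.5 * (1.0 + math.erf(zi / math.sqrt(2.0))) for zi in z])


def propose(x_train, y_train, x_pool, target):
    """Index of the pool point most likely to exceed the target."""
    offset = y_train.mean()  # centre the data for the zero-mean GP
    mean, std = gp_posterior(x_train, y_train - offset, x_pool)
    return int(np.argmax(prob_exceeds(mean + offset, std, target)))
```

In practice the pool would exclude points that have already been simulated, and the kernel hyperparameters would be fitted to the data rather than fixed. With two outputs, one simple extension is to multiply the two exceedance probabilities, which assumes the outputs are modeled independently.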

Step 6: Candidate Evaluation and Iteration

Proposed candidate combinations were simulated in DWSIM and either retained or discarded based on output performance. This process was repeated for up to four iterations per initial dataset.
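The iteration logic described above can be sketched as a simple loop with a fixed budget of four iterations. How candidates were retained or discarded is not detailed in the article; this sketch assumes that failed simulations are discarded, all other results are added to the dataset, and candidates meeting every target are recorded as optimal. The `propose` and `simulate` callables are placeholders for the surrogate-based proposal step and the DWSIM run.

```python
MAX_ITERATIONS = 4
TARGETS = {"butanol_conversion": 0.99, "butene_purity": 0.99}


def meets_targets(outputs):
    """True only if every target threshold is exceeded."""
    return all(outputs[name] > limit for name, limit in TARGETS.items())


def run_campaign(initial_data, propose, simulate, max_iterations=MAX_ITERATIONS):
    """Run the propose-simulate loop and collect candidates meeting all targets.

    initial_data: list of (inputs, outputs) pairs from the screened DoE
    propose:      callable taking the current dataset, returning new inputs
    simulate:     callable taking inputs, returning outputs or None on failure
    """
    data = list(initial_data)
    optimal = []
    for iteration in range(1, max_iterations + 1):
        candidate = propose(data)
        outputs = simulate(candidate)
        if outputs is None:
            continue  # assumed handling: discard failed or invalid runs
        data.append((candidate, outputs))  # retained for the next model fit
        if meets_targets(outputs):
            optimal.append((iteration, candidate, outputs))
    return data, optimal
```

Keeping the iteration number with each optimal candidate makes it straightforward to compare how quickly each initial dataset reaches the targets.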

Modeling and Optimization Results

Surrogate Model Performance

To evaluate model accuracy across different initial sample sizes, we trained Gaussian Process (GP) regression models at each optimization iteration. Figure 3 presents the coefficient of determination (R²) for each target output over four iterations.

The plot shows how the prediction accuracy of the surrogate model changes with initial sample size and as the iterations proceed, which makes it possible to assess the effectiveness of sequential learning via Bayesian optimization.

Fig.3 R² performance of GP models by sample size and iteration

Across all sample sizes, R² values were generally high (above 0.9), particularly for the 1-butene purity and recovery outputs. We observed a slight improvement in R² with increasing sample size, but diminishing returns were evident between 250 and 500 points.
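For reference, the coefficient of determination can be computed as in the sketch below. The article does not state whether R² was evaluated on the training data or on held-out points, so the choice of evaluation set is left to the caller here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that reproduces every observation gives 1.0, while one that always predicts the mean of the observations gives 0.0. Scores computed on held-out points are usually a more conservative indicator of surrogate quality than scores computed on the training data.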

An important nuance emerged in the smaller-sample cases (e.g., 26 points): additional optimization iterations improved model fit substantially. This indicates that Bayesian optimization iterations can partly compensate for limited initial data, a finding of practical relevance for cost-sensitive experimental design.

Candidate Identification and Evaluation

Candidate points proposed during each optimization iteration were evaluated via DWSIM simulations. Results for butanol conversion across all scenarios are shown in Figure 4, separated into:

  • Initial DoE points
  • Optimization candidates (partial target satisfaction)
  • Optimal candidates (all targets met)

This figure illustrates the distribution of butanol conversion at the initial DoE points and at the candidate points proposed by Bayesian optimization. It allows a visual comparison of how performance improves with different initial sample sizes and optimization steps, and shows how the candidate points progress toward the target.

Fig.4 Butanol conversion across initial and optimized candidates

Findings:

  • Larger initial sample sizes tended to yield successful candidates in fewer iterations.
  • Small-sample cases could still find optimal candidates, but often required all four iterations.
  • There was no clear clustering of optimal points, likely due to the relatively broad input bounds. This suggests that without prior narrowing, the optimization landscape remains multimodal.

Optimal Input Combinations

Figure 5 presents the final optimal input combinations identified at each initial sample size.

Comparing the distribution of these optimal points within the input space allows a visual assessment of how sample size affects the breadth of exploration and the tendency toward convergence.

Fig.5 Optimal input combinations by initial sample size

As sample size increased, the optimizer was able to explore a wider variety of viable regions in the input space. However, the distribution of optimal points did not converge to a narrow region. This reinforces the idea that tighter input bounds (based on prior process knowledge) are essential when seeking precise and robust operating conditions.
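One simple way to put a number on this lack of convergence is the range of the optimal points along each input, expressed as a fraction of the allowed range. The sketch below assumes hypothetical input names, and the two example points are made-up values for illustration only, not results from this study.

```python
# Bounds from the input table (names are hypothetical)
BOUNDS = {
    "butanol_flow": (6.0, 15.0),        # t/h
    "reactor_temp": (200.0, 260.0),     # degC
    "reactor_volume": (0.5, 1.5),       # m3
    "liquefaction_temp": (5.0, 15.0),   # degC
}


def normalised_spread(points, bounds=BOUNDS):
    """Range covered by the points along each input, as a fraction of its bounds."""
    spread = {}
    for name, (low, high) in bounds.items():
        values = [point[name] for point in points]
        spread[name] = (max(values) - min(values)) / (high - low)
    return spread


# Illustrative, made-up points only (not taken from Figure 5)
example_points = [
    {"butanol_flow": 6.0, "reactor_temp": 215.0,
     "reactor_volume": 0.75, "liquefaction_temp": 10.0},
    {"butanol_flow": 15.0, "reactor_temp": 245.0,
     "reactor_volume": 1.0, "liquefaction_temp": 10.0},
]
print(normalised_spread(example_points))
```

Values near 1 indicate that optimal points span almost the whole allowed range for that input, while values near 0 indicate convergence to a narrow region. Inputs with a large spread are natural candidates for tighter bounds based on prior process knowledge.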

Conclusions

This study demonstrates the feasibility of applying data-driven DoE combined with Bayesian optimization to the process design of butanol dehydration. Our results highlight the trade-offs between initial sampling effort and optimization performance, and offer preliminary guidance for configuring such frameworks in practice.

Key takeaways:

  • Initial sample size significantly affects initial model quality and optimization efficiency.
  • Optimization iterations can improve model accuracy, especially in small-sample regimes.
  • Broad input bounds limit convergence and complicate optimization; tighter prior bounds are recommended.
  • Surrogate modeling is effective, but should be accompanied by uncertainty quantification and validation procedures.

Future work should include:

  • Integrating materials design for a comprehensive materials-process design approach.
  • Reusing material and process models to guide product development.
  • Applying this framework to design processes that address sustainability concerns, such as minimizing greenhouse gas emissions or reducing waste.

References

  1. Abbasi, M. R.; Galvanin, F.; Blacker, A. J.; Sorensen, E.; Shi, Y.; Dyer, P. W.; Gavriilidis, A. Process-Oriented Approach towards Catalyst Design and Optimisation. Catalysis Communications 2022, 163, 106392. https://doi.org/10.1016/j.catcom.2021.106392
  2. DWSIM – The Open Source Chemical Process Simulator (2024). https://dwsim.org/