Skip to content

A terrible idea: Using Random Forest for root cause analysis in manufacturing

Over the past years, our interactions with industry experts have revealed a significant trend for root cause analysis in manufacturing. As the data coverage in modern factories increases, we are observing a growing number of manufacturers who adopt “off-the-shelf” machine learning (ML) algorithms in favor of traditional correlation-based methods. Despite the initial promise of conventional ML algorithms, like Random Forests, this article demonstrates that they can provide seriously misleading conclusions.

Feature importance is the wrong proxy for root causes

Many manufacturers are trending towards the use of Random Forests for root cause analysis. There’s been an observable shift toward tree-based algorithms to understand the relationship between a set of production parameters and undesirable production outcomes. A popular method involves training a Random Forest to establish predictive relationships, and then utilizing “feature importance” to assess how strongly each production parameter (e.g., temperature) predicts a certain outcome of interest (e.g., quality losses). The key assumption here is that highly predictive parameters are also important for explaining the underlying production problems that need to be addressed. However, the core fallacy of this approach lies in the fact that Random Forests are designed for prediction tasks, which fundamentally differ from the objectives of root cause analysis in manufacturing.

So, why is the use of Random Forests for root cause analysis a bad idea? At its core, the issue is that predictive power should not be confused with causal effects. Root cause analysis aims to identify factors that significantly affect the financial bottom line. However, as we demonstrate in this article, the parameters deemed most predictive by a feature importance analysis don’t necessarily align with these critical factors. Moreover, a factory represents a structured flow of processes. Conventional ML algorithms neglect a factory’s process flow by simplifying it to mere tabular data. This oversimplification can have severe consequences where important causal relationships are entirely overlooked.

In the following, we’ll delve into the limitations of using Random Forests for root cause analysis. We’ll use two straightforward examples from a hypothetical cookie factory, which aim at uncovering the root causes of quality problems. In the first example, we demonstrate that Random Forests are sensitive to outliers and their feature importance overemphasizes the relevance of rare events. In the second example, we show that Random Forests are incapable of exploring root cause chains and thus fail to uncover the true sources of quality problems. Additionally, we’ll demonstrate how EthonAI’s graph-based algorithms effectively address these shortcomings.

The cookie factory

Let’s introduce a practical use-case for our analysis: a cookie factory. In the figure below, you’ll find the layout of this factory, from incoming goods to final quality control. The cookie factory is designed to produce orders in batches of 100 cookies each. While the overall setup is a simplification, it effectively captures the essence of a real-world production environment. Our focus here is to understand how various parameters interact and how they relate to the overall quality of the cookies. To this end, we’ll generate synthetic data based on two different scenarios.

Our cookie factory’s process flow begins with the arrival of raw ingredients. Flour, sugar, butter, eggs, and baking powder are supplied by two different providers, labeled Supplier A and Supplier B. To maintain a standard in each batch, a production order of 100 cookies exclusively uses ingredients from one supplier. Though the ingredients are fundamentally the same, subtle variations between suppliers can influence the final product.

Next is the heart of the cookie factory – the baking process. Here, the ingredients are mixed to form a dough, which is then shaped into cookies. These cookies are baked in one of three identical ovens. Precise control of the baking environment is key, as even minor fluctuations in temperature or baking duration can significantly impact cookie quality. For every batch of cookies, we record specific details: the oven used (Oven_ID), the average temperature during baking, and the exact baking duration. These data points provide valuable insights into the production process.

The final stage in our factory is quality control, which is conducted by an automated visual inspection system. This system spots and rejects any defective cookies – be they broken or burnt. We’ll use “yield” as a quality metric. Yield is defined as the share of cookies in one production order that meet our quality standards and are ultimately delivered to the customer (e.g., if 95 out of 100 cookies pass quality control the yield equals to 95%).

In our subsequent analyses, we’ll dissect how production parameters like Supplier_ID, Oven_ID, Temperature, and Duration influence the quality of our cookies. Our goal is to explore the interplay of these parameters that determines why some cookies make the cut while others have to be thrown away.

For our upcoming examples, we’ll simulate production data from our cookie factory. For this, we will create synthetic data for our production parameters, namely the Supplier_ID, Oven_ID, Temperature, and Duration. Additionally, we have to establish a “ground-truth” formula that models the relationship between these parameters and the cookie quality. We model the quality of each cookie production batch based on the following two parameters: the Temperature and the Duration. We’ll use the following formula to simulate the yield for each production batch:

Here’s a breakdown of the above formula:

  • The ideal baking temperature is set at 200° Celsius. Deviations from this temperature reduce the yield.
  • Similarly, the ideal baking duration is 20 minutes. Any deviation from this time affects the yield negatively.

Let’s consider an illustrative example: Imagine a cookie batch is baked at 210° Celsius for 22 minutes in Oven-1 using ingredients from Supplier A. The yield calculation would be: 100 – 5 (Temperature deviation) – 5 (Duration deviation) = 90%. This means 90 cookies pass quality control and 10 cookies are thrown away. Note that the above formula represents the actual modeled relationships in our scenario, but is assumed to be unknown for our root cause analysis.

Scenario 1: Predictive modeling is not the right objective for root cause analysis

In this first scenario, we’ll expose a critical weakness of Random Forests: their tendency to overfit outliers and, thus, to overestimate the relevance of rare events. While Random Forests can quantify predictive power via feature importance, they ignore the frequency and magnitude of each parameter’s financial impact. Our subsequent example with simulated data sheds light on this critical problem.

Setting the Simulation Scenario for Scenario 1

We start by simulating a dataset with 500 production batches based on our ground-truth quality formula. In order to demonstrate that Random Forests are very sensitive to outliers, we introduce a significant data imbalance into our dataset: only the first batch uses raw ingredients from Supplier B, whereas the remaining 499 batches use ingredients from Supplier A. Furthermore, we assume that after the first batch was produced, one of the cookie factory’s employees accidentally dropped the entire batch to the floor, resulting in all the cookies breaking. Consequently, the first batch of cookies has a yield of 0%. This exaggerated incident is specifically designed to highlight Random Forests’ sensitivity to outlier events. Such outlier events happen rarely, and hence, over an extended period, they impact the quality only minimally. Moreover, since those outlier events are often due to human error, they may not be avoidable. As such, a good root cause analysis should not identify the outlier event or anything that is spuriously correlated with it.

Here’s a snapshot of the simulated dataset:

Root cause analysis with Random Forest and feature importance

We now use a Random Forest model to analyze the above dataset of 500 production batches. The model is trained to predict the yield based on four parameters: Supplier_ID, Oven_ID, Temperature, and Duration. Subsequently, we compute the feature importance of each parameter to identify which of them is the most predictive of the overall cookie yield.

The feature importance analysis identifies Supplier_ID as the most predictive parameter with a score of 0.60, followed by Temperature at 0.21, and Duration at 0.19. This ranking suggests that the supplier has the largest effect on cookie quality. Knowing how the data was generated, which is that the Supplier_ID is unrelated to the quality, we can immediately establish that this finding is wrong. In fact, the Random Forest attributes the Supplier_ID with high feature importance, because Supplier B was only used once for the first production batch (i.e., Batch_ID = 001), which has a yield of 0%, because it was accidentally dropped to the floor. Hence, the Random Forest erroneously placed high importance on this outlier event and identified the Supplier_ID to be the reason for this event, which is also incorrect.

Root cause analysis with EthonAI Analyst

We now apply the EthonAI Analyst software to conduct the same analysis for the first scenario. The EthonAI Analyst makes use of graph-based algorithms that have been particularly designed for root cause analysis in manufacturing. One of their key abilities is to account for the frequency of root causes, which identifies the production problems that truly matter for cost reduction.

Upon analyzing the data, the EthonAI Analyst presents a ranking of parameters based on their impact. It attributes high importance to Temperature and Duration, thereby recognizing these as the primary factors affecting cookie quality. Notably, the Supplier_ID is deemed inconsequential because the EthonAI Analyst effectively avoids the overfitting issues encountered with the Random Forest. This demonstrates the importance of accounting for both the frequency and the impact of root causes to effectively identify the parameters that have a consistent effect on quality.

Scenario 2: A factory cannot be represented by tabular data

In our second scenario, we show how the inability of Random Forests to accurately model process flows leads to missed opportunities for quality improvement. Random Forests operate fundamentally different to the established problem-solving methodologies that are used in manufacturing. For example, the 5-Why method involves a procedure of repeatedly asking “why” to trace a root cause to its origin. However, since conventional ML algorithms treat factories as static data tables rather than dynamic processes, they fail to employ the backtracking logic that is essential for analyzing root cause chains.

Setting the Simulation Scenario for Scenario 2

We again simulate a dataset of 500 production batches with the same quality formula as before. Unlike the previous scenario, both suppliers A and B now supply a similar amount of ingredients across production orders. To illustrate how Random Forests fail to account for ripple effects throughout a factory, we add an additional complexity to the dataset. Specifically, we introduce a calibration issue in the baking process, which affects the temperature measurements of Oven-1, Oven-2, and Oven-3. This creates a root cause chain where the Oven_ID indirectly affects the yield by influencing the temperature. Below we visualize the temperature distributions across the different ovens, which show that Oven-2 and Oven-3 deviate more from the optimal temperature of 200° Celsius than Oven-1.

Root cause analysis with Random Forest and feature importance

We again use a Random Forest to analyze the new dataset of 500 production batches. As before, the model is trained on Supplier_ID, Oven_ID, Temperature, and Duration to predict the resulting yield. We then compute the feature importance to determine the most predictive production parameters.

The model identifies Temperature as the most significant parameter with a feature importance of 0.64, followed by Duration at 0.35. Notably, both Oven_ID and Supplier_ID have a feature importance of 0.00, implying they have no impact on the yield. However, since we know the underlying data generation process, we can confirm that the lack of feature importance attributed to Oven_ID is incorrect. This error occurs because the model fails to capture how Oven_ID indirectly affects yield through the Temperature parameter.

Root cause analysis with EthonAI Analyst

We now repeat the analysis for the second scenario with the EthonAI Analyst software. Unlike Random Forests, the EthonAI Analyst employs our proprietary algorithms that capture process flows by modeling them as a causal graph. This helps identifying complex root cause chains and tracking production problems back to their actual root cause.

Examining the results from the EthonAI Analyst presents a contrasting view to the Random Forest’s results. Like the Random Forest, it identifies Temperature and Duration as critical parameters. However, the EthonAI Analyst correctly recognizes Oven_ID as a significant parameter too. This is clearly illustrated in the extracted graph, which reveals Oven_ID’s indirect influence on yield through the Temperature parameter.


Random forests have become popular for root cause analysis in manufacturing. However, our article demonstrates they have serious limitations. In two simple scenarios with just four production parameters, we demonstrate that Random Forests fail to accurately identify the root causes of simulated quality losses. The first scenario showed their tendency to confuse predictive power with financial impact. The second scenario illustrated their inability to trace chains of root causes. This raises an important concern: Are conventional ML algorithms like Random Forests reliable enough when it comes to analyzing hundreds of production parameters in complex factories? Our findings suggest they are not.

Moving forward, we advocate for the adoption of graph-based algorithms. Compared to conventional ML algorithms, they provide more accurate insights and identify the problems that truly hit the bottom line. We hope this article inspires professionals to pursue more robust and effective root cause analysis in their factories. If you’re intrigued and want to explore the capabilities of graph-based algorithms, we encourage you to book a demo and see the difference for yourself.

Virtual Design of Experiment: How to optimize your production processes through digital tools

The article compares traditional and virtual Design of Experiments in manufacturing. It emphasizes the efficiency of virtual Design of Experiments in optimizing production processes using causal AI, while highlighting the challenges in data collection and advanced statistical software required for their effective implementation.

What is this article about?

Production processes are getting increasingly complex and, as such, it is difficult to make sure that they run optimally. To understand and optimize their production processes, manufacturers often turn to so-called “Design of Experiments” (DOEs), where they systematically test different parameter settings against each other. While DOEs can provide valuable insights to improve production processes, they are also time consuming and costly. Hence, DOEs are typically conducted infrequently and focus only on a subset of the production parameters. Such incomplete optimizations often lead to suboptimal settings.

In this article, we explore a possible solution: Virtual Design of Experiments. Virtual DOEs overcome the major drawbacks of DOEs by building a digital twin of the production process. As such, virtual DOEs allow for a comprehensive optimization in a cheaper and faster way compared to traditional DOEs. This allows them to be run more frequently and results in fully optimized production processes. In the following, we will discuss what DOEs are, how virtual DOEs work, and the challenges related to virtual DOEs.

What is a DOE?

DOE is a statistical method designed to experimentally assess how specific parameters influence outcomes in manufacturing processes. Its origins date back to Ronald Fisher in the 1920s, initially for agricultural applications. In the 1980s, Genichi Taguchi’s methods notably advanced its use in manufacturing. However, DOE’s full potential remains underexploited in the industry, especially outside of sectors like pharmaceuticals and semiconductors. This is often because of the significant time and effort required to master DOEs, which combine statistical know-how with domain-specific knowledge (for an in-depth overview of DOEs, we refer interested readers to Michel Baudin’s blog).

Let’s consider an example, where we investigate the influence of oven temperature on the final quality in a cake factory. The goal is to find the temperature, which results in the best cake quality. To this end, we could use a DOE to investigate the influence of the oven temperature on the cake quality. The core principle is to keep all parameters in the production process constant, while only changing the temperature and observe the quality outcome. For instance, we could bake cakes at two different temperature settings; that is, Setting A = 170°C and Setting B = 180°C using the exact same ingredients. Then, we compare the resulting cake quality for both temperature settings. If the quality increases when we change the temperature from 170°C to 180°C, we have found a strong indication of a better temperature setting and keep it for further experimentation. 

Although the results of our DOE suggest that 180°C is superior to 170°C, it could also be that 190°C is even better than 180°C. Hence, in order to find the best setting for the temperature, we have to iteratively run multiple DOEs. Once we have run multiple DOEs and optimized one parameter, we may want to continue optimizing other parameters as well (e.g., the ingredients). As such, we will have to run multiple DOEs for different parameter combinations, which quickly becomes time consuming and costly. Not only because running a DOE requires planning, but also because the new setting may actually be worse than the old one, which leads to more quality losses.

A potential solution are virtual DOEs, which leverage data collected throughout the production process to “virtually” run DOEs. This not only costs less money, but is also typically faster.

What is a Virtual DOE?

Virtual DOEs have only recently become feasible because of the vast amount of data collected in production processes and advances in statistical methods. Virtual DOEs have the same goal as traditional DOEs: understand how production parameters influence the quality and find the optimal setting for those parameters. Different from traditional DOEs, virtual DOEs do not take place in the actual production process, but virtually in a software. Therefore, there is no need to change the actual production process as the insights are gained by simulating the changes of parameters virtually. This scales better to a large number of parameters and, most importantly, doesn’t require changing the actual production process, which circumvents the risk of reducing the products’ quality.

In a nutshell, when conducting virtual DOEs, we take all available production data to build a digital twin of the production process. Such a virtualization allows us to run virtual experiments with different simulated temperature settings and optimize the parameter for the best quality results.

Challenges of Virtual DOEs

In order to conduct virtual DOEs, there are two major challenges: (1) collecting the right data of the production process and (2) using the right statistical software to build the environment that allows us to virtually emulate the production process. 

While manufacturers already record large amounts of data, it may not always be the data needed for virtual DOEs. In order to run virtual DOEs, we need sufficient variation in production parameters to build a digital twin of a production process. Hence, when collecting data, one should consult process engineers to identify relevant data. This makes setting up virtual DOEs an interdisciplinary initiative, which requires IT experts to work closely with process engineers.

Moreover, the software needed to build virtual DOEs is statistically complex, because it has to make use of causal simulation. Causal simulation requires quantifying the impact of specific changes in production parameters on the final quality output, while controlling for a myriad of confounding variables that could skew results. Furthermore, it must be capable of handling large datasets with varying degrees of variability and correlation to ensure that the virtual experiments closely mimic real-world scenarios. Only recently, statistical methods with these capabilities have transitioned from research to practical applications.


DOE is an important tool for manufacturing companies to understand and optimize their production processes. However, they are time consuming and costly as they have to be conducted in the actual production process. Virtual DOEs are DOEs that run virtually in a software and do not interfere with the actual production process. They can save a lot of time and money, but are non-trivial as they rely on the data collected from the production process and advanced statistical software. Hence, enabling manufacturing companies to run virtual DOEs requires the right choice of data and software.

What data is needed for AI-based root cause analysis?

In manufacturing, one concern unites everyone from line workers to top management: the quality of the goods produced. Continuous improvement of the baseline quality and fast reaction to quality issues are the keys to success. AI-based root cause analysis is the essential tool for effective quality management, and the data is the fuel. However, what kind of data is needed for effective root cause analysis in manufacturing? This article provides an overview.

Quality Metrics

A common hurdle in quality management, surprisingly, lies in establishing a robust quality metric. If we cannot accurately measure how good the quality is, we cannot monitor its stability nor can we judge whether our improvement actions are successful.

Although we learned in Six Sigma training how to qualify a measurement such that it meets the standards of ANOVA’s gauge R&R, it’s not guaranteed that we can set up such a measurement in practice. And even if we have a solid measurement set up, how many of us have ever worked with rock-solid pass/fail criteria? And how many of us have always resisted the temptation of “can’t we just move the spec limits a bit?” when we had to solve a quality issue? The first step towards data-driven root cause analysis should always be to make sure that we have a quality metric that we can trust and that has a fixed target value.

Process Data

The second challenge is a simple but sometimes overlooked fact: the best algorithm will fail to find a root cause if that root cause hasn’t left its traces in the data that we use for the analysis. Collecting a bunch of production data and throwing it into an AI tool can lead to interesting insights, but if the AI only finds meaningless relations, it can well be because there was nothing useful to be found in the data.

In that case it makes sense to take a step back and ask: what kind of issues have we solved in the past? Would these issues have been detectable with the available data? If not, can we add a sensor that records the missing information? Expert knowledge and domain knowledge can often be worked into the data collection by linking data from different sources. The more expert knowledge goes into data collection, the more straightforward it becomes to translate the results of an AI-driven root cause analysis into an improvement action.

Diagram showing where in a production flow the EthonAI Analyst collects input and output data

Linking of Data

Now that we have quality data and process data, they must be linked together. It is not enough to know that the temperature in equipment A was 45°C and the raw material was provided from Supplier B, we need to know which of the products that end up in the quality check were affected by these process conditions. Some manufacturers use unique batch IDs inscribed on their products, some use RFID tags to track them, but sometimes we simply have a linear flow of products without any identification. In this case, we can rely on timestamps and the knowledge of the time delay between process and quality check. There can be some uncertainties in this timestamp matching, but in most cases the AI algorithms are sufficiently robust to handle them.

Routing History

There are many production setups in which multiple machines can perform the same task and, depending on availability, one or the other equipment gets used for a given product. In this case, the routing information is highly valuable data for root cause analysis. Even if the equipment is too old to produce and transmit data about process conditions, the simple fact that the machine was used for many of the failed products can give a crucial hint to the process engineers who can then track down and fix the issue.

Process Sequence 

Lastly, sophisticated root cause analysis tools leverage information on how the products flow through the sequence of process steps to deduce causal relationships and map out chains of effects. Providing these tools with chronological process sequences can rule out irrelevant causal connections, enhancing both the speed and reliability of the analysis.


When embarking on the journey of AI-based root cause analysis in manufacturing, remember these key points: 

  • prioritize a robust quality metric, 
  • integrate expert knowledge in data collection, 
  • establish clear links between process and quality data, 
  • value routing information, 
  • and utilize chronological process information. 

By focusing on these areas, manufacturers can significantly enhance their quality management processes, leading to operational excellence and sustained success.