
The effect of unmeasured root causes in problem-solving

“What if I am not measuring all the potential root causes?” is a question we frequently encounter from industry experts. While it’s important to have comprehensive data for problem-solving, capturing every root cause is infeasible. This article illustrates that robust algorithms for root cause analysis can uncover significant production issues even amidst other unexplained effects.

Why the real world differs from theory

Any root cause analysis starts with aggregating the right set of data. In an ideal world, we would measure all variables that influence an outcome variable of interest (e.g., quality). Such holistic data coverage would enable an AI-based analysis to identify all drivers of production issues. In reality, however, some root causes are not measurable or are simply not reflected in the data. For instance, consider a missing temperature sensor in an injection molding machine or unmeasured sources of particles in a semiconductor fabrication process. In such scenarios, even the most advanced AI algorithms cannot directly reveal these unexplained root causes.

There is a common misconception that for AI-based root cause analysis to be effective, the data must be perfect. This is not the case. While it is true that unmeasured variables limit the ability to make process improvements, useful insights can still be gained from the data that most manufacturers collect today. The presence of unexplained variation does not preclude the value of such analyses. Imperfect models can still enhance process understanding. In this article, we will explore an example demonstrating how, despite the absence of some sensors, robust algorithms are capable of reliably identifying key root causes amidst unexplained variation.

Simulated production setup

We introduce a practical case for our root cause analysis by simulating data for five sensor measurements and a quality metric defined as yield. Our simulation aims to uncover root causes of yield losses using data from these sensor measurements across 10,000 production batches. The relationship between the sensor measurements and the yield is captured by the following formula:
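One functional form consistent with the breakdown below (the positive weights $c_1$, $c_2$, and $c_3$ are illustrative assumptions rather than the exact simulation values) is

$$\text{Yield} = 100 - c_1\,|S_1 - 100| - c_2\,|S_2 - 20| - c_3\,|S_3 - 50|$$

where $S_1, \dots, S_5$ denote the five sensor measurements; $S_4$ and $S_5$ do not appear because they have no effect on the yield.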

Here’s a breakdown of the above formula:

  • The ideal value of Sensor 1 measurement is 100. Deviations from this value reduce the yield.
  • The ideal value of Sensor 2 measurement is 20. Deviations from this value reduce the yield.
  • The ideal value of Sensor 3 measurement is 50. Deviations from this value reduce the yield.
  • Sensor 4 measurement and Sensor 5 measurement have no impact on the yield.

The figure below displays the distributions of the five sensor measurements and the production yield. Our goal is to identify sensor measurements that cause yield variation by utilizing the EthonAI Analyst software. The EthonAI Analyst employs causal algorithms to pinpoint the root causes behind production issues. Importantly, we approach this analysis as if the above ground-truth function linking sensor measurements to yield were unknown.

In the following, we will systematically omit sensor measurements from our dataset to observe any changes in root cause analysis outcomes. This approach tests the robustness of our analysis, ensuring it can still accurately identify all measured effects. Despite the inability to account for unmeasured sensors, we demonstrate that even models with incomplete data can significantly improve our understanding of the process.

Numerical experiments for unmeasured root causes

In the first scenario, we investigate the situation where all sensors are operational. Here, the EthonAI Analyst should be able to detect all root causes accurately, as all information is contained in the dataset. Upon analyzing the data, the EthonAI Analyst presents a ranking of sensor measurements based on their impact. The first three sensor measurements are correctly identified as root causes, whereas Sensor 4 measurement and Sensor 5 measurement receive no attribution in the analysis. The resulting root cause model is therefore an accurate approximation of the ground-truth relationships.

In the next scenario, we remove Sensor 1 from our dataset without changing the rest of the data. The effect of Sensor 1 measurement therefore shows up as unexplained variation in the yield. However, a good root cause analysis should still detect the measurements of Sensor 2 and Sensor 3 as root causes of yield losses. As can be seen in the root cause ranking below, the EthonAI Analyst still gives the same weight to Sensor 2 measurement and Sensor 3 measurement. The magnitude of the effect is also close to that of the previous root cause model, where the entire variation in the yield could be explained.

In the final scenario, we remove both Sensor 1 and Sensor 2 from our dataset without changing the rest of the data. Now two out of the three root causes cannot be explained, which leaves a large portion of the variation unexplained. We analyze the data with the EthonAI Analyst and still get the expected results. In particular, Sensor 3 measurement is detected as a root cause, and its magnitude is comparable to the one in the root cause model where the entire variation could be explained.

Conclusion

This article has demonstrated that comprehensive data collection is crucial for effective root cause analysis, but it’s not necessary to measure every variable to begin the process. Robust algorithms can uncover significant production issues even when faced with incomplete data. The real world often presents challenges where certain root causes remain unmeasurable or unaccounted for, such as missing sensors. Despite these limitations, AI-based analyses can still provide valuable insights, which enhances process understanding and facilitates KPI improvements.

Through numerical experiments, we have illustrated the effectiveness of the EthonAI Analyst software in identifying root causes, even when sensor data is systematically removed. Our simulations revealed that the EthonAI Analyst accurately identified key root causes in scenarios where all sensors were operational, as well as in scenarios where sensors were deliberately omitted. Importantly, the Analyst’s ability to maintain accurate root cause models, even with incomplete data, underscores its reliability in real-world production settings.

In our experience, a significant portion of problems can be addressed by the existing data manufacturers collect today. Initiating root cause analysis early not only aids in problem-solving but also guides decisions regarding sensor deployment. Often, unmeasured relationships can be approximated using proxies (e.g., machine IDs). For example, adding routing information (i.e., how individual units flow through production) can already point process experts to the sources of problems (e.g., towards suspicious machines). Our advice is clear: don’t wait for perfect data before embarking on data-driven analysis. Start with what you have, and progressively enhance data coverage and quality to drive continuous improvement in your manufacturing processes.

Deploying a Manufacturing Analytics System: On-premises vs. cloud-based solutions

A Manufacturing Analytics System (MAS) integrates across data sources and provides valuable insights into production processes. As companies evaluate their options, a key decision emerges: should they deploy the MAS on their own premises, or opt for a cloud-based Software as a Service (SaaS) solution?

This article discusses the merits of each approach to help businesses make an informed decision. It focuses on five major discussion points: data security, scalability, maintenance, cost effectiveness, and support.

Data Security and Compliance

On-Premises: Enhanced Control and Security

The primary advantage of an on-premises deployment lies in the enhanced control and security it offers. Companies with highly sensitive data often prefer on-premises solutions due to their stringent security requirements. It can be easier to conform to stringent or inflexible policies by hosting the MAS internally. This setup allows for a more hands-on approach to data management, ensuring compliance with standards like GDPR, HIPAA, NIST, or other industry-specific regulations.

Cloud-Based Solutions: Robust, Standardized Security

Cloud-based MAS solutions have often been perceived as less secure, and some companies generally distrust the cloud. However, especially in recent years, cloud offerings have evolved significantly. Reputable cloud providers employ robust security measures, including advanced encryption, regular security audits, and compliance with various international standards. They have the resources and expertise to implement and maintain higher levels of security than individual organizations can achieve on their own. For businesses without the capacity or desire to manage complex security infrastructure, a cloud-based MAS offers a secure, compliant, and hassle-free alternative.

Scalability on Demand

On-Premises: Tailored to Specific Needs

An on-premises MAS deployment allows for extensive customization. Businesses can tailor the system to their specific IT and OT landscape, including guaranteed real-time responses. This capability is particularly beneficial for companies requiring deep integration with legacy systems and factory equipment. On the other hand, scaling on-premises solutions typically requires significant investment in hardware and infrastructure, as well as the technical expertise to manage these expansions.

Cloud-Based Solutions: Easy Scalability and Flexibility

Cloud-based MAS platforms shine in scalability. They allow businesses to scale their operations up or down with ease, without the need to invest in physical infrastructure. This scalability makes cloud solutions ideal for businesses experiencing rapid growth or fluctuating demands. Furthermore, cloud platforms are continually updated with the latest features and capabilities, ensuring businesses always have access to the most advanced tools without additional investment or effort in upgrading systems. A potential downside is that ultimate control of the deployment lies with the cloud provider, which can be a hurdle for highly regulated industries.

Maintenance and Updates

On-Premises: Hands-On, Resource-Intensive Maintenance

Maintaining an on-premises MAS requires dedicated IT personnel to manage hardware, perform regular software updates, and troubleshoot issues. This hands-on approach offers complete control over the maintenance schedule and system changes, but can be resource-intensive. Companies that already have specialized IT teams due to the nature of their operations may find this approach a natural fit.

Cloud-Based Solutions: Hassle-Free, Automatic Updates

Cloud-based solutions significantly reduce the burden of maintenance. The service provider typically manages all aspects of system maintenance, including regular updates, security patches, and technical support. Automatic updates ensure that the system is always running the latest software version, providing access to new features and improvements without additional effort or cost. This allows businesses to focus on their core operations, without the need to allocate and manage resources for system maintenance.

Cost Effectiveness

On-Premises: Higher Initial Investment but Predictable Long-Term Costs

Deploying any system on-premises typically involves a higher initial capital expenditure, including costs for hardware, software licensing, and installation. Over the long term, these costs can be more predictable, or at least there are no cloud-subscription fees to factor in. For organizations with the necessary infrastructure already in place, this model can be cost-effective, particularly when considering the longevity and stability of the investment.

Cloud-Based Solutions: Lower Upfront Costs with Ongoing Expenses

Cloud-based MAS solutions offer lower initial costs and much quicker setup compared to on-premises installations. Businesses can avoid significant expenses on hardware and infrastructure. This subscription model converts upfront investments into ongoing operational expenses. In addition to the ease of setup, this can be more cost-effective in the short term. However, for businesses with long-term predictable usage patterns, it is important to consider the cumulative costs over an extended period.

Support

On-Premises: Customized and Direct Control

This model of deployment demands a significant commitment of internal resources for maintenance and troubleshooting, necessitating dedicated, skilled IT personnel. While on-prem provides an unmatched level of control and customization, as discussed earlier in this post, the reliance on in-house capabilities for supporting the MAS can be a considerable burden on manufacturing customers.

Cloud-Based Solutions: Broad, Expert Support with 24/7 Availability

Cloud-based MAS solutions boast a scalable, expert support structure, alleviating the need for an in-house IT team to manage the MAS deployment. This is particularly important for operations spread across multiple locations or time zones. Automatic updates and maintenance conducted by the provider ensure the system remains up-to-date without any additional effort from the customer side. Furthermore, troubleshooting is accelerated in a cloud-based system because the infrastructure is standardized and uniform. This consistency reduces complexity and variability, which significantly improves the efficiency and speed of support services.

Conclusion

The choice between deploying a MAS on-premises or in the cloud depends on various factors including data security needs, customization requirements, budget constraints, network reliability, and maintenance capabilities. Each option has its merits, and the decision should align with the specific operational, financial, and strategic objectives of the organization. At EthonAI, we offer both options to meet our customers’ needs effectively.

A story of why causal AI is necessary for root cause analysis in manufacturing


Traditional machine learning is designed for prediction and often struggles with root cause analysis. The article presents a short story demonstrating how causal AI overcomes this problem.

Why causality is needed for decision-making

Data-driven decisions are paramount to stay competitive in today’s manufacturing. However, for effective decisions, we need tools that transform data into actionable insights. Traditional machine learning tools, while great for predictions, fall short in decision-making due to their inability to grasp cause and effect relationships. They fail to understand how different decisions impact outcomes. To make truly informed decisions, understanding these cause and effect dynamics is crucial.

Causal AI provides manufacturers with entirely new insights by going beyond the prediction-focused scope of traditional machine learning. It seeks to uncover the causes behind outcomes, which enables us to assess and compare the outcomes of different decisions. This offers crucial insights for more informed root cause analysis. For manufacturers, this means not only predicting what will happen, but also knowing which decision taken now leads to a better outcome in the future.

What is causal AI?

Causal AI, at its core, is an advanced form of artificial intelligence that seeks to understand and quantify cause-and-effect relationships in data. In particular, causal AI aims to understand how one variable A influences another variable B. This is important for decision-making: if we want to change A with the goal of increasing B, we need to know how A influences B. Traditional machine learning only uses A to predict B, but cannot answer what happens to B if we change A, as we will see in an example below. The answer to this question, however, is central to decision-making, in particular in the context of root cause analysis in manufacturing.

This article looks into the task of root cause analysis for quality improvement. The focus is to maximize “good” quality and minimize “bad” quality outcomes. Simply predicting when quality will drop is not enough in this setting. The objective is to identify and adjust specific production parameters (like adjusting a machine setpoint) when bad quality is observed, to restore good quality. Therefore, understanding the cause-and-effect relationships between these production parameters and the product quality is key. This knowledge allows us to pinpoint which parameters are causing quality issues and make necessary changes to achieve desired quality levels consistently. In the following, we tell a short story to demonstrate the capabilities of causal AI in this context.

Causal AI for root cause analysis

Let’s imagine a manufacturing company specializing in plastic Christmas trees, a seasonal product where quality and timeliness are key. The company faced a peculiar challenge: a noticeable drop in the quality of their plastic trees. Naturally, they turned to data for answers.

Their initial investigation was led by a skilled data scientist, who collected data about the production process. The production process consists of two steps: First, the plastic branches are sourced from a supplier. Second, the branches are put through a machine which attaches the branches to the trunk. There are two possible suppliers, A and B, and two possible machines, M1 and M2.

The data scientist used traditional machine learning techniques, which focused on predicting the quality based on the collected data. This led to an intriguing conclusion: the machine learning model suggested that machine M1 produced worse quality than M2. Based on this analysis, the data scientist recommended taking machine M1 out of service, which would lead to a substantial reduction in throughput and, hence, reduced production capacity. However, the story took a twist when the company decided to scrutinize both machines. To their astonishment, there was no recognizable difference in the settings of the machines or the machines themselves. This puzzling situation called for a deeper analysis, beyond what traditional machine learning could offer.

Luckily, a friend of the company’s data scientist was a renowned causal AI expert. The expert developed a tailored causal AI algorithm for the production process, seeking not just good predictions, but an understanding of the underlying cause-and-effect relationships in the production process. The causal AI model revealed an unexpected insight: the root cause of the quality drop was not the machine, but the supplier. In fact, it revealed that Supplier A delivered branches of worse quality than Supplier B. After talking to the factory workers, the company found out that the workers always put the branches of Supplier A through machine M1 and the branches of Supplier B through machine M2. They did this simply because the machines were closer to the boxes with the corresponding branches. Hence, all the low-quality branches of Supplier A ran through machine M1, which made machine M1 look like it was causing the drop in quality.

But why did the traditional machine learning model fail to identify the true root cause? The reason is that its objective is prediction and, for this, knowing which machine the branches went through was enough to predict the quality perfectly. In particular, since the traditional machine learning model didn’t understand the underlying cause-and-effect relationships, it simply used all available parameters. However, by doing so, it also used the machine as a parameter, which, in this example, is a so-called mediator. By using this mediator, it “blocked” any indirect influence from the supplier via the machines. As a result, the influence of the supplier got lost. Since the causal AI understood the underlying cause-and-effect relationships, in particular the relationship between supplier and machine, it could correctly identify the true root cause.
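As a toy illustration of this mechanism, here is a minimal sketch in code. The ground truth below (quality depends only on the supplier, and branches are routed deterministically, A through M1 and B through M2) is an assumption made for the sketch, not the article's actual data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000

# Assumed ground truth: quality depends only on the supplier;
# workers deterministically route A -> M1 and B -> M2.
supplier = rng.choice(["A", "B"], size=n)
machine = np.where(supplier == "A", "M1", "M2")
quality = np.where(supplier == "A", 70.0, 90.0) + rng.normal(0, 3, n)

X = pd.get_dummies(pd.DataFrame({"supplier": supplier, "machine": machine}), dtype=int)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, quality)
print(dict(zip(X.columns, model.feature_importances_.round(2))))
# Supplier and machine carry identical information here, so the forest can
# credit the machine columns and still predict well -- which says nothing causal.

# "What if supplier A's branches went through M2 instead?" A purely predictive
# model may forecast an improvement, yet the ground-truth process shows none,
# because the machine never enters the quality equation.
mask = supplier == "A"
X_do = X.copy()
X_do.loc[mask, "machine_M1"] = 0
X_do.loc[mask, "machine_M2"] = 1
print("model's forecast under do(M2):", model.predict(X_do)[mask].mean().round(1))
print("ground truth under do(M2):   ", quality[mask].mean().round(1))
```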

Armed with this causal insight, the company informed Supplier A about the quality of their branches, which they ultimately were able to improve with new specifications. As such, leveraging causal AI averted a prolonged production stop of machine M1, which would have cost the company a lot of money. All of this just because the traditional machine learning model focuses on prediction, but not on understanding the underlying cause-and-effect relationships. Only a causal AI model could identify and rectify the true root cause of the quality issue.

In this simplified scenario, it would be easy to carefully check all parameters and production steps manually. But imagine a real-world scenario, in which we have hundreds or even thousands of parameters across many process steps. In such a setting, the clear association between machine M1 and quality, identified by traditional methods, can easily be mistaken for a root cause. And manually checking for other influence factors would be tedious, if not impossible. In this case, causal AI can identify the root cause immediately and, as such, saves a lot of time and costs.

Opportunities and challenges of causal AI in manufacturing

The opportunity of causal AI is clear: it offers new ways for manufacturing to identify the true root causes of problems. This depth of insight empowers manufacturers to make decisions that address core issues, leading to enhanced efficiency, quality, and competitive advantage. 

However, the adoption of causal AI is challenging. One significant hurdle is the absence of off-the-shelf software that can be used without being a data scientist. Moreover, as the above example showed, even seasoned data scientists often lack experience with causal AI. This is mainly because causal AI is a relatively new field. Despite these challenges, the potential gains in operational understanding and performance are substantial.

If you’re interested in finding out how causal AI can help your problem-solving efforts, we invite you to book a demo and experience the impact firsthand.

What are distributional shifts and why do they matter in industrial applications?

This article examines three types of distributional shifts: covariate shift, label shift, and concept shift, using a milling process simulation for downtime prediction. It emphasizes the need to monitor data distributions to maintain good model performance in dynamic environments like manufacturing.

What is distributional shift?

“Shift happens” – a simple yet profound realization in data analytics. Data distributions, the underlying fabric of our datasets, are subject to frequent and often unexpected changes. These changes, known as “distributional shifts,” can dramatically alter the performance of machine learning (ML) models.

A core assumption of ML models is that data distributions do not change between the time you train an ML model and when you deploy it to make predictions. However, in real-world scenarios, this assumption often fails, as data can be interconnected and influenced by external factors. An example of such distributional shifts is how ML models went haywire when our shopping habits changed overnight during the pandemic.

There are three primary types of distributional shifts: covariate shift, label shift, and concept shift. Each represents a different way in which data can deviate from expected patterns. This article explores these forms of distributional shifts in a milling process, where we assess the performance of ML models in a downtime prediction task.

Simulation setup

In order to illustrate the impact of distributional shifts in industrial settings, we set up a simulation for predicting machine downtime in a milling process. Our goal is to determine whether a machine will malfunction during a production run, which helps anticipate capacity constraints proactively.

Our primary variable of interest is “Machine Health,” a binary indicator where:

  • Machine Health = 0 indicates a machine breakdown.
  • Machine Health = 1 indicates the machine is running flawlessly.

Several covariates are considered to be potential predictors of Machine Health. These are:

  • Operating Temperature (°C)
  • Lubricant Level (%)
  • Power Supply (V)
  • Vibration (mm/s)
  • Usage Time (hours)

The relationship between these covariates and the Machine Health is encapsulated in the below ground-truth function. This function dictates the conditions under which a machine is likely to fail:
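Spelled out from the breakdown below, with $T$ denoting the Operating Temperature, $L$ the Lubricant Level, $P$ the Power Supply, $V$ the Vibration, and $U$ the Usage Time:

$$\text{Machine Health} = \begin{cases} 1 & \text{if } 50 \le T \le 60,\; 8 \le L \le 10,\; 210 \le P \le 230,\; V \le 3,\; U \le 20 \\ 0 & \text{otherwise} \end{cases}$$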

Here’s a breakdown of the formula:

  • If the Operating Temperature is below 50°C or above 60°C, there is a machine breakdown.
  • If the Lubricant Level is below 8% or above 10%, there is a machine breakdown.
  • If the Power Supply is below 210V or above 230V, there is a machine breakdown.
  • If the Vibration is above 3 mm/s, there is a machine breakdown.
  • If the Usage Time is above 20 hours, there is a machine breakdown.

To simulate this process, we generate data for 1000 production runs based on the above ground-truth function. A Random Forest classifier is then trained on this data. Note that the above ground-truth function is assumed to be unknown when making predictions. The classifier’s task is to predict potential breakdowns in future production runs based on the observed covariates.

No shift

Moving forward with our simulation, we now introduce a “no shift” scenario to evaluate the robustness of our trained Random Forest classifier under conditions where there is no distributional shift. We generate an additional 250 production runs, which serve as our test set during model deployment. These new runs are created under the same conditions and assumptions as the initial 1000 production runs used for training the model. It’s important to note that we are deliberately not introducing any changes to the underlying data distribution in this particular example. This allows us to evaluate how well the classifier performs when the test data closely mirrors the training data.
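As a rough sketch of this pipeline (the sampling ranges for the covariates are illustrative assumptions; the threshold conditions implement the ground-truth function above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def simulate_runs(n):
    # Sampling ranges are illustrative assumptions; the comparisons below
    # implement the ground-truth breakdown conditions.
    temp = rng.uniform(48, 62, n)     # Operating Temperature (°C)
    lube = rng.uniform(7, 11, n)      # Lubricant Level (%)
    power = rng.uniform(205, 235, n)  # Power Supply (V)
    vib = rng.uniform(0, 4, n)        # Vibration (mm/s)
    usage = rng.uniform(0, 25, n)     # Usage Time (hours)
    healthy = ((50 <= temp) & (temp <= 60)
               & (8 <= lube) & (lube <= 10)
               & (210 <= power) & (power <= 230)
               & (vib <= 3) & (usage <= 20))
    X = np.column_stack([temp, lube, power, vib, usage])
    return X, healthy.astype(int)     # Machine Health: 1 = running, 0 = breakdown

X_train, y_train = simulate_runs(1000)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# "No shift": the test set comes from the same data-generating process,
# so accuracy stays near-perfect.
X_test, y_test = simulate_runs(250)
print(accuracy_score(y_test, clf.predict(X_test)))
```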

Below, we present a comparative visualization of the variables’ distributions between model training and model deployment (i.e., predicting the Machine Health for the test set). The figure offers a clear perspective on the consistency of data across both the training set and the test set. By keeping the distribution of Operating Temperature, Lubricant Level, Power Supply, Vibration, and Usage Time identical to the training phase, we aim to replicate the ideal conditions under which the ML model was originally trained.

In this no-shift scenario, our Random Forest classifier achieves an accuracy of 100%. This result is not surprising, as the test data was created with the same data generation function as the training data. The classifier effectively recognizes and applies the patterns it learned during training, which leads to flawless predictions. This no-shift scenario serves as a benchmark against which we will compare the model’s performance in subsequent scenarios involving different types of distributional shifts.

Covariate shift

We now explore the concept of covariate shift, a common challenge in the application of ML models in real-world settings. Covariate shift occurs when there is a change in the distribution of the covariates between model training and model deployment (Ptrain(x) ≠ Ptest(x)), while the way these covariates relate to the outcome remains the same (Ptrain(y|x) = Ptest(y|x)). 

To demonstrate the effect of covariate shift, we construct a scenario for our downtime prediction task. We assume that our factory has received a substantially large order, which requires us to extend the Usage Time on our milling machine. This increase in Usage Time naturally leads to higher Operating Temperatures and a decrease in Lubricant Levels, which alters the data distribution for these specific covariates.

We simulate an additional 250 production runs under these modified conditions. This new data serves as our test set, which now differs in the distribution of key covariates (Operating Temperature, Lubricant Level, and Usage Time) from the original training set. Below, we visualize the differences in distributions of these covariates between model training and model deployment.

When applying our previously trained Random Forest classifier to this new test set, we observe a significant drop in accuracy, with the model achieving 72% accuracy. This decrease from the 100% accuracy seen in the no-shift scenario clearly illustrates the challenges posed by covariate shift. The model, trained on data with different covariate distributions, struggles to adapt to the new conditions, which leads to a noticeable reduction in its predictive accuracy.

Our results demonstrate the importance of monitoring covariate shifts in dynamic environments. Detecting these shifts is crucial, but in high-dimensional scenarios where multiple covariates may shift together, tracking individual features becomes a challenging task. Compared to our simulated environment, real-world applications may involve hundreds of covariates. To address this high complexity, one can turn to dimensionality reduction techniques like t-SNE (t-distributed Stochastic Neighbor Embedding).

t-SNE is a nonlinear dimensionality reduction technique that can be used to visualize high-dimensional data in a low-dimensional space. Such visualizations provide a clear perspective on how the data is distributed across different dimensions. The t-SNE plot below demonstrates this for our milling process, where the five covariates are reduced to two dimensions, while aiming to retain most information. It can be observed that the training data (gray) and testing data (blue) form distinct clusters, which visually indicates a covariate shift. This separation highlights the need to reassess our model’s predictive reliability under these new conditions.
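A minimal sketch of this diagnostic, reusing the simulated covariate arrays from the sketch above (here X_test would hold the covariate-shifted deployment data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed training and deployment covariates jointly into two dimensions.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([X_train, X_test])
)

n_train = len(X_train)
plt.scatter(emb[:n_train, 0], emb[:n_train, 1], c="gray", s=8, label="training")
plt.scatter(emb[n_train:, 0], emb[n_train:, 1], c="tab:blue", s=8, label="deployment")
plt.legend()
plt.title("t-SNE of the five covariates")
plt.show()
# Distinct clusters for the two sets are a visual indication of covariate shift.
```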

Label shift

Next, we delve into label shift, another form of distributional shift. Label shift occurs when the distribution of the labels changes between model training and model deployment (Ptrain(y) ≠ Ptest(y)), while the conditional distribution of the covariates given the label remains constant (Ptrain(x|y) = Ptest(x|y)).

To illustrate an example of label shift, we simulate a test set for another milling machine, which, compared to the first milling machine, is more prone to breakdowns. This change increases the likelihood of machine failures, thus altering the distribution of the Machine Health label. We generate data for 250 production runs under these new conditions, where the probability of breakdowns (Machine Health = 0) is higher than in our initial training dataset.

In the figure below, we visualize the distribution of the covariates and the Machine Health label across the training and testing data sets. The visual comparison clearly shows a shift in the label distribution, with a substantially higher frequency of breakdowns in the test data. Note that there is also a shift in the Vibration covariate. Here it is assumed that Vibration is a symptom of machine breakdowns. Following the definition of a label shift, the causal effect here is from the label (i.e., Machine Health) to the covariate (i.e., Vibration).

We now apply the Random Forest classifier to this new dataset and find that the model achieves an accuracy of 86%. This result marks a clear decrease from the perfect accuracy observed in the no-shift scenario. Having been trained on data where breakdowns were comparatively rare, the model now faces a scenario where breakdowns are much more common, which deteriorates its predictive accuracy.

This example highlights the need for models to be adaptable to changes in the label distribution, especially in dynamic environments where the frequency or nature of the target variable can vary over time. Detecting label shifts is more straightforward than identifying covariate or concept shifts. Regular examination of the class label distributions is a key approach to ensure they accurately represent the deployment environment.

Concept shift

Concept shift, or concept drift, is the third form of distributional shift that we will investigate. It is characterized by changes in the underlying relationship between covariates and labels over time (Ptrain(y|x) ≠ Ptest(y|x)). Such shifts mean that a model’s predictions may become less accurate as the learned relationships become outdated in the face of new data dynamics.

To illustrate concept shift, we introduce a new context to our simulation. Specifically, we assume that following multiple machine breakdowns we introduced a new maintenance routine for our milling machines. This new routine affects the relationship between our covariates and the Machine Health label. With better maintenance, the milling machines can operate for longer periods without breakdowns, thereby altering the way the Usage Time covariate relates to the Machine Health label. The new maintenance routine adjusts our ground-truth formula as follows:
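As an illustration, suppose the new routine doubles the tolerable Usage Time from 20 to 40 hours (the exact new threshold is an assumption; all other conditions stay unchanged):

$$\text{Machine Health} = \begin{cases} 1 & \text{if } 50 \le T \le 60,\; 8 \le L \le 10,\; 210 \le P \le 230,\; V \le 3,\; U \le 40 \\ 0 & \text{otherwise} \end{cases}$$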

We simulate 250 production runs with the updated maintenance routine for the test set. The distributions of the covariates and the Machine Health label in the training and testing datasets are depicted below. This deployment setting reflects the new reality where the relationship between the Usage Time and Machine Health has shifted due to improved maintenance practices. Despite similar covariate distributions, it can be observed that the number of machine breakdowns has been significantly reduced. 

The Random Forest classifier, which was initially trained under different operational conditions, now encounters data where the ground-truth relationship between variables has fundamentally changed. When applying the classifier to this new data, we observe an accuracy of 84%. This decrease from the no-shift scenario demonstrates the impact of concept shift on the model’s predictive accuracy. 

Detecting concept drift is challenging because it can often appear gradually. A general strategy to detect this form of distributional shift is to systematically monitor the performance of ML models over time. This continuous assessment helps in pinpointing when the model’s predictions start deviating from expected outcomes, suggesting that the underlying relationships between covariates and labels might have changed.

Conclusion

This article highlights a fundamental challenge in real-world applications of ML: data is constantly changing, and models must be adapted accordingly. We conducted simulations of machine downtime in a milling process to showcase the challenges posed by covariate, label, and concept shifts. In an ideal no-shift scenario, our model achieved perfect accuracy, but this quickly changed under real-world conditions of shifting data distributions.

Conventional ML models, like Random Forests, are increasingly used for industrial applications. While these methods have been made accessible to a wide audience through open source libraries (e.g., scikit-learn), they are often blindly deployed without fully assessing their performance in dynamic environments. We hope this article prompts practitioners to prioritize model monitoring and regular model retraining as key practices for preserving long-term model performance.

Readers interested in learning more about distributional shifts in manufacturing can find a case study from Aker Solutions in our recent paper, published in Production and Operations Management.

Industrial anomaly detection: Using only defect-free images to train your inspection model

This article explains why it is important to use an inspection approach that does not require images of defective products in its training set, and which kind of algorithm is suitable in practice.

Requirements for visual quality inspection

Industrial anomaly detection in quality inspection tasks aims to use algorithms that automatically detect defective products. It helps manufacturers achieve high quality standards and reduce rework.

This article focuses on industrial anomaly detection with image data. Modern machine learning algorithms can process this data to decide whether a product in an image is defective. To train such algorithms, a dataset of example images is needed.

An important feasibility criterion for manufacturers is the way these training datasets need to be compiled. For instance, some naive algorithms require large datasets to work reliably (around one thousand images or more for each product variant). This is expensive and often infeasible in practice. That’s why we only consider so-called few-shot algorithms that work reliably with a low number of examples, specifically far fewer than one hundred images.

Another aspect that distinguishes algorithms is whether examples of defective products are needed. Here, we can broadly distinguish two classes of algorithms: (1) “generative algorithms” that can learn from just normal (or defect-free) products, and (2) “discriminative algorithms” that require both normal and anomalous (or defective) images.

This is an important distinction for two reasons. First, anomalies are often rare, and when the manufacturing of new product variants starts up, no defective data is available for training. Second, by definition “anomalous” is everything that is not normal, which makes it practically impossible to cover all possible anomalies with sufficient training data. The latter is the more important argument, so let’s look at it in more detail.

Figure 1: Example from PCB manufacturing. The green connectors on the bottom of the PCB need to be mounted correctly as shown in (a). Possible defects are misplaced, missing or incorrect connectors. Examples of missing connectors are shown in (b), (c), and (d).

Figure 1 illustrates this. The single example in (a) should already give you a good impression of the concept of “normal.” By contrast, the training images in (b) and (c) are by no means sufficient to define the concept of “anomalous” (e.g., other defect types such as discolorations or misplacements are not represented).

Choosing the right type of algorithm

To better understand how discriminative and generative models differ when applied to anomaly detection, we use the PCB example in Figure 1 to construct a hypothetical scenario. For the sake of simplicity, a discriminative algorithm can be thought of as a decision boundary in a high-dimensional feature space. Each image becomes a point in that space, and lies either on the “normal” or the “anomalous” side of the boundary. Figure 2 simplifies this even further, down to a two-dimensional feature space. Such algorithms look at the training data of the two classes (normal and anomalous) and try to extract discriminating features when constructing the decision boundary. As such, these algorithms are likely not robust to unseen and novel defect types.

Figure 2: The two subfigures show a simplified two-dimensional feature space of a discriminative model. The dashed line is the decision boundary of the model after training. The dots correspond to training and test images, where green means defect-free and red means defective. Dots with a black border were in the training set, the others were not. The letters refer to the images in Figure 1. (a) Only contains images that were used to construct the discriminative model (training images). (b) Contains both training and test images, highlighting the difficulties of a discriminative model to generalize to all types of anomalous images.

To see how a discriminative algorithm fails in practice, recall that anomalous is everything that is not normal, and consider that normal images tend to occupy only a small volume in the wide feature space. By contrast, the surrounding space of anomalous images is vast. It is thus very unlikely to gather sufficiently numerous and different examples of such images for training.

In the example of Figure 2, the training images (with a black outline) happen to cover just the lower part of the space, and the resulting decision boundary is good at making that distinction. But it does not encode the fact that defective products can also lie further above, to the right, or to the left – which is where the unseen example 1(d) happens to lie.

Figure 2 illustrates the problem with discriminative models when defect types are not part of the training set. The decision boundary may end up working well on the training data, but previously unseen defects can easily end up on the “normal” side of the boundary. Concretely, in this example, the image 1(d) happens to be closer in feature space to the non-defective images than to the defective images 1(b) and 1(c).

For this reason, we strongly advocate using algorithms that focus on learning the concept of normality instead, and can thus be trained solely from normal images. Such algorithms can also benefit from defective images in their training set, in order to improve robustness to specific types of defects, but crucially, they do not require them. Using ML terminology, we seek industrial anomaly detection algorithms that explain how normal data is generated, as opposed to discriminating normal from anomalous images. Such models can represent the generative process behind normal data. This can be used to judge whether or not an image could have been created via this generative process. If not, then the image is anomalous.
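This is not the algorithm behind any particular product, but as a toy illustration of the generative idea, here is a sketch that models “normal” with a PCA subspace fitted only on defect-free feature vectors and scores images by reconstruction error (the feature vectors and dimensions are random stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-ins for image feature vectors (e.g., flattened pixels or embeddings
# from a pretrained network); only normal examples are used for fitting.
normal_features = rng.normal(0.0, 1.0, (50, 256))

pca = PCA(n_components=10).fit(normal_features)  # low-dimensional "normal" subspace

def anomaly_score(features):
    # Reconstruction error: how poorly the normal subspace explains an image.
    recon = pca.inverse_transform(pca.transform(features))
    return np.linalg.norm(features - recon, axis=1)

# Threshold chosen from normal data alone -- no defective examples required.
threshold = np.quantile(anomaly_score(normal_features), 0.99)

# Images far from the normal manifold (here: shifted stand-ins) get flagged,
# regardless of whether that defect type was ever seen before.
test_features = rng.normal(0.0, 1.0, (5, 256)) + 4.0
print(anomaly_score(test_features) > threshold)
```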

Conclusion

The Inspector offered by EthonAI provides a state-of-the-art solution for manufacturers to the problem of visual inspection. The EthonAI Inspector performs anomaly detection with generative algorithms that can be trained with just a few defect-free images. This is a great advantage in manufacturing environments, where gathering images is expensive, especially if examples of defects need to be in the training data. In addition, the algorithms we deploy are by nature robust to unseen defects, as outlined above. We constantly observe that customers can uncover new defect types in their manufacturing process that they were unaware of before. This significantly improves the quality assurance process as a whole.

Generative modeling (or generative AI) has seen tremendous successes in the past years. It is expected that the usage of such models will continue to grow in manufacturing and help set new quality standards. Most real-world scenarios require knowledge on how normal images are generated, including factors of allowed variations such as lighting and position. EthonAI will continue to push the limits of such algorithms, and help you ensure that you don’t ship defective products to your customers.

A terrible idea: Using Random Forest for root cause analysis in manufacturing

Over the past years, our interactions with industry experts have revealed a significant trend for root cause analysis in manufacturing. As the data coverage in modern factories increases, we are observing a growing number of manufacturers who adopt “off-the-shelf” machine learning (ML) algorithms in place of traditional correlation-based methods. Despite the initial promise of conventional ML algorithms, like Random Forests, this article demonstrates that they can provide seriously misleading conclusions.

Feature importance is the wrong proxy for root causes

Many manufacturers are trending towards the use of Random Forests for root cause analysis. There’s been an observable shift toward tree-based algorithms to understand the relationship between a set of production parameters and undesirable production outcomes. A popular method involves training a Random Forest to establish predictive relationships, and then utilizing “feature importance” to assess how strongly each production parameter (e.g., temperature) predicts a certain outcome of interest (e.g., quality losses). The key assumption here is that highly predictive parameters are also important for explaining the underlying production problems that need to be addressed. However, the core fallacy of this approach lies in the fact that Random Forests are designed for prediction tasks, which fundamentally differ from the objectives of root cause analysis in manufacturing.

So, why is the use of Random Forests for root cause analysis a bad idea? At its core, the issue is that predictive power should not be confused with causal effects. Root cause analysis aims to identify factors that significantly affect the financial bottom line. However, as we demonstrate in this article, the parameters deemed most predictive by a feature importance analysis don’t necessarily align with these critical factors. Moreover, a factory represents a structured flow of processes. Conventional ML algorithms neglect a factory’s process flow by simplifying it to mere tabular data. This oversimplification can have severe consequences where important causal relationships are entirely overlooked.

In the following, we’ll delve into the limitations of using Random Forests for root cause analysis. We’ll use two straightforward examples from a hypothetical cookie factory, both aimed at uncovering the root causes of quality problems. In the first example, we demonstrate that Random Forests are sensitive to outliers and their feature importance overemphasizes the relevance of rare events. In the second example, we show that Random Forests are incapable of exploring root cause chains and thus fail to uncover the true sources of quality problems. Additionally, we’ll demonstrate how EthonAI’s graph-based algorithms effectively address these shortcomings.

The cookie factory

Let’s introduce a practical use-case for our analysis: a cookie factory. In the figure below, you’ll find the layout of this factory, from incoming goods to final quality control. The cookie factory is designed to produce orders in batches of 100 cookies each. While the overall setup is a simplification, it effectively captures the essence of a real-world production environment. Our focus here is to understand how various parameters interact and how they relate to the overall quality of the cookies. To this end, we’ll generate synthetic data based on two different scenarios.

Our cookie factory’s process flow begins with the arrival of raw ingredients. Flour, sugar, butter, eggs, and baking powder are supplied by two different providers, labeled Supplier A and Supplier B. To maintain a standard in each batch, a production order of 100 cookies exclusively uses ingredients from one supplier. Though the ingredients are fundamentally the same, subtle variations between suppliers can influence the final product.

Next is the heart of the cookie factory – the baking process. Here, the ingredients are mixed to form a dough, which is then shaped into cookies. These cookies are baked in one of three identical ovens. Precise control of the baking environment is key, as even minor fluctuations in temperature or baking duration can significantly impact cookie quality. For every batch of cookies, we record specific details: the oven used (Oven_ID), the average temperature during baking, and the exact baking duration. These data points provide valuable insights into the production process.

The final stage in our factory is quality control, which is conducted by an automated visual inspection system. This system spots and rejects any defective cookies – be they broken or burnt. We’ll use “yield” as a quality metric. Yield is defined as the share of cookies in one production order that meet our quality standards and are ultimately delivered to the customer (e.g., if 95 out of 100 cookies pass quality control, the yield equals 95%).

In our subsequent analyses, we’ll dissect how production parameters like Supplier_ID, Oven_ID, Temperature, and Duration influence the quality of our cookies. Our goal is to explore the interplay of these parameters that determines why some cookies make the cut while others have to be thrown away.

For our upcoming examples, we’ll simulate production data from our cookie factory. For this, we will create synthetic data for our production parameters, namely the Supplier_ID, Oven_ID, Temperature, and Duration. Additionally, we have to establish a “ground-truth” formula that models the relationship between these parameters and the cookie quality. We model the quality of each cookie production batch based on the following two parameters: the Temperature and the Duration. We’ll use the following formula to simulate the yield for each production batch:
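A closed form consistent with the breakdown and the worked example below (the coefficients are reverse-engineered from that example and are therefore assumptions) is

$$\text{Yield} = 100 - 0.5\,|\text{Temperature} - 200| - 2.5\,|\text{Duration} - 20|$$

so that a 10° Celsius temperature deviation costs 5 percentage points of yield, as does a 2-minute duration deviation.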

Here’s a breakdown of the above formula:

  • The ideal baking temperature is set at 200° Celsius. Deviations from this temperature reduce the yield.
  • Similarly, the ideal baking duration is 20 minutes. Any deviation from this time affects the yield negatively.

Let’s consider an illustrative example: Imagine a cookie batch is baked at 210° Celsius for 22 minutes in Oven-1 using ingredients from Supplier A. The yield calculation would be: 100 – 5 (Temperature deviation) – 5 (Duration deviation) = 90%. This means 90 cookies pass quality control and 10 cookies are thrown away. Note that the above formula represents the actual modeled relationships in our scenario, but is assumed to be unknown for our root cause analysis.

Scenario 1: Predictive modeling is not the right objective for root cause analysis

In this first scenario, we’ll expose a critical weakness of Random Forests: their tendency to overfit outliers and, thus, to overestimate the relevance of rare events. While Random Forests can quantify predictive power via feature importance, they ignore the frequency and magnitude of each parameter’s financial impact. Our subsequent example with simulated data sheds light on this critical problem.

Setting up the simulation for Scenario 1

We start by simulating a dataset with 500 production batches based on our ground-truth quality formula. In order to demonstrate that Random Forests are very sensitive to outliers, we introduce a significant data imbalance into our dataset: only the first batch uses raw ingredients from Supplier B, whereas the remaining 499 batches use ingredients from Supplier A. Furthermore, we assume that after the first batch was produced, one of the cookie factory’s employees accidentally dropped the entire batch on the floor, breaking all the cookies. Consequently, the first batch of cookies has a yield of 0%. This exaggerated incident is specifically designed to highlight Random Forests’ sensitivity to outlier events. Such outlier events happen rarely, and hence, over an extended period, they impact the quality only minimally. Moreover, since those outlier events are often due to human error, they may not be avoidable. As such, a good root cause analysis should not identify the outlier event or anything that is spuriously correlated with it.

Here’s a snapshot of the simulated dataset:
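A sketch of how such a dataset could be generated, using the illustrative yield formula from above (the sampling spreads around the ideal setpoints are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Spreads around the ideal setpoints (200 °C, 20 min) are illustrative assumptions.
df = pd.DataFrame({
    "Batch_ID": [f"{i:03d}" for i in range(1, n + 1)],
    "Supplier_ID": ["B"] + ["A"] * (n - 1),   # only batch 001 uses Supplier B
    "Oven_ID": rng.choice(["Oven-1", "Oven-2", "Oven-3"], n),
    "Temperature": rng.normal(200, 5, n).round(1),
    "Duration": rng.normal(20, 1, n).round(1),
})
df["Yield"] = (100 - 0.5 * (df["Temperature"] - 200).abs()
                   - 2.5 * (df["Duration"] - 20).abs()).clip(lower=0)
df.loc[0, "Yield"] = 0.0                      # batch 001 was dropped on the floor
print(df.head())
```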

Root cause analysis with Random Forest and feature importance

We now use a Random Forest model to analyze the above dataset of 500 production batches. The model is trained to predict the yield based on four parameters: Supplier_ID, Oven_ID, Temperature, and Duration. Subsequently, we compute the feature importance of each parameter to identify which of them is the most predictive of the overall cookie yield.
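Sketched with scikit-learn, continuing from the simulated df above (the exact scores will vary with the simulated data; the article reports 0.60, 0.21, and 0.19):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One-hot encode the categorical parameters, keep the numeric ones as-is.
X = pd.get_dummies(df[["Supplier_ID", "Oven_ID"]], dtype=int).join(
    df[["Temperature", "Duration"]]
)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, df["Yield"])

for name, score in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:15s} {score:.2f}")
# The single 0%-yield batch makes the Supplier_ID dummies look highly predictive,
# even though the supplier has no effect in the ground-truth formula.
```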

The feature importance analysis identifies Supplier_ID as the most predictive parameter with a score of 0.60, followed by Temperature at 0.21 and Duration at 0.19. This ranking suggests that the supplier has the largest effect on cookie quality. Since we know how the data was generated (the Supplier_ID is unrelated to the quality), we can immediately establish that this finding is wrong. In fact, the Random Forest attributes high feature importance to the Supplier_ID because Supplier B was only used for the first production batch (i.e., Batch_ID = 001), which has a yield of 0% because it was accidentally dropped on the floor. Hence, the Random Forest placed high importance on this outlier event and erroneously identified the Supplier_ID as its cause.

Root cause analysis with EthonAI Analyst

We now apply the EthonAI Analyst software to conduct the same analysis for the first scenario. The EthonAI Analyst makes use of graph-based algorithms that have been particularly designed for root cause analysis in manufacturing. One of their key abilities is to account for the frequency of root causes, which identifies the production problems that truly matter for cost reduction.

Upon analyzing the data, the EthonAI Analyst presents a ranking of parameters based on their impact. It attributes high importance to Temperature and Duration, thereby recognizing these as the primary factors affecting cookie quality. Notably, the Supplier_ID is deemed inconsequential because the EthonAI Analyst effectively avoids the overfitting issues encountered with the Random Forest. This demonstrates the importance of accounting for both the frequency and the impact of root causes to effectively identify the parameters that have a consistent effect on quality.

Scenario 2: A factory cannot be represented by tabular data

In our second scenario, we show how the inability of Random Forests to accurately model process flows leads to missed opportunities for quality improvement. Random Forests operate fundamentally differently from the established problem-solving methodologies that are used in manufacturing. For example, the 5-Why method involves a procedure of repeatedly asking “why” to trace a root cause to its origin. However, since conventional ML algorithms treat factories as static data tables rather than dynamic processes, they fail to employ the backtracking logic that is essential for analyzing root cause chains.

Setting up the simulation for Scenario 2

We again simulate a dataset of 500 production batches with the same quality formula as before. Unlike the previous scenario, both suppliers A and B now supply a similar amount of ingredients across production orders. To illustrate how Random Forests fail to account for ripple effects throughout a factory, we add an additional complexity to the dataset. Specifically, we introduce a calibration issue in the baking process, which affects the temperature measurements of Oven-1, Oven-2, and Oven-3. This creates a root cause chain where the Oven_ID indirectly affects the yield by influencing the temperature. Below we visualize the temperature distributions across the different ovens, which show that Oven-2 and Oven-3 deviate more from the optimal temperature of 200° Celsius than Oven-1.
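Continuing the simulation sketch from Scenario 1, the calibration issue can be modeled by letting the measured temperature depend on the oven, which creates the chain Oven_ID → Temperature → Yield (the per-oven offsets and spreads are illustrative assumptions):

```python
# Per-oven calibration offsets (assumed): Oven-2 runs hot, Oven-3 runs cold.
oven = rng.choice(["Oven-1", "Oven-2", "Oven-3"], n)
offset = pd.Series(oven).map({"Oven-1": 0.0, "Oven-2": 6.0, "Oven-3": -6.0}).to_numpy()
temperature = rng.normal(200, 2, n) + offset
duration = rng.normal(20, 1, n)

# Yield uses the same illustrative formula; the oven only acts via temperature.
yield_pct = (100 - 0.5 * np.abs(temperature - 200)
                 - 2.5 * np.abs(duration - 20)).clip(min=0)
```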

Root cause analysis with Random Forest and feature importance

We again use a Random Forest to analyze the new dataset of 500 production batches. As before, the model is trained on Supplier_ID, Oven_ID, Temperature, and Duration to predict the resulting yield. We then compute the feature importance to determine the most predictive production parameters.

The model identifies Temperature as the most significant parameter with a feature importance of 0.64, followed by Duration at 0.35. Notably, both Oven_ID and Supplier_ID have a feature importance of 0.00, implying they have no impact on the yield. However, since we know the underlying data generation process, we can confirm that the lack of feature importance attributed to Oven_ID is incorrect. This error occurs because the model fails to capture how Oven_ID indirectly affects yield through the Temperature parameter.

Root cause analysis with EthonAI Analyst

We now repeat the analysis for the second scenario with the EthonAI Analyst software. Unlike Random Forests, the EthonAI Analyst employs our proprietary algorithms that capture process flows by modeling them as a causal graph. This helps identify complex root cause chains and track production problems back to their actual root cause.

Examining the results from the EthonAI Analyst presents a contrasting view to the Random Forest’s results. Like the Random Forest, it identifies Temperature and Duration as critical parameters. However, the EthonAI Analyst correctly recognizes Oven_ID as a significant parameter too. This is clearly illustrated in the extracted graph, which reveals Oven_ID’s indirect influence on yield through the Temperature parameter.

Conclusion

Random Forests have become popular for root cause analysis in manufacturing. However, this article demonstrates that they have serious limitations. In two simple scenarios with just four production parameters, we showed that Random Forests fail to accurately identify the root causes of simulated quality losses. The first scenario showed their tendency to confuse predictive power with financial impact. The second scenario illustrated their inability to trace chains of root causes. This raises an important concern: are conventional ML algorithms like Random Forests reliable enough when it comes to analyzing hundreds of production parameters in complex factories? Our findings suggest they are not.

Moving forward, we advocate for the adoption of graph-based algorithms. Compared to conventional ML algorithms, they provide more accurate insights and identify the problems that truly hit the bottom line. We hope this article inspires professionals to pursue more robust and effective root cause analysis in their factories. If you’re intrigued and want to explore the capabilities of graph-based algorithms, we encourage you to book a demo and see the difference for yourself.

What data is needed for AI-based root cause analysis?

In manufacturing, one concern unites everyone from line workers to top management: the quality of the goods produced. Continuous improvement of the baseline quality and fast reaction to quality issues are the keys to success. AI-based root cause analysis is the essential tool for effective quality management, and the data is the fuel. However, what kind of data is needed for effective root cause analysis in manufacturing? This article provides an overview.

Quality Metrics

A common hurdle in quality management, surprisingly, lies in establishing a robust quality metric. If we cannot accurately measure how good the quality is, we can neither monitor its stability nor judge whether our improvement actions are successful.

Although we learned in Six Sigma training how to qualify a measurement such that it meets the standards of ANOVA gauge R&R, it’s not guaranteed that we can set up such a measurement in practice. And even if we have a solid measurement set up, how many of us have ever worked with rock-solid pass/fail criteria? And how many of us have always resisted the temptation of “can’t we just move the spec limits a bit?” when we had to solve a quality issue? The first step towards data-driven root cause analysis should always be to make sure that we have a quality metric that we can trust and that has a fixed target value.

Process Data

The second challenge is a simple but sometimes overlooked fact: the best algorithm will fail to find a root cause if that root cause hasn’t left its traces in the data that we use for the analysis. Collecting a bunch of production data and throwing it into an AI tool can lead to interesting insights, but if the AI only finds meaningless relations, it may well be because there was nothing useful to be found in the data.

In that case it makes sense to take a step back and ask: what kind of issues have we solved in the past? Would these issues have been detectable with the available data? If not, can we add a sensor that records the missing information? Expert knowledge and domain knowledge can often be worked into the data collection by linking data from different sources. The more expert knowledge goes into data collection, the more straightforward it becomes to translate the results of an AI-driven root cause analysis into an improvement action.

Figure: Where in a production flow the EthonAI Analyst collects input and output data.

Linking of Data

Now that we have quality data and process data, they must be linked together. It is not enough to know that the temperature in equipment A was 45°C and the raw material was provided by Supplier B; we need to know which of the products that end up in the quality check were affected by these process conditions. Some manufacturers use unique batch IDs inscribed on their products, some use RFID tags to track them, but sometimes we simply have a linear flow of products without any identification. In this case, we can rely on timestamps and the knowledge of the time delay between process and quality check. There can be some uncertainties in this timestamp matching, but in most cases the AI algorithms are sufficiently robust to handle them.
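A minimal sketch of such timestamp matching with pandas, assuming a roughly 30-minute delay between the process step and the quality check (the table layouts, column names, and delay are hypothetical):

```python
import pandas as pd

# Hypothetical tables: process readings and downstream quality checks.
process = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 08:00", "2024-05-01 08:10"]),
    "temperature": [45.0, 47.5],
    "supplier": ["B", "B"],
})
quality = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 08:31", "2024-05-01 08:41"]),
    "passed": [True, False],
})

# Shift each quality timestamp back by the assumed delay, then match it to the
# most recent process record at or before that time.
delay = pd.Timedelta(minutes=30)
linked = pd.merge_asof(
    quality.assign(ts_proc=quality["ts"] - delay).sort_values("ts_proc"),
    process.sort_values("ts"),
    left_on="ts_proc",
    right_on="ts",
    direction="backward",
    tolerance=pd.Timedelta(minutes=10),  # guards against implausible matches
    suffixes=("_quality", "_process"),
)
print(linked[["ts_quality", "passed", "ts_process", "temperature", "supplier"]])
```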

Routing History

There are many production setups in which multiple machines can perform the same task and, depending on availability, one machine or the other gets used for a given product. In this case, the routing information is highly valuable data for root cause analysis. Even if the equipment is too old to produce and transmit data about process conditions, the simple fact that one machine was used for many of the failed products can give a crucial hint to the process engineers, who can then track down and fix the issue.

Process Sequence 

Lastly, sophisticated root cause analysis tools leverage information on how the products flow through the sequence of process steps to deduce causal relationships and map out chains of effects. Providing these tools with chronological process sequences can rule out irrelevant causal connections, enhancing both the speed and reliability of the analysis.

Conclusion

When embarking on the journey of AI-based root cause analysis in manufacturing, remember these key points: 

  • prioritize a robust quality metric, 
  • integrate expert knowledge in data collection, 
  • establish clear links between process and quality data, 
  • value routing information, 
  • and utilize chronological process information. 

By focusing on these areas, manufacturers can significantly enhance their quality management processes, leading to operational excellence and sustained success.