Multivariate Process Monitoring
Or, put differently, the SD of a sample is the distance to the center, measured in the hyperplane of the PCA projection, and the OD is the distance to this hyperplane. Obviously, both SD and OD depend on the number of PCs. When a sample lies above the horizontal threshold it is too far away from the PCA subspace; when it lies to the right of the vertical threshold it is too far from the other samples within the PCA subspace. The horizontal and vertical thresholds are derived from χ² approximations (Todorov and Filzmoser 2009). For the Grignolino data, this leads to the plot in Fig. 11.6. Several of the most outlying samples are indicated with their indices, so that they can be inspected further. A biplot method is also available, which shows a plot that is very similar to the right plot in Fig. 11.5. Inspection of the data shows that objects 63 and 15 do contain extreme values in some of the variables; indeed, object 63 is also the object with the smallest score on PC 1 in a classical PCA. However, it would probably be too much to remove them from the data completely.
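The two distances can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: two variables, a single PC, and a classical (non-robust) covariance estimate rather than the robust PCA used for the figures. The function names `fit_one_pc` and `score_and_orthogonal_distance` are hypothetical and do not come from any package mentioned in the text.

```python
import math

def fit_one_pc(points):
    """Fit a one-component classical PCA to 2-D points.

    Returns the center, the unit loading vector of PC 1 and its variance.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    # Closed-form direction of the largest eigenvalue of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    px, py = math.cos(theta), math.sin(theta)
    var1 = sxx * px * px + 2 * sxy * px * py + syy * py * py
    return (cx, cy), (px, py), var1

def score_and_orthogonal_distance(point, center, loading, var1):
    """Return (SD, OD) of one point with respect to a one-PC model."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    t = dx * loading[0] + dy * loading[1]            # score on PC 1
    rx, ry = dx - t * loading[0], dy - t * loading[1]  # part outside the subspace
    sd = abs(t) / math.sqrt(var1)   # distance to the center within the subspace
    od = math.hypot(rx, ry)         # distance to the subspace itself
    return sd, od

# Training data lying exactly on the line y = x.
train = [(-2.0, -2.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
center, loading, var1 = fit_one_pc(train)
print(score_and_orthogonal_distance((4.0, 4.0), center, loading, var1))   # far along the PC
print(score_and_orthogonal_distance((1.0, -1.0), center, loading, var1))  # off the PC
```

The first test point has a large SD and an OD of essentially zero, because it lies in the PCA subspace but far from the center; the second has the opposite pattern. With more PCs, the SD scales each score by the variance of its component before combining them.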
11.2.2 Discussion

A robust approach can be extremely important in cases where one suspects that some of the data are outliers. Classical estimates can be very sensitive to extreme values, and it frequently occurs that only one or very few samples dominate the rest of the data. This need not be an error, because influential observations may be correct, but in general one would put more trust in a model that is based on many observations rather than a few. This is not in contradiction with the desire to build sparse models, as seen in the section on SVMs, for example: there, the sparseness was obtained by selecting only those objects in the relevant part of the space, using all other objects in the selection process.
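The sensitivity of classical estimates to a single extreme value is easy to demonstrate in the univariate case. The sketch below compares the mean and standard deviation with the median and the MAD (median absolute deviation, scaled by the usual consistency factor 1.4826 for normal data). The data values and the helper `mad` are made up for illustration; the multivariate robust estimators discussed in this section are considerably more involved.

```python
from statistics import mean, median, stdev

def mad(values, scale=1.4826):
    """Scaled median absolute deviation, a robust estimate of spread."""
    m = median(values)
    return scale * median([abs(v - m) for v in values])

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean[:-1] + [100.0]   # one value replaced by a gross outlier

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(label,
          "mean=%.2f sd=%.2f" % (mean(data), stdev(data)),
          "median=%.2f mad=%.2f" % (median(data), mad(data)))
```

A single outlier moves the mean and inflates the standard deviation by orders of magnitude, whereas the median and MAD are unchanged in this example.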
The robust methods in this section have a wider applicability than just outlier detection: they can be used as robust plug-in estimators in classification and regression methods. Robust LDA can be obtained, for example, by using a robust estimate of the pooled covariance matrix; robust QDA by using robust covariances for all classes. PCR can be robustified in several ways, e.g., by applying SVD to a robust covariance matrix estimate; an alternative is formed by regressing on robust scores, for instance from the ROBPCA algorithm. One can even replace the least squares regression by robust regression methods such as least trimmed squares (Rousseeuw 1984). Robust versions of PLS regression also exist (Hubert and Branden 2003; Liebmann et al. 2009). These robust versions of classification and regression methods share the big advantage that one can safely leave in all objects, even though some of them may be suspected outliers: the analysis will not be influenced by only a couple of atypical observations. And to turn the question of outliers around: if robust and classical analyses give the same or similar results, then one can conclude that there are no (influential) outliers in the data.
Note that here we have concentrated on identifying whole records as outlying observations, i.e., rows in our data matrix. This is not the only way to approach the issue. One could also say that certain variables, columns in the data matrix, show deviating behaviour. This is a situation, however, that is less likely to wreak havoc: many multivariate methods, especially in supervised approaches, are geared towards obtaining the optimal weights for each of the variables. If the outlying column led to worse results, it would probably get a low weight anyway. In unsupervised approaches such as PCA the variable would stand out, and one can then easily identify it as a potential problem and decide how to tackle it. Only in distance- or kernel-based methods would we run the risk of obtaining suboptimal results. Finally, it is also possible (in fact, rather likely) that individual data values are grossly incorrect, for whatever reason. Since these have less influence on the overall model than outlying whole rows, in many cases they can be disregarded. However, approaches have recently been developed to identify such cases (Rousseeuw and den Bossche 2018).
R contains many packages with facilities for robust statistics, the most important one probably being robustbase. According to the task view on CRAN, plans exist to further streamline the available packages, using robustbase as the basic package for robust statistics, with several more specialized packages building on it, as is already the case for packages like rrcov.
11.3 Multivariate Process Monitoring

Robust methods like the PCA methods from the previous section try to focus on the big picture, simply ignoring individual data points that do not conform to the general trend. However, such data points may also contain valuable information, especially when occurring in groups or in a particular order: then, they may point to imminent changes in process conditions that are not always beneficial. The idea is that these deviations, when noticed early enough, can be corrected for by changing appropriate process parameters. In this simple way, a control mechanism can be implemented. Normal operating behaviour is typically defined by expert knowledge and historical data. In industry, statistical process control (Montgomery 2001) has been in use for decades to monitor and control deviations from normal operating behaviour; in the pharmaceutical industry it has become known as Process Analytical Technology (Chanda et al. 2015). In chemometrics, too, much research has been devoted to it over the years (Kourti and MacGregor 1995; Westerhuis et al. 2000; Kourti 2005; Challa and Potumarthi 2013).
A large number of tools are available, often based on simple plots of parameter values or functions of parameter values over time. A well-known example is given by the so-called Western Electric Rules (Western Electric Co. 1956), which provide decision rules for detecting out-of-control samples or non-random variation, e.g., a single point falling outside the 3σ limits, two consecutive points on the same side of the mean outside the 2σ limits, or a larger number of consecutive points (often seven, or nine) falling on the same side of the mean. Many sets of rules exist, all depending on heuristically defined control limits and/or action limits. What type of action needs to be taken depends on the situation. Even for a process that is in control one would expect these rules to be activated quite regularly, so the usual approach is to first investigate the matter more closely and only take further action if something is shown to be clearly wrong.
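The three example rules mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the complete Western Electric rule set: the function name `western_electric_flags`, the rule labels and the default run length of nine are choices made here, and the process mean and σ are assumed to be known from historical in-control data.

```python
RULE_3SIGMA = "beyond 3 sigma"
RULE_2SIGMA = "two consecutive beyond 2 sigma"
RULE_RUN = "run on one side of the mean"

def western_electric_flags(values, mean, sigma, run_length=9):
    """Return (index, rule) pairs for every point that triggers a rule."""
    z = [(v - mean) / sigma for v in values]   # standardized deviations
    flags = []
    run = 0   # length of the current run of points on one side of the mean
    for i, zi in enumerate(z):
        # A single point outside the 3-sigma limits.
        if abs(zi) > 3:
            flags.append((i, RULE_3SIGMA))
        # Two consecutive points outside the 2-sigma limits, same side.
        if i >= 1 and ((zi > 2 and z[i - 1] > 2) or (zi < -2 and z[i - 1] < -2)):
            flags.append((i, RULE_2SIGMA))
        # A long run of consecutive points on the same side of the mean.
        if i >= 1 and zi * z[i - 1] > 0:
            run += 1
        else:
            run = 1 if zi != 0 else 0
        if run >= run_length:
            flags.append((i, RULE_RUN))
    return flags

# Hypothetical measurements with an assumed in-control mean of 50 and sigma of 2.
measurements = [50.4, 49.1, 57.0, 54.6, 54.9, 50.2, 48.8]
for index, rule in western_electric_flags(measurements, mean=50.0, sigma=2.0):
    print(index, rule)
```

Each flag is a prompt for closer inspection rather than an automatic alarm, in line with the remark that these rules will fire now and then even when the process is in control.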