Posts

Showing posts from November, 2020

Multivariate Process Monitoring

Or, put differently, the SD of a sample is the distance to the center, measured in the hyperplane of the PCA projection, and the OD is the distance to this hyperplane. Obviously, both SD and OD depend on the number of PCs. When a sample is above the horizontal threshold it is too far away from the PCA subspace; when it is to the right of the vertical threshold it is too far from the other samples within the PCA subspace. The horizontal and vertical thresholds are derived from χ² approximations (Todorov and Filzmoser 2009). For the Grignolino data, this leads to the plot in Fig. 11.6. Several of the most outlying samples are indicated with their indices, so that they can be inspected further. Also a biplot method is available, which shows a plot that is very similar to the right plot in Fig. 11.5. Inspection of the data shows that objects 63 and 15 do contain some extreme values in some of the variables; indeed, object 63 is also the object w...
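To make the two distances concrete, here is a minimal Python sketch (not the R code the excerpt refers to). It assumes a PCA model that has already been fitted, given as a center, orthonormal loading vectors and the corresponding eigenvalues. The function name and the choice to scale the scores by the eigenvalues are assumptions, following the usual definition of the score distance. The χ² thresholds are not computed here.

```python
import math


def score_and_orthogonal_distance(x, center, loadings, eigenvalues):
    """Return (SD, OD) of one sample for a fitted PCA model.

    x, center   : sequences of length p
    loadings    : list of k orthonormal loading vectors, each of length p
    eigenvalues : list of k variances, one per retained PC
    (hypothetical helper, for illustration only)
    """
    centered = [xi - ci for xi, ci in zip(x, center)]

    # Scores: projection of the centered sample onto each loading vector.
    scores = [sum(p * c for p, c in zip(pc, centered)) for pc in loadings]

    # SD: distance to the center inside the PCA subspace,
    # with each score scaled by the variance of its PC.
    sd = math.sqrt(sum(t * t / lam for t, lam in zip(scores, eigenvalues)))

    # OD: length of the part of the sample that the subspace cannot explain.
    reconstruction = [
        sum(t * pc[j] for t, pc in zip(scores, loadings))
        for j in range(len(centered))
    ]
    od = math.sqrt(sum((c - r) ** 2 for c, r in zip(centered, reconstruction)))
    return sd, od


# Toy model: two variables, one PC along the first axis with variance 4.
sd, od = score_and_orthogonal_distance(
    x=[4.0, 3.0], center=[0.0, 0.0], loadings=[[1.0, 0.0]], eigenvalues=[4.0]
)
print(sd, od)
```

In this toy case the sample lies 4 units along the PC (two standard deviations) and 3 units away from the PCA subspace, which shows why adding or removing PCs changes both distances.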

Robust PCA

From the above it may seem natural to conclude that a robust form of PCA would be a good candidate to identify outliers in a multivariate data set. In some cases even classical PCA will work: what would be easier than to apply PCA to the data, and see the outliers far away from the bulk of the data? Although this sometimes does happen, and PCA in these cases is a valuable outlier detection method, in other cases the outliers are harder to spot. The point is that PCA is not a robust method: since it is based on the concept of variance, outliers will greatly influence scores and loadings, sometimes even to the extent that they will dominate the first PCs. What is needed in such cases is a robust form of PCA (Hubert 2009). Many different approaches exist, each characterized by their own breakdown point, the fraction of outliers that can be present without influencing the covariance estimates. The simplest form is to perform the SVD on a robust estimate of...

Multiple Imputation

Obviously we would like to assess how much our imputed values influence the scores. One way of doing this is to impute multiple times, and plot the scores as a much bigger point cloud. For the imputations, a more elaborate mechanism is needed than simply taking the mean or smallest value per column (repeating such an action would not be very informative). Function MIPCA from package missMDA (Josse and Husson 2016) provides several possible strategies. The default is to start with the imputed matrix from the iterative PCA algorithm in the previous paragraph. Parametric bootstrapping is used to sample residuals (assuming a normal distribution around zero, with the standard deviation given by the empirical standard deviation of the PCA residuals) which are used to generate (by default one hundred) bootstrap samples. Each of these bootstrap samples is then subjected to PCA. Finally, so-called Procrustes analysis is used to rotate all PCA solutions in su...
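The parametric bootstrap step described above can be sketched in plain Python. This is only an illustration of the idea and not the MIPCA implementation. The function name, its arguments and the toy numbers are assumptions, and the later steps (PCA on each sample, Procrustes rotation) are left out.

```python
import random
import statistics


def parametric_bootstrap(fitted, residuals, n_boot=100, seed=None):
    """Generate bootstrap data sets by adding normal noise to fitted values.

    fitted    : list of rows, the values reconstructed by the PCA model
    residuals : list of rows, observed minus fitted values
    n_boot    : number of bootstrap samples (one hundred, as in the text)
    (hypothetical helper, for illustration only)
    """
    rng = random.Random(seed)

    # Empirical standard deviation of all PCA residuals, pooled.
    sigma = statistics.stdev(r for row in residuals for r in row)

    samples = []
    for _ in range(n_boot):
        # Noise is drawn from a normal distribution centered at zero.
        samples.append(
            [[value + rng.gauss(0.0, sigma) for value in row] for row in fitted]
        )
    return samples


# Toy numbers, made up for the example.
fitted = [[1.0, 2.0], [3.0, 4.0]]
residuals = [[0.1, -0.1], [0.2, -0.2]]
boots = parametric_bootstrap(fitted, residuals, n_boot=3, seed=42)
print(len(boots), len(boots[0]), len(boots[0][0]))
```

Each bootstrap sample would then be analysed by PCA, so that the spread of the resulting score clouds reflects the uncertainty of the imputation.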

Ignoring the Missing Values

The first approach is simply to ignore the missing values. This can be done by calculating the covariance or correlation matrix with the argument use = "pairwise.complete.obs". Note that this (incorrectly!) assumes that the missing data are MCAR: one pretends that the correlations or covariances calculated with the subset of points that is observed do not, on average, differ from the values that would be calculated from the full matrix. As long as there are enough cases for which pairwise data are available, this will lead to a square matrix without any NA values from which scores or loadings can be derived, as explained in Sect. 4.2. A similar result would be obtained by calculating distances using only pairwise complete observations and then doing PCA on the distance matrix (PCoA, see Sect. 4.6.1). Let’s see how this works out for the arabidopsis data. First we need to decide on the scaling. Since the intensities are basically coun...
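As an illustration of what "pairwise complete" means, the following Python sketch computes a covariance matrix in which every entry uses only the rows where both variables are observed. It is written in the spirit of the use = "pairwise.complete.obs" argument, not as a copy of the R implementation. The function name and the toy data are assumptions, and missing values are represented by None.

```python
def pairwise_complete_cov(data):
    """Covariance matrix in which each entry uses all rows that are
    observed for that particular pair of variables.

    data : list of rows; missing values are given as None
    (hypothetical helper, for illustration only)
    """
    n_vars = len(data[0])
    cov = [[float("nan")] * n_vars for _ in range(n_vars)]

    for i in range(n_vars):
        for j in range(i, n_vars):
            # Keep only the rows where both variables are present.
            pairs = [
                (row[i], row[j])
                for row in data
                if row[i] is not None and row[j] is not None
            ]
            if len(pairs) < 2:
                continue  # too few cases: the entry stays NaN

            mean_i = sum(a for a, _ in pairs) / len(pairs)
            mean_j = sum(b for _, b in pairs) / len(pairs)
            c = sum((a - mean_i) * (b - mean_j) for a, b in pairs)
            cov[i][j] = cov[j][i] = c / (len(pairs) - 1)
    return cov


# Toy data: four samples, three variables, two missing values.
toy = [
    [1.0, 2.0, None],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
    [None, 8.0, 3.0],
]
for row in pairwise_complete_cov(toy):
    print(row)
```

Note that different entries of the matrix are based on different subsets of rows. This is exactly where the MCAR assumption enters, and it is also why the result is not guaranteed to be a proper (positive semi-definite) covariance matrix.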

Chemometric Applications

This chapter highlights some typical examples of research themes in the chemometrics community. Up to now we have concentrated on fairly general techniques, found in many textbooks and applicable in a wide range of fields. The topics in this chapter are more specific to the field of chemometrics, combining elements from the previous chapters. In particular, latent-variable approaches like PCA and PLS exhibit a wide range of applications (some people have criticized the field of chemometrics for being too preoccupied with latent-variable methods, and not without reason; on the other hand such tools are extremely handy in many different situations). To start with, we come back to the problem of missing values. Hard to avoid in many real-life applications, they often prevent the standard application of statistical methods. One example is PCA: the svd-based implementation in Chap. 4 does not allow missing values. We will discuss a coupl...