Robust PCA
From the above it may seem natural to conclude that a robust form of PCA would be a good candidate to identify outliers in a multivariate data set. In some cases even classical PCA will work: what would be easier than to apply PCA to the data, and see the outliers far away from the bulk of the data? Although this sometimes does happen, and PCA in these cases is a valuable outlier detection method, in other cases the outliers are harder to spot. The point is that PCA is not a robust method: since it is based on the concept of variance, outliers will greatly influence scores and loadings, sometimes even to the extent that they dominate the first PCs. What is needed in such cases is a robust form of PCA (Hubert 2009). Many different approaches exist, each characterized by its own breakdown point: the fraction of outliers that can be present without influencing the covariance estimates.
The simplest form is to perform the SVD on a robust estimate of the covariance or correlation matrix (Croux and Haesbroeck 2000). One such estimate is given by the Minimum Covariance Determinant (MCD, Rousseeuw 1984), which has a breakdown point of up to 0.5. As the name already implies, the MCD estimator basically samples subsets of the data of a specific size, in search of the subset that leads to a covariance matrix with a minimal determinant, i.e., covering the smallest hypervolume. The assumption is that the outlying observations are far away from the other data points, increasing the volume of the covariance ellipsoid. The size of the subset, to be chosen by the user, determines the breakdown point, given by (n − h + 1)/n, with n the number of observations and h the size of the subset. Unless one really expects a large fraction of the data to be contaminated, it is recommended to choose h ≈ 0.75n. The resampling approach can take a lot of time, and although fast algorithms are available (Rousseeuw and van Driessen 1999), matrices with more than a couple of hundred variables remain hard to tackle.
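The breakdown-point formula given above is easy to evaluate for a few subset sizes. The following is only a small sketch: the function name breakdown_point and the sample size n = 100 are chosen here for illustration and do not come from any package.

```r
# Breakdown point of the MCD estimator as given in the text: (n - h + 1) / n
# (hypothetical helper, not part of MASS or rrcov)
breakdown_point <- function(n, h) {
  (n - h + 1) / n
}

n <- 100                          # illustrative sample size
bp.75 <- breakdown_point(n, 75)   # h close to 0.75 n
bp.51 <- breakdown_point(n, 51)   # h just over half the data
print(c(h75 = bp.75, h51 = bp.51))
```

Smaller subsets tolerate more contamination, at the price of using fewer observations for the covariance estimate.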
The MCD covariance estimator is available in several R packages. One example is cov.mcd in package MASS. If we use this in combination with the princomp function, we can see the difference between robust and classical covariance estimation. Let's focus on the Grignolino samples from the wine data:

> library(MASS)
> X <- wines[vintages == "Grignolino", ]
> X.sc <- scale(X)
> X.clPCA <- princomp(X.sc)
> X.robPCA <- princomp(X.sc, covmat = cov.mcd(X.sc))

Visualization using biplots leads to Fig. 11.5:

> biplot(X.clPCA, main = "Classical PCA")
> biplot(X.robPCA, main = "MCD-based PCA")
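To see why the determinant is a sensible criterion, it may help to compare the two covariance estimates on a toy data set with a known group of outliers. This is a sketch on simulated data, not on the wine data; the object names X.toy, det.classical and det.mcd are made up for the illustration.

```r
library(MASS)  # provides cov.mcd

set.seed(1)    # cov.mcd uses random subsets, so fix the seed
# Toy data (assumed for illustration): 90 regular points around the origin
# and 10 outliers shifted to around (10, 10)
X.toy <- rbind(matrix(rnorm(180), ncol = 2),
               matrix(rnorm(20, mean = 10), ncol = 2))

# The classical covariance is inflated by the outlying group ...
det.classical <- det(cov(X.toy))
# ... whereas the MCD estimate is based on a subset that avoids it
det.mcd <- det(cov.mcd(X.toy)$cov)

print(c(classical = det.classical, mcd = det.mcd))
```

The outlying group stretches the classical covariance ellipse considerably, so its determinant should come out much larger than that of the MCD estimate.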
There are clear differences in the first two PCs: in the classical case PC 1 is dominated by the variables OD ratio, flavonoids, proanth and tot. phenols, which gives samples 63, 66, 15 and 1, 2, and 3 extreme coordinates. In the robust version, on the other hand, these samples have relatively small PC 1 scores. Rather, they are extremes of the second component, the result of the increased influence on the first component of variables (inversely) correlated with ash. Although many of the relations in the plots are similar (the main effect seems to be a rotation), the example shows that even in cases where one would not expect it, applying (more) robust methods can lead to appreciable differences.
An important impediment for the application of the MCD estimator is that it can only be calculated for non-fat data matrices, i.e., matrices where the number of samples is larger than the number of variables. In other cases, the covariance matrix is singular, with a determinant of zero, and another approach is necessary. One example is ROBPCA (Hubert et al. 2005), combining Projection Pursuit (PP) and robust covariance estimation: PP is employed to find a subspace of lower dimension in which the MCD estimator can be applied. ROBPCA has one property that we also saw in ICA (Sect. 4.6.2): if we increase the number of PCs there is no guarantee that the first PCs will remain the same, and in fact they usually do not. Obviously, this can make interpretation somewhat difficult, especially since the method to choose the "correct" number of PCs is less obvious in robust PCA than in classical PCA (Hubert 2009).

Fig. 11.5 Biplots for the Grignolino samples: the classical PCA solution is shown on the left, whereas the right plot is based on the MCD covariance estimate
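The singularity of the covariance matrix for fat data matrices, mentioned above as the obstacle for MCD, can be checked numerically. The sketch below uses simulated data; the name X.fat and the dimensions (5 samples, 10 variables) are arbitrary choices for the illustration.

```r
set.seed(1)
# Simulated "fat" matrix: fewer samples (rows) than variables (columns)
X.fat <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)

S <- cov(X.fat)        # 10 x 10 covariance matrix
rank.S <- qr(S)$rank   # numerical rank, cannot reach 10 with only 5 samples
det.S <- det(S)        # numerically zero

print(c(rank = rank.S, det = det.S))
```

With a determinant of zero for every subset, minimising the determinant no longer distinguishes between subsets, which is why a dimension reduction step is needed first.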
Since the details of the ROBPCA algorithm are a lot more complicated than can be treated here, we just illustrate its use. ROBPCA, as well as several other robust versions of PCA, is available in package rrcov as the function PcaHubert. Application to the Grignolino samples using five PCs leads to the following result:

> X.HubPCA5 <- PcaHubert(X.sc, k = 5)
> summary(X.HubPCA5)

Call:
PcaHubert(x = X.sc, k = 5)
Importance of components:
                         PC1   PC2   PC3   PC4   PC5
Standard deviation     1.766 1.502 1.217 1.039 0.923
Proportion of Variance 0.355 0.257 0.169 0.123 0.097
Cumulative Proportion  0.355 0.612 0.780 0.903 1.000
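As a quick check on how the proportions in this summary are normalised, they can be recomputed from the printed standard deviations alone. The sketch below only uses the rounded numbers shown in the output; the variable names are made up for the illustration.

```r
# Standard deviations copied from the summary output (rounded values)
sdev5 <- c(PC1 = 1.766, PC2 = 1.502, PC3 = 1.217, PC4 = 1.039, PC5 = 0.923)

# Variances relative to the sum over the five retained components only
prop.var <- sdev5^2 / sum(sdev5^2)
cum.prop <- cumsum(prop.var)

print(round(rbind(prop.var, cum.prop), 3))
```

The result agrees with the printed proportions up to rounding, so the denominator is the variance captured by the five-component model rather than the total variance of the data.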
Note that the final line gives the cumulative proportion of variance as a fraction of the variance captured in the robust PCA model, and not as a fraction of the total variance, as is usual in classical PCA. If we do not provide an explicit number of components (the default, k = 0) the algorithm chooses the optimal number itself:

> X.HubPCA <- PcaHubert(X.sc)
> summary(X.HubPCA)

Call:
PcaHubert(x = X.sc)
Importance of components:
                         PC1   PC2   PC3   PC4    PC5    PC6    PC7
Standard deviation     1.751 1.537 1.276 1.125 0.9862 0.8695 0.6062
Proportion of Variance 0.294 0.227 0.156 0.121 0.0934 0.0726 0.0353
Cumulative Proportion  0.294 0.521 0.677 0.799 0.8922 0.9647 1.0000

Apparently this optimal number equals seven in this case. The rule of thumb to calculate the "optimal" number of components is based on the desire to explain a significant portion of the variance explained by the model (a fraction of 0.8 is used as the default) while not taking into account components with very small standard deviations: the last component of the model should have an eigenvalue of at least 0.1% of the largest one. If the number of variables is small enough, the MCD algorithm is used directly; if not, the ROBPCA algorithm is used. One can force the use of ROBPCA by setting mcd = FALSE. Note that the standard deviations of the first components are not the same as the ones calculated for the five-component model.
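The two criteria in the rule of thumb can be written down in a few lines. What follows is a simplified sketch of the idea only, not the code used inside PcaHubert: the function choose_k, its arguments, the way the two criteria are combined and the eigenvalues are all assumptions made for the illustration.

```r
# Hypothetical helper: pick the smallest number of components that explains
# at least `frac` of the variance, never including components whose
# eigenvalue is below `min_ratio` times the largest one
choose_k <- function(eigenvalues, frac = 0.8, min_ratio = 1e-3) {
  ev <- sort(eigenvalues, decreasing = TRUE)
  k.max <- sum(ev / ev[1] >= min_ratio)      # components that are large enough
  cum.frac <- cumsum(ev) / sum(ev)           # cumulative fraction of variance
  k.frac <- which(cum.frac >= frac)[1]       # first k reaching the fraction
  min(k.frac, k.max)
}

# Made-up eigenvalues, not taken from the wine data
ev.toy <- c(4, 2, 1, 0.5, 0.3, 0.1, 0.001)
k.toy <- choose_k(ev.toy)
print(k.toy)
```

Raising frac or lowering min_ratio in this sketch can only lead to the same or a larger number of components.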
The default plotting method is different from the classical plot: it shows an outlier map, or distance-distance map, rather than scores or loadings. The main idea of this plot is to characterise every sample by two different distances:
• the Orthogonal Distance (OD), indicating the distance between the true position of every data point and its projection in the space of the first few PCs;
• the Score Distance (SD), or the distance of the sample projection to the center of all sample projections.
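The two distances can be illustrated with ordinary (non-robust) PCA from prcomp. This is a sketch under stated assumptions: the helper pca_distances is hypothetical, the data are simulated, and the score distance is assumed to be scaled by the standard deviation of each component (a Mahalanobis-type distance in the score space). A robust method would replace the center, loadings and eigenvalues by robust estimates.

```r
# Hypothetical helper: orthogonal and score distances for a k-component
# classical PCA model
pca_distances <- function(X, k) {
  pc <- prcomp(X, center = TRUE, scale. = FALSE)
  idx <- seq_len(k)
  scores <- pc$x[, idx, drop = FALSE]
  loadings <- pc$rotation[, idx, drop = FALSE]

  # Orthogonal distance: length of the part of each centered sample
  # that is not reproduced by the first k components
  X.centered <- sweep(X, 2, pc$center)
  residuals <- X.centered - scores %*% t(loadings)
  orth_dist <- sqrt(rowSums(residuals^2))

  # Score distance: distance to the center within the PC space,
  # with every score divided by the standard deviation of its component
  score_dist <- sqrt(rowSums(sweep(scores^2, 2, pc$sdev[idx]^2, "/")))

  list(score_dist = score_dist, orth_dist = orth_dist)
}

set.seed(1)
X.sim <- matrix(rnorm(60), ncol = 3)   # simulated data: 20 samples, 3 variables
d2 <- pca_distances(X.sim, k = 2)
d3 <- pca_distances(X.sim, k = 3)      # full model: nothing is left over
```

Plotting d2$orth_dist against d2$score_dist gives a classical analogue of the outlier map; with all three components retained the orthogonal distances vanish.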