Multiple Imputation
Obviously we would like to assess how much our imputed values influence the scores. One way of doing this is to impute multiple times, and plot the scores as a much bigger point cloud. For the imputations, a more elaborate mechanism is needed than simply taking the mean or smallest value per column (repeating such an action would not be very informative). Function MIPCA from package missMDA (Josse and Husson 2016) provides several possible strategies. The default is to start with the imputed matrix from the iterative PCA algorithm in the previous paragraph. Parametric bootstrapping is used to sample residuals (assuming a normal distribution around zero, with the standard deviation given by the empirical standard deviation of the PCA residuals), which are used to generate (by default one hundred) bootstrap samples. Each of these bootstrap samples is then subjected to PCA. Finally, so-called Procrustes analysis is used to rotate all PCA solutions in such a way that the data are maximally overlapping (Josse et al. 2011).
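The bootstrap idea can be illustrated with a few lines of base R. This is only a rough sketch of the principle, not the missMDA implementation: the toy matrix, the `miss` mask and all object names are hypothetical, only the cells marked as missing are redrawn (an assumption of this sketch), and the Procrustes rotation is replaced by simply taking absolute values of the loadings to remove the sign ambiguity.

```r
set.seed(42)                       # for reproducibility of the toy example

# Hypothetical data: a 50 x 6 matrix with rank-2 structure plus noise
n <- 50; p <- 6; ncp <- 2
scores <- matrix(rnorm(n * ncp), n, ncp)
loads  <- matrix(rnorm(p * ncp), p, ncp)
X <- scores %*% t(loads) + matrix(rnorm(n * p, sd = 0.3), n, p)

# Pretend ten cells in column 3 were missing and have already been imputed
miss <- matrix(FALSE, n, p)
miss[sample(n, 10), 3] <- TRUE

# PCA model of the (imputed) matrix: rank-ncp reconstruction and residuals
Xc    <- sweep(X, 2, colMeans(X))
sv    <- svd(Xc)
Xhat  <- sv$u[, 1:ncp] %*% diag(sv$d[1:ncp]) %*% t(sv$v[, 1:ncp])
sigma <- sd(as.vector(Xc - Xhat))  # empirical sd of the PCA residuals

# Parametric bootstrap: redraw the imputed cells as fitted value + normal noise,
# then perform PCA on every bootstrap sample
B <- 100
boot.loadings <- lapply(seq_len(B), function(b) {
  Xb <- Xc
  Xb[miss] <- Xhat[miss] + rnorm(sum(miss), mean = 0, sd = sigma)
  prcomp(Xb)$rotation[, 1:ncp]
})

# Spread of the PC1 loadings over the bootstrap sets, per variable
# (absolute values, because the sign of a PC is arbitrary)
pc1 <- sapply(boot.loadings, function(L) abs(L[, 1]))
round(apply(pc1, 1, sd), 4)
```

The last line gives, for each variable, the standard deviation of its PC1 loading across the bootstrap sets, a numerical counterpart of the point clouds that MIPCA plots.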
To see how the method is applied we concentrate on the first twenty columns, where the following columns contain missing values:

> rownames(X.ara.l) <- rep("", nrow(X.ara.l))
> colnames(X.ara.l) <- paste("V", 1:ncol(X.ara.l), sep = "")
> countNAs <- apply(X.ara.l[, 1:20], 2, function(x) sum(is.na(x)))
> countNAs[countNAs > 0]
 V6  V8 V10 V13 V15 V16 V17 V19 V20
220 259 260 223 139 252  16 246   3

We would expect most variability in the variables containing many missing values. Let’s apply MIPCA:

> ara.PCA.Minput <- MIPCA(X.ara.l[, 1:20], ncp = 2, scale = TRUE)

Fig. 11.4 Multiple imputation for the first twenty columns in the log-scaled arabidopsis data. The plot on the left shows that in each bootstrap sample the PCs are defined in a very similar way. The loading plot on the right shows point clouds for variables containing missing values; again the effect of the imputation seems limited here
The result is an object of class MIPCA which allows a variety of plots. Here we show two of them in Fig. 11.4, created by:

> plot(ara.PCA.Minput, choice = "dim", new.plot = FALSE)
> plot(ara.PCA.Minput, choice = "var", new.plot = FALSE)
The first one shows the variability in the definition of the principal components in the individual bootstrap sets. Clearly, there is little variation: both PC1 and PC2 remain very much in the same position. The second plot, a loading plot, confirms this: for those variables containing missing values a small point cloud can be seen around the loading arrow, indicating the variability of the estimate due to bootstrap sets. For variables without missing values, no variability is observed.
11.2 Outlier Detection with Robust PCA
Identifying outliers, i.e., samples that do not conform to the general structure of the data, is a difficult and dangerous task, prone to subjective judgements. Once one has detected an outlier, or even several outliers, the question is what to do: should one remove these before further analysis, downweight their importance in some way, or simply leave them as they are and pay special attention to things like residuals? All are sensible strategies, and could be valid choices depending on the question at hand and the data available. Generally one is advised not to remove outliers unless there are very good reasons to do so. Very often such reasons need to include meta-information: deviating numbers in themselves may not be enough reason to remove outlying observations, but if one also knows that there was a power cut in the lab just before that particular measurement, or that something else happened that may be related to this sample, then it may be easier to decide not to take this particular record into account.
At the same time, it is important to realize that outlying samples will occur in practice, even if everything seems to have gone according to plan in the lab: whole microarrays with expressions of tens of thousands of genes can be useless because of some experimental artifact, and including them could be detrimental to the results. One of the problems is that if several outliers are present, they may make each other seem “normal”, an effect that is called masking. Additionally, high-dimensional space, as we know, is mostly empty, and every object of a small-to-medium-sized data set can be seen as an outlier. Only if we can assume that the samples occupy a restricted subspace may we hope to perform meaningful outlier detection.
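The masking effect is easy to reproduce in one dimension. The following is a small illustration with made-up numbers; the `zscore` helper and the cutoff of 2.5 are arbitrary choices for this sketch and do not come from any package.

```r
# Hypothetical measurements: eight regular values plus one or two gross errors
regular      <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3)
one.outlier  <- c(regular, 10.0, 26)
two.outliers <- c(regular, 25, 26)

# Classical z-scores: distance from the mean in units of the standard deviation
zscore <- function(x) abs(x - mean(x)) / sd(x)

cutoff <- 2.5                          # arbitrary cutoff, for illustration only
which(zscore(one.outlier) > cutoff)    # the single outlier (element 10) is flagged
which(zscore(two.outliers) > cutoff)   # nothing is flagged
```

With two deviating values, both the mean and the standard deviation are inflated so much that neither value exceeds the cutoff: the outliers mask each other.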
The area of robust statistics (Maronna et al. 2005) is a rich and flourishing field in which methods are studied that are less affected by individual values and will yield consistent results even in the presence of a sizeable fraction of outliers. A typical example of a robust location estimator is the median. Its value will not change if all data points above the median are suddenly ten units higher, or multiplied by a factor of one thousand. It is said to have a breakdown point of 0.5, meaning that half of the data can be “wrong” without affecting the estimate. Higher breakdown points than 0.5 obviously do not make too much sense. Note that the average as an estimator of location has a breakdown point of 0.0: any change to the measurements will lead to a different result. Many classical estimators have robust counterparts that typically rely on fewer assumptions. The price to pay is usually a lower accuracy or a loss of power: typically one would need more samples to obtain comparable results. Robust methods therefore decrease the influence of outlying observations – interestingly, this makes them also very suited to identify these observations in the first place.
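Both points can be checked with a few lines of R. The numbers below are made up for illustration. The second half uses the median absolute deviation (function mad from the stats package) as a robust replacement for the standard deviation; this estimator is not discussed in the text and is only one of several possible choices, as is the cutoff of 2.5.

```r
# Hypothetical data: multiply everything above the median by one thousand
x <- c(2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3)
upper <- x > median(x)
x.bad <- x
x.bad[upper] <- 1000 * x[upper]

c(median(x), median(x.bad))   # the median does not move
c(mean(x), mean(x.bad))       # the mean is dragged along

# Outlier scores for data with two gross errors (elements 9 and 10)
y <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25, 26)
classical <- abs(y - mean(y)) / sd(y)      # mean and standard deviation
robust    <- abs(y - median(y)) / mad(y)   # median and MAD

which(classical > 2.5)        # no element exceeds the cutoff
which(robust > 2.5)           # elements 9 and 10 stand out
```

Because the median and the MAD are hardly influenced by the two deviating values, the robust scores of exactly these values become very large, which is what makes robust estimates useful for outlier identification.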