Ignoring the Missing Values

The first approach is simply to ignore the missing values. This can be done by calculating the covariance or correlation matrix with the argument use = "pairwise.complete.obs". Note that this (incorrectly!) assumes that the missing data are MCAR: one pretends that the correlations or covariances calculated from the subset of points that is observed do not differ, on average, from the values that would be calculated from the full matrix. As long as there are enough cases for which pairwise data are available, this will lead to a square matrix without any NA values from which scores or loadings can be derived, as explained in Sect. 4.2. A similar result would be obtained by calculating distances using only pairwise complete observations and then doing PCA on the distance matrix (PCoA, see Sect. 4.6.1).
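To make the effect of use = "pairwise.complete.obs" concrete, here is a minimal sketch on a made-up toy matrix (the object names X, cov.default, cov.pairwise and cov.ab.manual are ours, not part of the arabidopsis analysis). Each entry of the pairwise-complete covariance matrix is computed from only those rows in which both variables are observed:

```r
# Toy data: 5 samples (rows) by 3 variables (columns), with two missing values
X <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(2, NA, 6, 8, 10),
           c = c(5, 3, NA, 1, 0))

# Default behaviour: an NA in either column of a pair propagates to that entry
cov.default <- cov(X)

# Pairwise complete: each entry uses the rows where both columns are observed
cov.pairwise <- cov(X, use = "pairwise.complete.obs")

# Check one entry by hand: covariance of a and b over the rows where b is present
rows.ab <- !is.na(X[, "a"]) & !is.na(X[, "b"])
cov.ab.manual <- cov(X[rows.ab, "a"], X[rows.ab, "b"])

sum(is.na(cov.default))    # some entries are NA
sum(is.na(cov.pairwise))   # no NA entries left
```

Note that different entries of cov.pairwise are based on different subsets of rows, which is exactly where the MCAR assumption enters.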

Let’s see how this works out for the arabidopsis data. First we need to decide on the scaling. Since the intensities are basically counts, we use log scaling, more or less the default in MS-based metabolomics data. Beyond that, several other choices could be relevant here, such as Pareto scaling. For the sake of demonstration we will proceed with the most common choice, autoscaling, which gives each variable equal weight in the PCA:

> X.ara.l <- log(X.ara)
> X.ara.l.sc <- scale(X.ara.l)
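As a small illustration of what log scaling followed by autoscaling does when NAs are present, consider the following sketch on invented intensities (the matrix X.toy is hypothetical). The scale function computes column means and standard deviations from the observed values only, and leaves the NAs in place:

```r
# Hypothetical intensities for 4 samples and 2 metabolites, with missing values
X.toy <- cbind(m1 = c(100, 1000, NA, 10),
               m2 = c(5, NA, 50, 500))

# Log transformation followed by autoscaling (mean 0, sd 1 per column)
X.toy.l.sc <- scale(log(X.toy))

# Column means and standard deviations of the observed values after scaling
toy.means <- colMeans(X.toy.l.sc, na.rm = TRUE)
toy.sds <- apply(X.toy.l.sc, 2, sd, na.rm = TRUE)

# The pattern of missing values is unchanged
identical(is.na(X.toy.l.sc), is.na(X.toy))
```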

Next we calculate the covariance matrix (which is equal to the correlation matrix here, because of the scaling we applied), check that the number of NA values is zero, and run svd:

> X.ara.cov <- cov(t(X.ara.l.sc), use = "pairwise.complete.obs")
> sum(is.na(X.ara.cov))
[1] 0
> X.ara.svd <- svd(X.ara.cov, nu = 2, nv = 2)
> ara.PCA.svd <-
+   structure(
+     list(scores = X.ara.svd$u %*% diag(sqrt(X.ara.svd$d[1:2])),
+          var = X.ara.svd$d,
+          totalvar = sum(X.ara.svd$d),
+          centered.data = TRUE),
+     class = "PCA")

Fig. 11.2 PCA scoreplots from the arabidopsis data set using three different ways of handling missing data

Since the goal here is visualization, we limit the number of singular vectors to be calculated to two. The result is stored as an object of class PCA so that we can use the scoreplot.PCA function, leading to the left panel in Fig. 11.2:

> scoreplot(ara.PCA.svd, main = "PCA using cov")

The two-component model explains a reasonable amount of variance, given that the data matrix has 249 columns. Some structure seems to be visible, especially along the first axis.
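Why can scores be obtained from the svd of a square sample-by-sample matrix, as u multiplied by the square roots of d? The following sketch shows the principle on complete, simulated data. As a simplifying assumption it uses the plain cross-product matrix from tcrossprod rather than cov, so that the correspondence with an ordinary PCA is exact; with cov an additional centering and a constant scale factor come into play. All object names are ours:

```r
set.seed(1)
# Column-centred toy data: 6 samples, 4 variables, no missing values
X <- scale(matrix(rnorm(6 * 4), nrow = 6), scale = FALSE)

# Reference: PCA scores from an SVD of the data matrix itself
X.svd <- svd(X)
scores.ref <- X.svd$u[, 1:2] %*% diag(X.svd$d[1:2])

# The same scores from the sample-by-sample cross-product matrix X X^T:
# its singular values are the squares of those of X
XXt.svd <- svd(tcrossprod(X), nu = 2, nv = 2)
scores.xxt <- XXt.svd$u %*% diag(sqrt(XXt.svd$d[1:2]))

# The two sets of scores agree up to an arbitrary sign flip per component
max.diff <- max(abs(abs(scores.ref) - abs(scores.xxt)))
```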

11.1.2 Single Imputation

We already discussed that the method in the previous section assumes an MCAR regime, which is unlikely for the current situation. Rather, we expect most of the values to be missing because of low concentrations. Replacing NA values with the smallest number in the column would therefore seem a more sensible idea:

> X.ara.imput1 <-
+   apply(X.ara, 2,
+         function(x) {
+           x[is.na(x)] <- min(x, na.rm = TRUE)
+           x
+         })
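The same column-minimum imputation can be tried on a small made-up matrix where the result is easy to verify by eye. The names X.toy, impute.colmin and X.toy.imp are hypothetical and only serve this sketch:

```r
# Toy matrix: 4 samples, 3 variables; m3 is complete
X.toy <- cbind(m1 = c(10, NA, 30, 20),
               m2 = c(NA, 5, 7, NA),
               m3 = c(1, 2, 3, 4))

# Replace every NA in a column by the smallest observed value in that column
impute.colmin <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE)
  x   # return the completed column, not just the value that was assigned
}

X.toy.imp <- apply(X.toy, 2, impute.colmin)

sum(is.na(X.toy.imp))   # no missing values remain
X.toy.imp               # m1 gets 10, m2 gets 5 at the former NA positions
```

The explicit final x in the function body matters: the value of an assignment expression is the assigned value, so a function that ends with the assignment would return a single number per column rather than the completed column.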

It is easy to check that the number of NAs after imputation is zero. Now that we have completed our matrix we can proceed using the standard PCA approach. Note that we perform autoscaling only after having done the imputation: obviously, the imputed values will affect column means and standard deviations. The resulting scoreplot is shown in the middle panel of Fig. 11.2:

> ara.PCA.minimputation <- PCA(scale(log(X.ara.imput1)))

We see the same structure with two or three clusters as in the previous case, but now rotated by something like 45°. The percentage of variance explained by the first PC in particular is much higher than in the case ignoring the missing values altogether.

Interestingly, a more elaborate version of single imputation can be done by PCA methods. One would start by imputing random values, perform a PCA, and replace the values at the locations of the NAs with the values predicted by the PCA model. This process is iterated until some convergence threshold is met. In this way, correlation structure is taken into account. Note that the number of PCs is again a parameter that needs to be set: in this case, models with two PCs are no longer subsets of models with more PCs, so one has to explicitly calculate the results for all dimensionalities. One function implementing this is imputePCA from the missMDA package (Josse and Husson 2016). Let’s see how things go when we use two dimensions:

> X.ara.pcaimput <- imputePCA(X.ara.l, ncp = 2)$completeObs

The PCA scoreplot based on the PCA imputation is shown in the right panel of Fig. 11.2:

> ara.PCA.pcaimputation <- PCA(scale(X.ara.pcaimput))

It is very similar to the one obtained by imputing the missing values with column minima.

In the PCA-based imputation we have used log-scaled data, under the assumption that the data after log-transformation are perhaps a little bit more regular. For the column-minimum imputation this did not matter, since we were taking the smallest value for each column. A very natural question is: what are the imputed values? Figure 11.3 shows the histograms. It is clear that the PCA-imputed values cover a much wider range than the column minima. Still, they are at the lower end of the data range.
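The iterative scheme behind PCA-based imputation can be sketched in a few lines of base R. This is a bare-bones illustration and not the algorithm implemented in imputePCA: the function name iterative.pca.impute is hypothetical, there is no regularization, and as a simplifying assumption the starting values are column means rather than random numbers, which makes the result reproducible:

```r
iterative.pca.impute <- function(X, ncp = 2, tol = 1e-6, maxit = 500) {
  miss <- is.na(X)
  # Initial fill: column means of the observed values (our assumption)
  col.means <- colMeans(X, na.rm = TRUE)
  X.hat <- X
  X.hat[miss] <- col.means[col(X)[miss]]

  for (it in seq_len(maxit)) {
    # Rank-ncp PCA reconstruction of the centred, currently completed matrix
    mu <- colMeans(X.hat)
    Xc <- sweep(X.hat, 2, mu)
    s <- svd(Xc, nu = ncp, nv = ncp)
    fit <- s$u %*% diag(s$d[seq_len(ncp)], ncp) %*% t(s$v)
    fit <- sweep(fit, 2, mu, "+")

    # Update only the missing cells; observed values are never changed
    change <- sum((X.hat[miss] - fit[miss])^2)
    X.hat[miss] <- fit[miss]
    if (change < tol) break
  }
  X.hat
}

# Usage on simulated data with an underlying two-dimensional structure
set.seed(42)
X.full <- matrix(rnorm(20 * 2), 20) %*% matrix(rnorm(2 * 6), 2) +
  matrix(rnorm(20 * 6, sd = 0.01), 20)
X.miss <- X.full
X.miss[sample(length(X.miss), 10)] <- NA
X.completed <- iterative.pca.impute(X.miss, ncp = 2)
```

Because the reconstruction depends on ncp in every iteration, the completed matrix for two PCs cannot be derived from the one for three PCs, in line with the remark that each dimensionality has to be calculated explicitly.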
