Chemometric Applications
This chapter highlights some typical examples of research themes in the chemometrics community. Up to now we have concentrated on fairly general techniques, found in many textbooks and applicable in a wide range of fields. The topics in this chapter are more specific to the field of chemometrics, combining elements from the previous chapters. In particular, latent-variable approaches like PCA and PLS exhibit a wide range of applications (some people have criticized the field of chemometrics for being too preoccupied with latent-variable methods, and not without reason; on the other hand, such tools are extremely handy in many different situations).
To start with, we come back to the problem of missing values. Hard to avoid in many real-life applications, they often prevent the standard application of statistical methods. One example is PCA: the svd-based implementation in Chap. 4 does not allow missing values. We will discuss a couple of alternatives, e.g., replacing the missing values with estimated values. Ironically, PCA is one of the methods that can be used to obtain these estimates...
Another form of PCA, robust PCA, is an attractive method to identify outliers in multivariate space, at modest computational cost. It is very often a good idea to check whether a robust alternative (if it exists) leads to results that are close to what one sees in the main analysis: if that is not the case, one should really try to find the cause(s) of the differences and then decide what to do. Robust estimates also play a role in the next topic, statistical process control, which is very important in industrial applications. Again, multivariate approaches based on distances or dimension reduction firmly place this topic in the chemometrics area.
Continuing the theme of finding ways to combat flaws in our data, Orthogonal Signal Correction and its combination with PLS, OPLS, provide ways to remove irrelevant variation in the data (irrelevant for prediction purposes, that is). In some cases this leads to simpler models that are easier to interpret. In analytical laboratories, there is often a need to develop calibration models that can be transferred across a range of instruments. One example is to develop a model using a high-quality laboratory setup, and then to apply the model to in-line measurements of much lower quality. The approach to achieve this has become known as calibration transfer.
Finally, we take a look at a decomposition of a matrix X where the individual components are directly interpretable, e.g., as concentration profiles or spectra of pure compounds: Multivariate Curve Resolution.
11.1 PCA in the Presence of Missing Values
Real-life data sets nearly always contain missing values, i.e., data points for which no value has been recorded. Data analyses often cannot handle these missing values, and the regular approach is to replace the missing values with some hopefully appropriate estimate, and do the analysis on the completed data. This process is usually referred to as imputation, and is often repeated many times (multiple imputation, using different imputed values) to decrease the influence of the artificial values. In analytical chemical applications, a common cause for missing values is the detection or quantification limit of the measurement device: concentrations may simply be too low to lead to a measurable response. In other cases, values may be missing because of non-responses in surveys, errors in data processing (e.g., misalignments in LCMS data), temporary breakdown of a sensor, or simply because of some random event—there are countless possible reasons why a data point is missing.
That does not mean it is not important to think about reasons for missingness; in fact, ignoring this is dangerous and can easily lead to false conclusions. Missing values caused by measurements below the detection limit form a good example. We know that in these cases the true but unknown value should be somewhere between zero and the limit of detection. That is, even the missing values contain information. The simplest possible approach in such a case would be to pick a value somewhere in the middle and use that instead of all missing values. In many applications, such an approach is too simple, and leads to an awkward peak around the imputed value in the distribution of the variable. A better strategy is to estimate the parameters of the distribution of a variable on the basis of its non-missing values, and then draw randomly from that distribution to complete the data set (Uh et al. 2008). Again, this can be done multiple times, making it possible to assess the effect of the imputed values on the analysis. An overview of many different ways of imputing data can be found, e.g., in Little and Rubin (2019).
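The draw-from-a-fitted-distribution idea can be illustrated with a minimal base-R sketch. Everything in it is an assumption made for illustration: the data are simulated, the concentrations are taken to be log-normal, the limit of detection `lod` is hypothetical, and the helper names `negLogLik` and `imputeBelowLOD` are made up. It is not the procedure of Uh et al. (2008), only one simple way to realize the same idea.

```r
set.seed(1)
lod <- 1.0                                        # hypothetical limit of detection
truth <- rlnorm(200, meanlog = 0.5, sdlog = 0.8)  # simulated concentrations
x <- ifelse(truth < lod, NA, truth)               # values below the LOD go missing

# Censored negative log-likelihood for a log-normal model: observed values
# contribute their density, each missing value contributes P(X < lod)
negLogLik <- function(par, x, lod) {
  mu <- par[1]
  sigma <- exp(par[2])                            # log scale keeps sigma positive
  obs <- x[!is.na(x)]
  -(sum(dlnorm(obs, mu, sigma, log = TRUE)) +
      sum(is.na(x)) * plnorm(lod, mu, sigma, log.p = TRUE))
}
fit <- optim(c(0, 0), function(par) negLogLik(par, x, lod))
mu <- fit$par[1]
sigma <- exp(fit$par[2])

# Draw imputed values from the fitted distribution, truncated to (0, lod),
# using inverse-CDF sampling
imputeBelowLOD <- function(x, mu, sigma, lod) {
  nMiss <- sum(is.na(x))
  u <- runif(nMiss, 0, plnorm(lod, mu, sigma))
  x[is.na(x)] <- qlnorm(u, mu, sigma)
  x
}
x.imp <- imputeBelowLOD(x, mu, sigma, lod)
```

Calling `imputeBelowLOD` repeatedly gives different completed data sets, which is the basis for assessing how much the imputed values influence a subsequent analysis.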
In case there is reason to believe that the data are missing completely at random (a term so important in the field that the acronym MCAR is often used) life becomes simpler. MCAR means that there is no relation between the values of the data and the missingness status. This is clearly not the case for the detection limit example. The term missing at random (MAR) is also used. Here the missingness again does not depend on the unobserved values themselves but, in contrast to MCAR, it may depend on non-missing values. As a hypothetical example, hospital lab tests may show more missing values for obese patients than for patients with a lower BMI. This obviously makes it important to take these dependencies into account when imputing. The final category is missing not at random (MNAR), e.g., corresponding to the measurements below the detection limit: here the mechanism leading to missingness needs to be taken into account in the analysis.
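The three mechanisms can be made concrete with a small simulation. The sketch below uses made-up data: `bmi` plays the role of a fully observed covariate, `y` is a hypothetical lab value, and the missingness probabilities are arbitrary choices for illustration.

```r
set.seed(2)
n <- 1000
bmi <- rnorm(n, mean = 27, sd = 4)        # fully observed covariate
y <- 5 + 0.1 * bmi + rnorm(n)             # lab value that will get missing entries

# MCAR: every value has the same 20% chance of being missing
y.mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: the missingness probability depends on the observed bmi only
pMiss <- plogis((bmi - 27) / 2 - 1.5)
y.mar <- ifelse(runif(n) < pMiss, NA, y)

# MNAR: missingness depends on the unobserved value itself
# (here: the lowest 20% of the values are not recorded)
y.mnar <- ifelse(y < quantile(y, 0.2), NA, y)

# Means of the available values under each mechanism
sapply(list(complete = y, MCAR = y.mcar, MAR = y.mar, MNAR = y.mnar),
       mean, na.rm = TRUE)
```

Under MNAR the mean of the available values is necessarily too high in this setup, since only low values are removed. Under MAR the bias can in principle be corrected by taking `bmi` into account.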
Many methods are available to handle missing values in a general context (Little and Rubin 2019). To name just two: the analysis may be based on only the complete cases, which may work well when the number of missing values is limited. It does run the risk of strongly biased results. Alternatively, missing values may be replaced by adequate values such as means, or estimated using methods like regression or the EM algorithm mentioned in the context of model-based clustering, described in Sect. 6.3; there, the unknown class label is basically treated as a missing value. Several R packages are available for more general situations. The mice package (van Buuren and Groothuis-Oudshoorn 2011), for example, assumes that data are MAR, and uses regressions to estimate missing values from the other variables. The name stands for Multivariate Imputation by Chained Equations. Categorical as well as numerical values are allowed. Another package for multiple imputation, Amelia (named after Amelia Earhart, the first woman to fly across the Atlantic Ocean solo, who went missing over the Pacific in 1937), also assumes MAR data but in addition assumes multivariate normal data (Honaker et al. 2011).
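The two simplest options, complete-case analysis and mean imputation, need nothing beyond base R. The following sketch uses a simulated matrix with values removed completely at random; the object names are chosen for illustration only, and the packages mentioned above are not used.

```r
set.seed(3)
X <- matrix(rnorm(100 * 5), ncol = 5)
X[sample(length(X), 25)] <- NA            # 5% of the entries, MCAR

# Option 1: complete-case analysis, keeping only rows without any NA
X.cc <- X[complete.cases(X), , drop = FALSE]

# Option 2: replace each missing value by the mean of its column
X.mean <- apply(X, 2, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# Number of rows available for analysis in both cases
c(rows.complete = nrow(X.cc), rows.imputed = nrow(X.mean))

# Column standard deviations before and after mean imputation
rbind(available = apply(X, 2, sd, na.rm = TRUE),
      imputed = apply(X.mean, 2, sd))
```

Mean imputation keeps all rows, but it can only shrink the standard deviation of a column, since the imputed values add nothing to the sum of squares while the number of observations goes up. This is one reason why drawing imputed values from a distribution, possibly several times, is often preferred.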
In the remainder of this section we will focus on a couple of ways to perform PCA in the presence of missing data. First of all, we could sacrifice some of the functionality of PCA and use simple tricks allowing us to use the incomplete matrix anyway. Second, we could replace the missing values by something that makes sense, in the case of MCAR data perhaps a mean value. In all cases, it is probably wise to eliminate rows or columns that contain too many missing values; finding an optimal cutoff here is a trial-and-error process which will depend strongly on the application.
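Both routes can be sketched in a few lines of base R. The example below is an illustration on simulated data, not necessarily the approach followed in the rest of this section. For the first route it assumes that a covariance matrix built from pairwise complete observations is an acceptable "simple trick", and for the second it uses column means as imputed values.

```r
set.seed(4)
scoresTrue <- matrix(rnorm(50 * 2), ncol = 2)
loadTrue <- matrix(rnorm(2 * 6), nrow = 2)
X <- scoresTrue %*% loadTrue + matrix(rnorm(50 * 6, sd = 0.1), ncol = 6)
X[sample(length(X), 15)] <- NA            # 5% missing, MCAR

# Route 1: covariance matrix from pairwise complete observations, followed
# by an eigendecomposition. This gives loadings and variances, but no
# scores for rows containing NAs.
C <- cov(X, use = "pairwise.complete.obs")
eig <- eigen(C, symmetric = TRUE)
loadings.pw <- eig$vectors[, 1:2]

# Route 2: column-mean imputation, then the usual svd-based PCA
X.imp <- apply(X, 2, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
X.svd <- svd(scale(X.imp, scale = FALSE))
loadings.imp <- X.svd$v[, 1:2]
scores.imp <- X.svd$u[, 1:2] %*% diag(X.svd$d[1:2])

# Cosines of the principal angles between the two 2-D loading subspaces;
# values close to 1 indicate that both routes find nearly the same space
svd(crossprod(loadings.pw, loadings.imp))$d
```

Note that a covariance matrix assembled from pairwise complete observations is not guaranteed to be positive semidefinite, so small negative eigenvalues may show up in `eig$values`.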
Let’s look at the arabidopsis data from ChemometricsWithR, an LC-MS-based metabolomics data set on a number of samples of Arabidopsis thaliana, a popular model organism in plant sciences. As usual in this kind of data, many values are missing; the total number per variable is shown in Fig. 11.1. In the following, we will only retain variables with less than 40% missing values:
> data(arabidopsis)
> naLimitPerc <- 40
> naLimit <- floor(nrow(arabidopsis) * naLimitPerc / 100)
> nNA <- apply(arabidopsis, 2, function(x) sum(is.na(x)))
> naIdx <- which(nNA < naLimit)
> X.ara <- arabidopsis[, naIdx]
This leads to a matrix containing 249 columns, less than half of the number of original variables. We expect a large majority of missing values to be caused by metabolites being too low in concentration to be measured.