PCA with Optimal Scaling and Regularization
Principal Components Analysis (PCA) serves many purposes; among others, it is used for dimension reduction, visualization, scale construction, and noise reduction. Depending on the application, one of these goals is the focus of the analysis and drives the considerations involved, such as the dimensionality of the solution to choose. The application of PCA in the social sciences is limited because the data collected in this field are often not strictly real-valued. Demographic data, for example, are often categorical, that is, data without a natural ordering or distance between categories. The often-used Likert scales yield ranking scores, that is, scores with a natural ordering but not necessarily equal distances between subsequent values. Homogeneity Analysis, also known as Multiple Correspondence Analysis (MCA), allows dimension reduction of these types of data. The extension of PCA with Optimal Scaling will be called OS-PCA. OS-PCA allows us to analyse categorical data in a PCA fashion. Categorical data are optimally scaled (transformed) such that the lower-dimensional representation given by OS-PCA contains as much information about the correlational structure of the transformed data as possible. The transformations also allow us to look at nonlinear relationships between the variables, or between the variables and the principal components.
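To make the idea of optimal scaling concrete, the following is a minimal sketch, not the project's implementation. It assumes that all variables are nominal and integer-coded, and it uses a plain alternating least squares scheme: a rank-p PCA of the currently quantified data alternates with re-quantifying each category as the mean of the PCA approximation over the objects in that category. The function names (standardize, vaf, os_pca_nominal) and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def standardize(col):
    """Center a column and scale it to unit variance (left centered if constant)."""
    col = col - col.mean()
    sd = col.std()
    return col / sd if sd > 1e-12 else col

def vaf(Q, p):
    """Proportion of variance accounted for by the first p principal components."""
    s = np.linalg.svd(Q, compute_uv=False)
    return float((s[:p] ** 2).sum() / (s ** 2).sum())

def os_pca_nominal(codes, p=2, n_iter=50):
    """Illustrative ALS sketch of PCA with optimal scaling for nominal variables.

    codes : (n, m) integer array of category codes (an assumption of this sketch).
    Returns the quantified data, the component loadings, and the VAF.
    """
    codes = np.asarray(codes)
    n, m = codes.shape
    # Start from the standardized raw codes as initial quantifications.
    Q = np.column_stack([standardize(codes[:, j].astype(float)) for j in range(m)])
    for _ in range(n_iter):
        # Step 1: best rank-p approximation of the current quantified data.
        U, s, Vt = np.linalg.svd(Q, full_matrices=False)
        target = (U[:, :p] * s[:p]) @ Vt[:p]
        # Step 2: re-quantify each category as the mean of its target values,
        # then restore zero mean and unit variance.
        for j in range(m):
            new_col = np.empty(n)
            for c in np.unique(codes[:, j]):
                mask = codes[:, j] == c
                new_col[mask] = target[mask, j].mean()
            if new_col.std() > 1e-12:  # keep the old column if the update degenerates
                Q[:, j] = standardize(new_col)
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    # Loadings are correlations between quantified variables and components.
    loadings = Vt[:p].T * s[:p] / np.sqrt(n)
    return Q, loadings, vaf(Q, p)
```

For example, os_pca_nominal(codes, p=2) applied to an integer-coded survey matrix returns quantified variables in which all objects in the same category share one value. Ordinal variables such as Likert items would additionally need a monotonicity restriction on the quantifications, which this sketch omits.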
PCA has been extended to high-dimensional data settings, where the risk of overfitting a model to the data is increased, by introducing regularization. These regularization methods are based on analogies to the ridge and lasso penalties in regression. As in regression, these penalties require tuning. Since OS-PCA includes transformations of the variables, its risk of overfitting is increased compared to linear PCA. We study two different ways to extend OS-PCA with regularization. The first is to regularize the transformations. The second is to regularize the component loadings.
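As a rough illustration of what ridge-type and lasso-type penalties do to a vector of coefficients such as component loadings, the sketch below applies the two textbook shrinkage operators. It is an assumption-laden toy, not the regularized OS-PCA method studied here: the function names and the tuning parameter lam are hypothetical, and each operator is simply the closed-form minimizer of a squared-error term plus the corresponding penalty.

```python
import numpy as np

def ridge_shrink(a, lam):
    """Minimizer of 0.5*||a - b||^2 + 0.5*lam*||b||^2: uniform shrinkage toward zero."""
    return np.asarray(a, dtype=float) / (1.0 + lam)

def soft_threshold(a, lam):
    """Minimizer of 0.5*||a - b||^2 + lam*||b||_1: shrinks and sets small entries to zero."""
    a = np.asarray(a, dtype=float)
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

# Hypothetical loadings of three variables on one component.
loadings = np.array([0.9, -0.4, 0.05])
ridge_loadings = ridge_shrink(loadings, lam=1.0)    # all entries halved, none exactly zero
lasso_loadings = soft_threshold(loadings, lam=0.1)  # smallest entry becomes exactly zero
```

The contrast is the usual one: the ridge-type operator shrinks every entry proportionally, whereas the lasso-type operator can remove a variable from a component altogether. In both cases lam would have to be tuned, for example by cross-validation.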
Prof. J. Meulman
Leiden University / IBM SPSS
22 January 2016 – 21 January 2020