Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data
Social science research has entered the era of big data: many detailed measurements are taken, and multiple sources of information are used to unravel complex multivariate relations. For example, in studying obesity as the outcome of environmental and genetic influences, researchers increasingly collect survey, dietary, biomarker and genetic data from the same individuals. Such integrated research can inform health strategies to prevent obesity. Linked multi-source data with more variables than samples (so-called high-dimensional data) form an extremely rich resource for research, but extracting meaningful and integrated information from them is challenging and not appropriately addressed by current statistical methods. A first problem is that the relevant information is hidden among a bulk of irrelevant variables, which creates a high risk of finding incidental associations. Second, the sources are often very heterogeneous, which may obscure the links between sources that reflect shared mechanisms. Hence, a statistical framework is needed that selects the relevant groups of variables within each source and links them across data sources. Simultaneous component analysis methods are particularly powerful for high-dimensional data. In this project I will contribute to the development of such a framework by extending simultaneous component analysis to allow for the identification of common components defined by relevant clusters of variables in multi-source high-dimensional data.
Dr K. van Deun
Prof. J. K. Vermunt
NWO Vidi Grant K. van Deun 2015
1 September 2016 – 1 September 2020