Methodology and Statistics
Social and Behavioral Science
Prof. J.K. Vermunt & Dr M.C. Kaptein
On October 13th 2017, Lianne Ippel will defend her thesis entitled
Multilevel Modeling for Data Streams with Dependent Observations
My PhD thesis concerns the estimation of well-known statistical models in a context where data are ‘streaming in’ over time. Estimation in such a setting can be troublesome because some models, for instance a multilevel model, require estimation methods that hold all data in memory. Moreover, whenever new data arrive, the model has to be re-estimated. Redoing the analysis every time a new data point enters is inefficient and time-consuming, and eventually becomes infeasible. In this dissertation, I introduce a commonly used approach to deal with data streams: online learning, a method that updates the result of an analysis while the data enter. Additionally, I developed an algorithm that updates rather than re-estimates the parameters of a multilevel model to deal with repeated measurements of individuals, a common data structure in data streams. This algorithm, called SEMA (Streaming Expectation Maximisation Approximation), allows researchers to analyse data streams while taking the nested or grouped structure of the data into account.
Streaming Estimation of Response Heterogeneity
In social science, we often encounter hierarchical or grouped data (e.g. observations within individuals). Taking this grouped structure into account enables the estimation of an effect per unit (e.g. per person). During this project we will study the estimation of these individual-level effects while the data enter. This is especially advantageous in interactive web applications, which often gather extremely large datasets and require short computation times. The existing methods to estimate individual-level effects are insufficient, because the estimation procedure is lengthy and/or requires all data to be held in computer memory, which can become problematic due to memory limitations. We will use methods that do not require all data in memory, because they rely on simple summaries, such as an average, which can be updated one data point at a time.
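To illustrate what updating an average one data point at a time can look like, here is a minimal Python sketch. It is not the SEMA algorithm and does not fit a multilevel model; it only keeps a running mean per individual and an overall running mean, without storing the raw observations. The names `RunningMean` and `stream_means`, and the example data, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class RunningMean:
    """Mean of all values seen so far, stored as a count and the current mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Shift the old mean towards the new value by 1/n of the difference.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


def stream_means(stream):
    """Process (person_id, value) pairs one at a time.

    Only one count and one mean per person are kept in memory,
    never the raw observations themselves.
    """
    overall = RunningMean()
    per_person = defaultdict(RunningMean)
    for person_id, value in stream:
        overall.update(value)
        per_person[person_id].update(value)
    return overall, per_person


# Hypothetical stream of repeated measurements on two individuals.
stream = [("a", 2.0), ("b", 10.0), ("a", 4.0), ("b", 14.0), ("a", 6.0)]
overall, per_person = stream_means(stream)
print(overall.mean, per_person["a"].mean, per_person["b"].mean)
```

Memory use in this sketch grows with the number of individuals rather than with the number of observations, which is the property the paragraph above relies on.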
1 October 2013 - 31 August 2017