Danielle McCool

Utrecht University
Department of Methodology and Statistics

In 2022, Danielle McCool defended her thesis ‘Addressing missing data in human movement trajectories’ at Utrecht University.

Project

Using surveys and smartphone sensors to produce time use and travel statistics

The broad focus of this project is the integration of mobile device sensors with more traditional survey methods, with the ultimate goal of developing methodology for alignment between modes for official statistics. Many areas of research are concerned with behavior that exists in a continuous context, but have relied upon either sampling or recall methods to approximate the underlying behavior of interest. As an illustrative example, consider travel behavior. General patterns of travel behavior may be of interest to a governmental body, which may require aggregate statistics to make infrastructure decisions. Given access to every individual’s precise location, no statistical methods would be required for making statements about the number of persons using public transportation in a given week. Because we have access neither to all members of the population nor to their precise location at each moment in time, these official travel statistics are usually generated by generalizing across samples of persons as well as samples of time. This is often accomplished by studies in diary format that require respondents to recall all trips made during a day.

This current methodology, known as the travel diary study, has been well established over decades of studies, but comes with known compromises. Because travel diaries are based on self-recall data, they are often biased, with respondents misreporting trip characteristics or rounding times and distances for ease of calculation. Furthermore, they are cumbersome and require no small amount of data entry, which decreases participation and is also known to increase the likelihood that those who do participate will leave out smaller trips. By including sensor data in the methodology, we aim to address these problems by increasing the precision and frequency of location measurement, and to extend the survey instrument to larger swaths of the population by reducing deployment and participation costs. The reduced burden also enables us to track participants’ habits over longer periods of time, which allows for more fine-grained answers to questions about travel behavior.

While sensor data seem to offer many fields much in the way of increased quantity and precision of data, there are known issues. Location data are susceptible to errors at multiple levels. Data can be missing for a range of reasons related either to personal or device characteristics. The expectation is that missing data from different sources will require differential treatment, and that the treatment of one source will also affect how the other missing data are addressed. Consider the difference, for example, between complete unit nonresponse, incidentally leaving one’s phone at home, and a device dying mid-trip. Although all three are sources of missingness, the overall effect on outcome measures is likely to be quite different, and the methodology employed to address each will vary as well. It would be impossible for us to impute the route through space taken by a person for whom no data have been collected. However, the missing spatial data for dropped measurements on either side of a tunnel could very well be imputed on the basis of known map characteristics. It may also be possible to impute longer stretches of missing data, given that a person travels a single route with some frequency. One aim of this project is to determine the limits of what we are able to do with missing data, given the additional dimensions of space and time inherent in semi-continuous sensor data.
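As a rough illustration of the tunnel case, the sketch below fills short gaps in a timestamped location trace and leaves long gaps untouched. It is not the project's method: straight-line interpolation stands in here for imputation based on map characteristics, and the names `Fix` and `fill_short_gaps`, the 10-second sampling step and the 120-second gap limit are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fix:
    """One location measurement (hypothetical structure for this example)."""
    t: float            # seconds since the start of the trace
    lat: float
    lon: float
    imputed: bool = False


def fill_short_gaps(fixes: List[Fix], step: float = 10.0,
                    max_gap: float = 120.0) -> List[Fix]:
    """Linearly interpolate across gaps longer than `step` but no longer
    than `max_gap`; longer gaps are left as missing.

    Assumes `fixes` is sorted by time. Both thresholds are illustrative.
    """
    if not fixes:
        return []
    out = [fixes[0]]
    for prev, nxt in zip(fixes, fixes[1:]):
        gap = nxt.t - prev.t
        if step < gap <= max_gap:
            k = 1
            # Insert one imputed point every `step` seconds inside the gap.
            while prev.t + k * step < nxt.t:
                t = prev.t + k * step
                w = (t - prev.t) / gap  # fraction of the gap covered so far
                out.append(Fix(t,
                               prev.lat + w * (nxt.lat - prev.lat),
                               prev.lon + w * (nxt.lon - prev.lon),
                               imputed=True))
                k += 1
        out.append(nxt)
    return out


# Made-up trace: a 40-second dropout (tunnel-like) and a 950-second dropout.
trace = [
    Fix(0.0, 52.000, 5.000),
    Fix(10.0, 52.001, 5.001),
    Fix(50.0, 52.005, 5.005),
    Fix(1000.0, 52.100, 5.100),
]
filled = fill_short_gaps(trace)
```

Marking imputed points explicitly keeps them distinguishable from observed measurements in later analysis, and the long dropout is deliberately left unfilled, since a straight line says little about where a person was over such a stretch.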

The first two articles arising from this project evaluate experimental microdata collected in a field test of a travel diary application at Statistics Netherlands in November and December 2018. The combination of register data for two thousand participants invited to the study and the data generated by over six hundred respondents will be used to develop methodology for identifying, categorizing and reducing the impact of missing and erroneous data. Further evaluations investigating the extent to which the collected sensor data differ from the data collected in previous official surveys may also form part of this project in a separate article. In line with the overall goals of evaluating and improving the use of sensor data, two further papers are proposed on the integration of sensor data in time use studies. There, the use of geospatial data to provide more detailed information will require similar modeling and correction for nonresponse, but may add new features such as utilizing location coordinates to evaluate and correct for missing annotative data.

Supervisors

Prof. dr. Barry Schouten, dr. Peter Lugtig

Financed by

Waarneem-Innovatie Netwerk (CBS/UU)

Period

15 July 2018 – 15 July 2022