Methods & Statistics

Faculty of Social Sciences

Utrecht University

**Supervisors**
Prof. P.G.M. van der Heijden (UU)

Prof. B.F.M. Bakker (VU University Amsterdam / CBS)

On July 8th 2016, Susanna Gerritse defended her thesis entitled

**An application of population size estimation to official statistics**

**Summary of thesis**

Official statistics bureaus are periodically asked to give an estimate of their country’s population, which can be defined by the number of usual residents. A person is considered a usual resident when they have lived in the Netherlands for longer than a year, or if they have the intention to reside for longer than a year. For the Dutch Census, Statistics Netherlands makes use of the Population Register (PR). However, for numerous reasons, immigrants who have taken up residence in the Netherlands may not register and become undocumented immigrants. Thus, the PR alone is not sufficient to estimate the number of usual residents, and it undercovers the number of Dutch usual residents. One commonly used method to estimate population sizes is the capture-recapture methodology. First, the PR is linked to two other registers. Then capture-recapture methodology using a covariate that denotes residence duration can be used to estimate the number of usual residents missed by all three registers. However, for the valid use of capture-recapture methodology, a set of assumptions has to be met. Additionally, practical issues such as missing data may occur. Such practical issues have to be resolved before one can estimate the number of Dutch usual residents via capture-recapture methodology. For that purpose, two central questions are answered in this thesis: 1) what is the effect of violated assumptions and missing data on the robustness of population size estimation via capture-recapture methodology, and 2) how can the information gained in 1) be used to achieve a trustworthy estimate of the undercoverage of usual residents in the Population Register in the Netherlands?

To answer the first question in this thesis, research has been conducted into the robustness of population size estimation via capture-recapture methodology when the following assumptions are violated: 1) independence of the inclusion probabilities of the registers, 2) no erroneous captures in the registers, and 3) perfect linkage of the units in the used registers. For the independence assumption, this research also investigated the robustness for independence conditional on fully and partially observed covariates. Additionally, research has been conducted into the effect of missing data on population size estimation, and most notably into how different methods of handling missing data differ in their effect on the resulting population size estimate. It has been found that the implied coverage of one register, given the other register, is important for the extent to which violated assumptions bias the population size estimate. Implied coverage plays an important role in this thesis because it cannot be ascertained from the data whether assumptions are violated, whereas implied coverage can be ascertained. The results obtained in answering the first question have been used to conduct research into the undercoverage of the PR of the Netherlands. It is concluded that for reference date September 2010, the PR has an undercoverage of 0.5 to 1.1% of usual residents.

**Project description**

**The estimation of population size and population characteristics using incomplete registries**

A well-known technique for estimating the size of a human population is to find two or more registries of this population, link the individuals in the registries, and estimate the number of individuals that occur in none of the registries (Fienberg, 1972; Bishop, Fienberg and Holland, 1975; Cormack, 1989; International Working Group for Disease Monitoring and Forecasting, 1995). If there are two registries, A and B, ‘being in registry A’ and ‘being in registry B’ are considered as variables with levels ‘yes’ and ‘no’, and estimation takes place under the assumption that A and B are independent. This is one of the key assumptions and its violation may have a substantial impact, in particular when there is little overlap between the registries (see below, in section 3b). One approach to making the impact of a possible violation of this assumption less severe is to include covariates in the model, in particular covariates whose levels have heterogeneous inclusion probabilities for both registries (see Bishop, Fienberg and Holland, 1975). Loglinear models can then be fitted to the higher-way contingency table of registries A and B and the covariates. The restrictive independence assumption is replaced by a less restrictive assumption of independence of A and B conditional on the covariates, and subpopulation size estimates are derived (one for every level of the covariates) that add up to a population size estimate.
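
For the two-registry case, the estimate under independence can be written out explicitly. The notation below is an assumption introduced here for illustration and does not appear in the project description: n11 is the number of individuals found in both registries, n10 the number found in A only, n01 the number found in B only, and m00 the unobserved number missed by both.

```latex
\hat{m}_{00} = \frac{n_{10}\, n_{01}}{n_{11}},
\qquad
\hat{N} = n_{11} + n_{10} + n_{01} + \hat{m}_{00}
```

With hypothetical counts n11 = 600, n10 = 300 and n01 = 200, this gives an estimated 100 individuals missed by both registries and an estimated population size of 1200. Because n11 appears in the denominator, the estimate becomes unstable when the overlap between the registries is small.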

Recently, van der Heijden, Whittaker, Cruyff, Bakker and van der Vliet (submitted) have further developed this approach. Consider a contingency table formed from the two registries and the covariates. They showed that, for specific loglinear models, the contingency table is collapsible over covariates in the sense that the population size estimate remains unchanged after collapsing the contingency table. To give a simple example, assume that the registries are A and B, the covariate is X, and the loglinear model is [AX][B]. In this situation the contingency table of the three variables AxBxX is collapsible over X in the sense that the population size estimate under loglinear model [AX][B] in the table AxBxX is identical to the population size estimate under loglinear model [A][B] in the contingency table AxB. This result is extended by van der Heijden et al. (submitted) to situations with more covariates.
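
The collapsibility in this example can be checked numerically. The sketch below is not taken from van der Heijden et al.; it fits [AX][B] to the observed cells of a small table by iterative proportional fitting, a standard way to fit such models, and compares the result with the [A][B] estimate in the collapsed table. The counts, function names and cell coding (1 = in registry, 0 = not in registry) are hypothetical choices made for illustration.

```python
def fit_ax_b(counts, n_iter=100):
    """Fit loglinear model [AX][B] to the observed cells by iterative
    proportional fitting. `counts` maps (a, b, x) to an observed count;
    the cells (0, 0, x) are absent because they are never observed."""
    fit = {cell: 1.0 for cell in counts}
    ax_groups = sorted({(a, x) for a, _, x in counts})
    b_groups = sorted({b for _, b, _ in counts})
    for _ in range(n_iter):
        # Rescale so the fitted A-by-X margins match the observed ones.
        for a, x in ax_groups:
            cells = [c for c in counts if (c[0], c[2]) == (a, x)]
            scale = sum(counts[c] for c in cells) / sum(fit[c] for c in cells)
            for c in cells:
                fit[c] *= scale
        # Rescale so the fitted B margin matches the observed one.
        for b in b_groups:
            cells = [c for c in counts if c[1] == b]
            scale = sum(counts[c] for c in cells) / sum(fit[c] for c in cells)
            for c in cells:
                fit[c] *= scale
    return fit


def estimate_total(counts, fit):
    """Observed total plus the estimated cells (0, 0, x). Without an A-B
    interaction, each missed cell equals m01x * m10x / m11x."""
    levels = sorted({x for _, _, x in counts})
    missed = sum(fit[(0, 1, x)] * fit[(1, 0, x)] / fit[(1, 1, x)] for x in levels)
    return sum(counts.values()) + missed


# Hypothetical counts, for illustration only: (a, b, x) -> count.
counts = {
    (1, 1, 0): 400, (1, 0, 0): 100, (0, 1, 0): 150,
    (1, 1, 1): 200, (1, 0, 1): 200, (0, 1, 1): 50,
}

# Estimate under [AX][B] in the three-way table AxBxX.
n_full = estimate_total(counts, fit_ax_b(counts))

# Estimate under [A][B] in the table AxB, collapsed over X.
n11 = sum(v for (a, b, _), v in counts.items() if (a, b) == (1, 1))
n10 = sum(v for (a, b, _), v in counts.items() if (a, b) == (1, 0))
n01 = sum(v for (a, b, _), v in counts.items() if (a, b) == (0, 1))
n_collapsed = n11 + n10 + n01 + n10 * n01 / n11

print(n_full, n_collapsed)
```

For these counts the two estimates agree up to the numerical tolerance of the fitting loop, even though the odds of being in B differ between the two levels of X in the observed data.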

Van der Heijden et al. (submitted) introduce the terminology of *active* and *passive* covariates: an active covariate is a covariate whose presence in the contingency table has an impact on the estimate of the population size, and a passive covariate is a covariate whose presence in the contingency table has no impact on that estimate. In the contingency table AxBxX, when the loglinear model is [AX][B], covariate X is a passive covariate, but when the loglinear model is [AX][BX], X is an active covariate, because in this latter case the population size estimate under loglinear model [AX][BX] in the three-way array differs from the population size estimate in the two-way contingency table AxB under loglinear model [A][B].
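
A small numerical sketch of an active covariate, using hypothetical counts and helper names that are not part of the project description: under [AX][BX], registries A and B are independent within each level of X, so the missed cell is estimated separately per level and the results are summed. Collapsing over X and applying [A][B] gives a different total.

```python
def missed_by_both(n11, n10, n01):
    """Estimated count missed by both registries under independence.
    n11: in A and B; n10: in A only; n01: in B only."""
    return n10 * n01 / n11


# Hypothetical counts per level of X: x -> (n11, n10, n01).
strata = {0: (400, 100, 150), 1: (200, 200, 50)}

observed = sum(sum(cells) for cells in strata.values())

# [AX][BX]: estimate the missed cell within each level of X, then add up.
n_active = observed + sum(missed_by_both(*cells) for cells in strata.values())

# [A][B]: collapse the table over X first, then estimate once.
pooled = tuple(sum(cells[i] for cells in strata.values()) for i in range(3))
n_collapsed = observed + missed_by_both(*pooled)

print(n_active, n_collapsed)  # 1187.5 1200.0
```

Here X is active because the odds of being in B, given presence in A, differ between the two levels of X; the stratified estimate therefore differs from the collapsed one.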

A practical problem in population size estimation studies is that the number of covariates available in both registries (or available in the same format) is usually limited to, for example, gender and age. However, this problem was recently solved by Zwane and van der Heijden (2007; see also Van der Heijden, Zwane and Hessen, 2009), who show how to include covariates that are not available in all registries in the loglinear model. If a variable is only available in registry A, then it is missing for those observations that are in registry B but not in A. Zwane and van der Heijden use missing data approaches to estimate these observations. Assume that the set of covariates available only in registry A is denoted by X1, the set of covariates available only in registry B is denoted by X2, and the set of covariates available in both registries A and B is denoted by X3. Then certain loglinear interaction parameters cannot be identified due to the missing data problem, and the so-called saturated or maximal model is [AX2X3][BX1X3][X1X2X3]. Van der Heijden et al. (submitted) show that under this loglinear model all covariates X1, X2 and X3 are active. Interestingly, when X1 and X2 are independent conditional on X3, then X1 and X2 become passive covariates.

One of the advantages of this approach is that characteristics of the hidden population are estimated, provided that the above-mentioned assumptions are not violated. Thus this approach makes it possible to study the composition of the hidden population.

The aim of this PhD project is to further elaborate this new development.

**Financed by**
Utrecht University / Statistics Netherlands (CBS)