Debby ten Hove

Social and Behavioral Sciences
Child Development and Education
University of Amsterdam

Personal webpage Debby ten Hove


A comprehensive framework for estimating and interpreting interrater reliability for dependent data
The proposed research aims to solve several problems concerning the interpretation and estimation of interrater reliability (IRR).

The first problem involves the different conceptualizations of IRR. The available IRR coefficients can be roughly divided into those that conceptualize IRR in terms of counting (dis)agreements (e.g., Cohen, 1960, 1968; Krippendorff, 1970, 2004) and those that conceptualize IRR in terms of variance components (e.g., Shrout & Fleiss, 1979). These coefficients yield different IRR estimates when applied to the same data (Li, Yi, & Andrews, 2018; ten Hove, Jorgensen, & van der Ark, 2018), which is due both to differences in their specific formulas and to differences in their underlying conceptualizations of IRR. We argue that researchers should select an IRR coefficient based on its underlying conceptualization of IRR and the usefulness of that conceptualization in a given research setting. Although many researchers report an IRR coefficient[1], the implications of different IRR conceptualizations for a research setting are largely overlooked. For example, we know that reliability provides information about measurement precision and about the loss of statistical power due to attenuation, but the implications of different conceptualizations of IRR for subsequent analyses are unclear. In Project 1, we aim to link the conceptualizations of IRR behind the most often reported IRR coefficients (i.e., Cohen, 1960, 1968; Hayes & Krippendorff, 2007; McGraw & Wong, 1996; Shrout & Fleiss, 1979)[2] to underlying theories of reliability (e.g., classical test theory, generalizability theory) and to investigate their implications in a research context.
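The contrast between the two families of coefficients can be illustrated with a minimal sketch. It computes Cohen's (1960) kappa, which counts (dis)agreements, and the one-way single-rating intraclass correlation, ICC(1,1) in the notation of Shrout and Fleiss (1979), which is built from variance components, on the same small set of ratings. The function names and the toy ratings are assumptions made for illustration only; they are not taken from the project.

```python
from collections import Counter


def cohen_kappa(r1, r2):
    """Cohen's (1960) unweighted kappa for two raters (agreement-based)."""
    n = len(r1)
    # Observed proportion of exact agreements.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Agreement expected by chance from the two raters' marginal distributions.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def icc_oneway(ratings):
    """ICC(1,1): one-way random-effects model, single rating (variance-based).

    `ratings` holds one row per subject, each with the same number k of ratings.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subjects mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(ratings, means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical ratings of six subjects by two raters on a 3-point scale.
rater1 = [1, 2, 3, 1, 2, 3]
rater2 = [1, 2, 3, 2, 3, 3]

kappa = cohen_kappa(rater1, rater2)
icc = icc_oneway(list(zip(rater1, rater2)))
print(round(kappa, 3), round(icc, 3))  # 0.5 0.778
```

In this toy example both coefficients describe the same ratings, yet kappa treats every disagreement as equally severe, whereas the ICC treats the ratings as numeric and weighs disagreements by their size.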

The second problem concerns the interpretation of IRR. Several benchmarks are available to assess the magnitude of IRR, some of which are extremely popular (e.g., Landis & Koch, 1977, is cited over 10,000 times). These benchmarks are problematic because factors such as the number of raters, the number of response categories, and the marginal distribution of the responses affect the sampling distributions of IRR coefficients. Moreover, a single set of benchmarks cannot be useful for all available IRR coefficients, because different coefficients yield different IRR estimates when applied to the same data. Absolute benchmarks for interpreting the magnitude of IRR are therefore not useful, and the value of an IRR coefficient should instead be assessed by criteria that make sense in the research context, such as the loss of power or measurement precision due to attenuation. In Project 2, we will conduct a small-scale literature review to select the most popular benchmarks for IRR coefficients. In a first simulation study, we will investigate how several benchmarks relate to each other and to the different IRR coefficients under specific conditions that have been shown to be related to the magnitude of IRR. We will also use example data to show more meaningful ways to interpret IRR, such as calculating the implied loss of power and measurement precision. In a second simulation study, we will test how the estimates of the most often reported IRR coefficients affect the statistical power and the measurement precision of individual assessments, using design factors such as the magnitude of IRR, effect size, (marginal) data distributions, number of raters, and number of subjects.
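To make "loss of power due to attenuation" concrete, the sketch below uses two textbook results rather than anything specific to this project: Spearman's attenuation formula from classical test theory, and the Fisher z approximation to the power of a two-sided test of a correlation. The function names, the reliability of 0.64, the true correlation of 0.30, and the sample size of 100 are all hypothetical values chosen for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def attenuated_r(true_r, rel_x, rel_y=1.0):
    """Expected observed correlation under classical test theory (Spearman)."""
    return true_r * sqrt(rel_x * rel_y)


def approx_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0 (Fisher z).

    Ignores the negligible probability of rejecting in the wrong tail.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(atanh(r)) * sqrt(n - 3) - z_crit)


# Hypothetical scenario: true correlation .30, rated variable with
# reliability .64, perfectly reliable second variable, 100 subjects.
r_observed = attenuated_r(0.30, rel_x=0.64)
power_perfect = approx_power(0.30, n=100)
power_attenuated = approx_power(r_observed, n=100)
print(round(r_observed, 2), power_attenuated < power_perfect)
```

Read this way, the same reliability value can be acceptable or unacceptable depending on the effect size and sample size at hand, which is the kind of context-dependent interpretation that fixed benchmarks cannot provide.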

The third problem concerns IRR for dependent data. In standard two-level designs, with subjects (level one) nested within clusters (level two), level-one data are often modelled both at the subject level and at an aggregated level. This level-two aggregate can even have its own qualitative interpretation. For data obtained by raters, this implies two facets of theoretical interest for which a researcher would want to know the IRR (for a similar discussion concerning test reliability, see Geldhof, Preacher, & Zyphur, 2014). None of the existing conceptualizations of IRR theorizes IRR for these distinct facets of theoretical interest, although ignoring them results in less informative estimates. In Project 3, we will provide a conceptual framework for IRR in two-level nested data, based on generalizability theory (GT). We will also provide models to estimate IRR for this type of data and check the properties of the resulting model in a Monte Carlo simulation study.

Problem four involves the more specific case of independent observers rating interdependent network data. In such data, dyads of people are nested within both actors and partners, so the interdependent actor and partner scores imply multiple interdependent facets of theoretical interest (i.e., actor, partner, and dyad components) for which a researcher might want to know the IRR. IRR should thus be conceptualized for the different components in interdependent network data, and estimation methods are needed to accommodate this complex data structure: ratings nested within dyads nested within both actors and partners. In Project 4, to establish a definition of IRR for interdependent network data, we will extend the social relations model (SRM; Kenny, 1996; Kenny & Lavoie, 1984) approach to the variance decomposition of network data within the framework of GT. We will define IRR within the framework of GT by means of intraclass correlation coefficients (ICCs), specify a model for estimating the variance components, and check the estimator’s properties in a simulation study.
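For orientation, the basic SRM decomposition that such an extension would start from can be written as follows. This is a sketch of the standard single-observation SRM only; the symbols are conventional choices rather than notation from the project, and the rater-related components that an IRR definition would require are deliberately not shown because the source does not specify them.

```latex
% Score of actor i with partner j (i \neq j):
% grand mean + actor effect + partner effect + relationship (dyad) effect
\[
  X_{ij} = \mu + a_i + b_j + g_{ij}
\]
% Assuming the actor effect of person i, the partner effect of person j,
% and the relationship effect are mutually uncorrelated:
\[
  \mathrm{Var}(X_{ij}) = \sigma^2_a + \sigma^2_b + \sigma^2_g
\]
```

With a single observation per dyad, the relationship variance is confounded with measurement error, which is one reason why multiple ratings per dyad are informative.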

Problem five involves discrete data. Ratings of subjects are often at a nominal or ordinal measurement level, which complicates estimation procedures for IRR coefficients founded on variance components. In Project 5, we will extend the models for two-level nested data and interdependent network data to models that handle discrete nested data.
For each project, we will explain and illustrate the newly proposed concepts and measures using empirical data, and we will make our computer software publicly available.

[1] On November 6, 2018, the term “Interrater reliability” yielded > 300,000 hits in Google Scholar.

[2] On November 6, 2018, Google Scholar showed 30,259 citations of Cohen (1960), 6,435 citations of Cohen (1968), 2,279 citations of Hayes & Krippendorff (2007), 4,191 citations of McGraw & Wong (1996), and 17,573 citations of Shrout & Fleiss (1979).

Prof. dr. L.A. van der Ark, dr. T.D. Jorgensen

Financed by
The Graduate School of Child Development and Education (University of Amsterdam); Research Institute of Child Development and Education (University of Amsterdam)


1 September 2018 – 31 August 2022