u/CogitoErgoOverthink

A question on the estimation of reliability in longitudinal data

I’ve been researching the problem of test-retest reliability for a while now and I’m curious how others are handling the identifiability issues that come with longitudinal data.

In psychology we are usually taught that retest reliability is a simple correlation between two time points. The problem is that this assumes the underlying trait is perfectly stable and the measurement error is completely random. In my opinion these assumptions are basically never met in real-world data, because even the most stable traits usually only correlate at about 0.6 to 0.8 over time.

I recently published a paper in Applied Psychological Measurement where I demonstrated that when these assumptions are not exactly met, the resulting retest coefficient is entirely uninterpretable. Moreover, these assumptions are not testable from the design itself; the two-wave framework is essentially a black box. A simple correlation cannot tell you whether a low coefficient means your scale is noisy or your participants actually changed, because the design gives you fewer knowns than unknowns (reliability at each wave plus true stability).
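To make that counting argument concrete, here is a toy simulation (not from my paper, just an illustration with made-up numbers): a noisy scale measuring a very stable trait and a precise scale measuring a changing trait produce the same observed retest correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_retest(reliability, stability, n=200_000):
    """Two-wave design with standardized true scores and purely random
    measurement error; returns the observed test-retest correlation."""
    t1 = rng.standard_normal(n)                                 # true score, wave 1
    t2 = stability * t1 + np.sqrt(1 - stability**2) * rng.standard_normal(n)
    err_sd = np.sqrt((1 - reliability) / reliability)           # error SD implied by the reliability
    x1 = t1 + err_sd * rng.standard_normal(n)                   # observed scores
    x2 = t2 + err_sd * rng.standard_normal(n)
    return np.corrcoef(x1, x2)[0, 1]

# Noisy scale, very stable trait...
print(simulate_retest(reliability=0.70, stability=0.90))   # ~0.63
# ...versus precise scale, changing trait.
print(simulate_retest(reliability=0.90, stability=0.70))   # ~0.63
```

Both runs land at roughly r = 0.63 (reliability times stability), which is exactly the problem: the observed coefficient alone cannot tell the two stories apart.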

I am definitely not alone in this critique. A paper that came out earlier this year by Tufiş, Alwin, and Ramírez in the Journal of Survey Statistics and Methodology reaches a similar conclusion using GSS data. They argue it is a bit of a Catch-22: we rely on these coefficients because they are easy to calculate, even though they are often fundamentally uninterpretable for most psychological and sociological constructs.

The classic fix for this is the Heise 1969 framework. If you have three waves of data, Heise showed you can algebraically separate reliability from stability using the three observed wave-to-wave correlations. It is a neat trick, but as I've dug into it the limitations are pretty glaring: it requires constant measurement precision across waves and a strictly Markovian process for trait change. More importantly, with only three waves these assumptions are mathematically untestable, so you are basically just trading one set of blind assumptions for another.
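For reference, the Heise algebra itself is only a few lines once you have the three correlations. This sketch assumes the standard setup (equal reliability at every wave, lag-1 simplex change in the true scores):

```python
def heise_decomposition(r12, r23, r13):
    """Separate reliability from stability in a three-wave design under
    Heise's (1969) assumptions: equal reliability across waves and
    Markovian (lag-1 simplex) change, so that r_ij = reliability * s_ij
    and s13 = s12 * s23."""
    reliability = r12 * r23 / r13     # shared reliability coefficient
    s12 = r13 / r23                   # true-score stability, wave 1 -> 2
    s23 = r13 / r12                   # true-score stability, wave 2 -> 3
    s13 = s12 * s23                   # implied stability, wave 1 -> 3
    return reliability, s12, s23, s13

# Hypothetical wave-to-wave correlations from a three-wave panel
print(heise_decomposition(r12=0.60, r23=0.58, r13=0.50))
```

Because three observed correlations determine exactly three parameters, the solution always reproduces the data perfectly, which is precisely why nothing here is testable with only three waves.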

I am looking to move past the 1960s-era CTT math on this. I am wondering if anyone here has found success using more modern latent trait models or SEM-based approaches to reliably differentiate trait stability from measurement error. Specifically, I want to know how people are actually implementing Latent State-Trait models when they don't have massive multi-indicator datasets. Are there Bayesian or Dynamic SEM approaches that allow us to identify these components without needing a ridiculous number of waves? I would love to hear if there is a better modern standard I should be looking at that moves beyond the Heise framework.
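On the "how many waves" question, one reason people push past three waves is that the same single-indicator quasi-simplex idea becomes overidentified at four: six observed correlations against four parameters, so the equal-reliability and Markov assumptions finally leave residual misfit you can inspect. This is only a least-squares toy sketch with made-up numbers (a real analysis would use an SEM package and proper ML fit statistics), not a full LST model:

```python
import numpy as np
from scipy.optimize import minimize
from itertools import combinations

def implied_corr(rel, stabs, n_waves=4):
    """Implied correlations under a single-indicator quasi-simplex with
    equal reliability and lag-1 change:
    corr(X_i, X_j) = rel * product of adjacent stabilities between i and j."""
    R = np.eye(n_waves)
    for i, j in combinations(range(n_waves), 2):
        R[i, j] = R[j, i] = rel * np.prod(stabs[i:j])
    return R

def fit_quasi_simplex(R_obs):
    """Least-squares fit of (reliability, s12, s23, s34) to an observed
    4x4 correlation matrix. Six correlations, four parameters, so the
    assumptions are overidentified and leave misfit to check."""
    iu = np.triu_indices(4, k=1)

    def loss(params):
        rel, *stabs = params
        return np.sum((implied_corr(rel, np.array(stabs))[iu] - R_obs[iu]) ** 2)

    res = minimize(loss, x0=[0.8, 0.8, 0.8, 0.8], bounds=[(0.01, 1.0)] * 4)
    return res.x, res.fun   # estimates and residual misfit

# Hypothetical observed correlations from a four-wave panel
R_obs = np.array([[1.000, 0.630, 0.554, 0.510],
                  [0.630, 1.000, 0.616, 0.567],
                  [0.554, 0.616, 1.000, 0.644],
                  [0.510, 0.567, 0.644, 1.000]])
print(fit_quasi_simplex(R_obs))   # ~ (0.70, 0.90, 0.88, 0.92), misfit ~ 0
```

The two leftover degrees of freedom are what let you actually test the constant-reliability and Markov assumptions instead of just asserting them, which is the part I am hoping the LST / Bayesian / DSEM crowd has cleaner machinery for.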

My paper: https://journals.sagepub.com/doi/full/10.1177/01466216251401213

The Tufiş et al. 2024 paper: https://academic.oup.com/jssam/article/12/4/1011/7484622
