Cognitive Model Validation — Good Practices

Status: emerging
Last updated: 2026-06-08
Sources: Heathcotemodelingintro
Tags: [cognitive-modeling, methodology, evidence-accumulation, diffusion-model, parameter-recovery, model-selection, response-time]

Summary

Cognitive process models decompose observed behaviour into estimates of latent cognitive processes, but the validity of any conclusion drawn from them depends on checks that are often left implicit. Heathcote, Brown and Wagenmakers (2015) set out five sanity checks every modeller should apply: parameter recovery simulations, tests of selective influence, quantification of parameter uncertainty, display and checking of model fit, and model selection. The chapter illustrates each with evidence accumulation models — chiefly the diffusion model and the linear ballistic accumulator (LBA) — fitted to choice response-time data. The guiding aim is a defensible connection from observed behaviour to unobserved psychological process, which is the same inference problem that underlies gaze-based measurement of cognitive states.

Body

Context

Heathcote, Brown and Wagenmakers (2015), writing in An Introduction to Model-Based Cognitive Neuroscience, address a methodological problem rather than an empirical one: how to know when the parameters of a cognitive process model can be trusted. Their scope is the family of evidence accumulation models for simple decisions, chiefly the diffusion model (Ratcliff, 1978) and the linear ballistic accumulator (Brown & Heathcote, 2008), and the worked examples use lexical-decision data. The chapter's starting premise, that cognitive processes cannot be observed directly and must be measured through their impact on overt behaviour, is the same epistemology that motivates indirect indices of cognitive state in this knowledge base, such as the task-evoked pupillary response (see Pupil Dilation Cognitive Load) and model-based event classification of raw gaze (see Fixation Saccade Detection). It also supplies the modelling-standards backdrop for predictive cognitive models of attention such as the one in Visual Occlusion Attentional Demand.

Key Points

The inference problem. Observed task performance is the end result of an unknown combination of several cognitive processes — for a speeded line-tilt judgement, perceptual encoding speed, evidence accumulation efficiency, decision threshold, and motor execution all contribute — so behaviour cannot be read blindly as an index of any one process (PDF pp. 1–2, orig. pp. 25–26). A process model is what untangles them, and the LBA is given as an example in which two accumulators race to a threshold and the threshold parameter expresses response caution (PDF p. 2, orig. p. 26). Conclusions are only as good as the model's plausibility, which is why explicit checks are required.
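The race described above can be sketched in a few lines of Python. This is a minimal illustration and not the chapter's code: the function names, the parameter names (v, b, A, s, t0) and every numeric value are assumptions chosen for the example. It follows the usual description of the LBA, in which each accumulator starts at a uniformly distributed point, rises linearly at a normally distributed rate, and the first to reach the threshold b determines the response.

```python
import random

def simulate_lba_trial(rng, v=(1.0, 0.6), b=1.0, A=0.5, s=0.3, t0=0.2):
    """Simulate one trial of a two-accumulator LBA race (illustrative values).

    v  : mean drift rate of each accumulator
    b  : response threshold (higher = more cautious); must exceed A
    A  : upper bound of the uniform start-point distribution
    s  : between-trial standard deviation of the drift rate
    t0 : non-decision time in seconds
    Returns (choice, rt), where choice is the index of the winning accumulator.
    """
    while True:
        finish_times = []
        for mean_rate in v:
            start = rng.uniform(0.0, A)
            rate = rng.gauss(mean_rate, s)
            # An accumulator with a non-positive rate never reaches threshold.
            finish_times.append((b - start) / rate if rate > 0 else float("inf"))
        fastest = min(finish_times)
        if fastest != float("inf"):  # redraw the trial if nobody can finish
            return finish_times.index(fastest), t0 + fastest

def mean_rt(b, n_trials=5000, seed=1):
    """Average simulated RT for a given threshold, using a fixed seed."""
    rng = random.Random(seed)
    return sum(simulate_lba_trial(rng, b=b)[1] for _ in range(n_trials)) / n_trials

print("mean RT, lower threshold :", round(mean_rt(b=0.8), 3))
print("mean RT, higher threshold:", round(mean_rt(b=1.4), 3))
```

Raising b while holding everything else fixed lengthens every simulated decision time, which is the sense in which the threshold parameter expresses response caution.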

Parameter recovery. A parameter recovery simulation generates synthetic data from known parameter values, refits with the same procedure used on real data, and compares recovered estimates against the true values, giving a read on both bias (accuracy) and variability (reliability) (PDF p. 3, orig. p. 27). The settings should mirror the real study — same optimiser, sample size, and effect sizes — except when the goal is itself methodological, such as asking what sample size is needed to identify an effect (PDF p. 3, orig. p. 27). The same logic extends to model recovery: synthetic data from each candidate model are fit with all candidates, and the accuracy of recovering the generating model measures how well the models can be discriminated (PDF p. 3, orig. p. 27).
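The simulate, refit, compare loop can be sketched as follows. To keep the example short and dependency-free it uses a shifted exponential RT distribution as a deliberately simple stand-in for an evidence accumulation model, because its maximum-likelihood estimates have a closed form; the model choice, the function names and all parameter values are assumptions made for illustration, not the chapter's procedure.

```python
import random
import statistics

def simulate_rts(rng, n, shift, rate):
    """Generate n synthetic RTs from a shifted exponential (toy stand-in model)."""
    return [shift + rng.expovariate(rate) for _ in range(n)]

def fit_shifted_exponential(rts):
    """Maximum-likelihood fit: shift = min RT, rate = 1 / (mean RT - min RT)."""
    shift_hat = min(rts)
    rate_hat = 1.0 / (statistics.fmean(rts) - shift_hat)
    return shift_hat, rate_hat

def recovery_study(true_shift=0.3, true_rate=4.0, n_trials=200, n_reps=500, seed=7):
    """Simulate from known values, refit, and summarise bias and variability."""
    rng = random.Random(seed)
    fits = [fit_shifted_exponential(simulate_rts(rng, n_trials, true_shift, true_rate))
            for _ in range(n_reps)]
    summary = {}
    for name, truth, estimates in (("shift", true_shift, [f[0] for f in fits]),
                                   ("rate", true_rate, [f[1] for f in fits])):
        summary[name] = {
            "bias": statistics.fmean(estimates) - truth,  # accuracy
            "sd": statistics.stdev(estimates),            # reliability
        }
    return summary

result = recovery_study()
for name, stats in result.items():
    print(f"{name}: bias={stats['bias']:+.4f}, sd={stats['sd']:.4f}")
```

In a real study n_trials and the generating values would be set to match the design being analysed, and the fitting function would be the same routine applied to the real data.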

Selective influence. Testing selective influence checks whether an experimental manipulation moves only the parameter it is theorised to move and leaves the others unchanged (PDF p. 7, orig. p. 31). The conventional assumption in evidence accumulation models is that stimulus factors affect accumulation rates while instruction factors such as speed-versus-accuracy emphasis affect caution and bias parameters; the chapter treats this as a hypothesis to be tested, not assumed (PDF pp. 7–8, orig. pp. 31–32).

Quantifying uncertainty. A point estimate alone is not interpretable without a measure of its uncertainty. A parameter estimated at 50% means little if its confidence (or credible) interval runs from 10% to 90% (PDF p. 21, orig. p. 45, Exercise solution). The chapter sets out interval methods — including bootstrap resampling of data or of fits, and Bayesian credible intervals from posterior sampling — as the means of attaching uncertainty to estimates before any condition or group comparison is drawn (PDF pp. 8–11, orig. pp. 32–35).
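A minimal sketch of the data-resampling bootstrap is given below. For brevity it attaches an interval to the median RT rather than to a fitted model parameter; for a model parameter the estimator argument would be the fitting routine itself. The function names, the simulated RTs and the choice of a percentile interval are assumptions made for the example.

```python
import random
import statistics

def percentile_interval(values, level=0.95):
    """Central `level` interval of a list, read off the sorted values (no interpolation)."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lower_index = int(tail * (len(ordered) - 1))
    upper_index = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lower_index], ordered[upper_index]

def bootstrap_ci(data, estimator, n_boot=2000, level=0.95, seed=11):
    """Nonparametric bootstrap: resample trials with replacement and re-estimate."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return percentile_interval(replicates, level)

# Hypothetical RTs in seconds, simulated here only so that the example runs.
data_rng = random.Random(3)
rts = [0.3 + data_rng.expovariate(4.0) for _ in range(150)]

estimate = statistics.median(rts)
low, high = bootstrap_ci(rts, statistics.median)
print(f"median RT = {estimate:.3f} s, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

Reporting the interval next to the point estimate lets a reader judge whether a difference between conditions is large relative to the estimation noise.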

Showing model fit. Fit must be displayed, not just summarised by a single statistic. For continuous response-time data the recommended approach summarises each RT distribution by quantiles (commonly the 0.1, 0.3, 0.5, 0.7, and 0.9 quantiles, that is the 10th to 90th percentiles) and plots them, often as quantile-probability plots (PDF p. 12, orig. p. 36). Both model and data should be plotted as points: plotting only lines can hide mis-fit because intersecting lines can fool the eye into reading an inaccurate fit as accurate (PDF p. 14, orig. p. 38). Plots should carry error bars appropriate to the comparison of interest (for example, within-subject standard errors for within-subject effects), and per-participant plots are recommended alongside the average (PDF p. 14, orig. p. 38).
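The quantile summary can be sketched as below. The code computes the five quantiles for an observed and a model-predicted RT sample and prints them side by side, which gives the paired points that a quantile plot would display. Both RT lists are hypothetical values invented for the example, and the linear-interpolation rule is one common convention among several.

```python
def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list, for 0 <= p <= 1."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def rt_quantiles(rts, probs=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Summarise an RT distribution by the quantiles at the given probabilities."""
    ordered = sorted(rts)
    return [quantile(ordered, p) for p in probs]

# Hypothetical RTs in seconds: one observed sample, one simulated from a fitted model.
observed = [0.42, 0.45, 0.48, 0.51, 0.55, 0.58, 0.63, 0.70, 0.79, 0.93, 1.20]
predicted = [0.41, 0.46, 0.49, 0.53, 0.56, 0.60, 0.66, 0.72, 0.84, 0.99, 1.10]

probs = (0.1, 0.3, 0.5, 0.7, 0.9)
for p, obs_q, pred_q in zip(probs, rt_quantiles(observed), rt_quantiles(predicted)):
    # Each row is one data point and one model point for the same quantile.
    print(f"q={p:.1f}  data={obs_q:.3f}  model={pred_q:.3f}  diff={pred_q - obs_q:+.3f}")
```

Keeping the data and model values as separate points, rather than joining only the model values with a line, makes each per-quantile discrepancy visible.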

Model selection. With several parameters and several design factors the number of model variants grows explosively: 2^(f×m) for f factors freely assigned to m parameter types under hierarchical assumptions alone (PDF p. 15, orig. p. 39). Selection cannot rest on goodness-of-fit alone, since among nested variants the least-constrained model always fits at least as well as its restricted versions, so choosing on fit would reward over-fitting. Penalised-misfit criteria are used instead, with AIC = D + 2k and BIC = D + k·log(n) as the maximum-likelihood options, where D is the deviance (the misfit), k the number of free parameters, and n the number of data points, and with DIC/BPIC or the Bayes factor as Bayesian alternatives. BIC imposes the harsher complexity penalty whenever log(n) exceeds 2, which holds for any n of 8 or more and therefore at typical sample sizes (PDF p. 16, orig. p. 40). The worked lexical-decision example fits 512 diffusion variants and shows AIC and BIC selecting models that differ on a theoretically central point, namely whether emphasis affects accumulation rate, which means the criteria alone do not settle the question and further evidence is needed (PDF pp. 17–18, orig. pp. 41–42). Selection therefore requires judgement: over-fitting shows up as unstable or psychologically nonsensical parameter values, and model averaging is sometimes preferable to selecting a single model (PDF pp. 15–16, orig. pp. 39–40).
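The two maximum-likelihood criteria are simple enough to compute directly. The sketch below applies them to two candidate models, taking D as the deviance, k as the number of free parameters and n as the number of data points. The deviances, parameter counts, sample size and model labels are hypothetical numbers chosen to show how the criteria can disagree; they are not values from the chapter's lexical-decision example.

```python
import math

def aic(deviance, k):
    """Akaike information criterion: misfit plus 2 per free parameter."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """Bayesian information criterion: misfit plus log(n) per free parameter."""
    return deviance + k * math.log(n)

# Hypothetical fits of two nested variants to the same n_obs data points.
n_obs = 1000
candidates = {
    "rates fixed across emphasis": {"deviance": 2010.0, "k": 5},
    "rates vary with emphasis": {"deviance": 2000.0, "k": 8},
}

aic_scores = {name: aic(m["deviance"], m["k"]) for name, m in candidates.items()}
bic_scores = {name: bic(m["deviance"], m["k"], n_obs) for name, m in candidates.items()}

# Lower is better for both criteria.
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

for name in candidates:
    print(f"{name}: AIC={aic_scores[name]:.2f}, BIC={bic_scores[name]:.2f}")
print("AIC prefers:", best_aic)
print("BIC prefers:", best_bic)
```

With these made-up numbers the extra three parameters buy a deviance reduction of 10, which is enough to offset the AIC penalty of 6 but not the BIC penalty of roughly 20.7, so the two criteria pick different models.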

Transparency and tolerable misfit. Beyond the five checks, the chapter stresses transparency — sharing model code and data so results can be replicated, tested, and extended (PDF pp. 19–20, orig. pp. 43–44). On absolute fit it invokes Box's (1979) dictum that "all models are false but some are useful": some mis-fit can be tolerated when no better alternative exists, provided the model captures the theoretically important features of the data, in which case the parameter estimates give a more meaningful distillation than summary statistics such as mean RT alone (PDF p. 19, orig. p. 43).

Conclusion

Heathcote, Brown and Wagenmakers (2015) present the five checks as a routine suite rather than a one-off validation, and as especially necessary for models that are new or untested. Their overall position is that a model earns trust not from a single fit statistic but from passing a sequence of checks — recoverable parameters, tested selective influence, quantified uncertainty, displayed fit, and justified selection — backed by transparent code and data. The takeaway mirrors Box: the goal is not a true model but a useful one whose parameters yield a reliable connection from observed behaviour to unobserved psychological process.

References

Box, G.E.P. (1979) 'Robustness in the strategy of scientific model building', in Launer, R.L. & Wilkinson, G.N. (eds.) Robustness in Statistics. New York: Academic Press, pp. 201–236. To be validated.

Brown, S.D. & Heathcote, A. (2008) 'The simplest complete model of choice response time: Linear ballistic accumulation', Cognitive Psychology, 57(3), pp. 153–178. To be validated.

Heathcote, A., Brown, S.D. & Wagenmakers, E.-J. (2015) 'An Introduction to Good Practices in Cognitive Modeling', in Forstmann, B.U. & Wagenmakers, E.-J. (eds.) An Introduction to Model-Based Cognitive Neuroscience. New York: Springer, pp. 25–48. doi: 10.1007/978-1-4939-2236-9_2. heathcote2015goodpractices

Ratcliff, R. (1978) 'A theory of memory retrieval', Psychological Review, 85(2), pp. 59–108. To be validated.

Open Questions

  • The five checks were developed for response-time and choice data. Which of them transfer directly to gaze-derived measures (fixation durations, scanpath features, pupil time series), and which need adaptation given the autocorrelated, non-independent structure of gaze data?
  • Evidence accumulation models assume trials are independently distributed, yet sequential effects can be strong (PDF p. 19, orig. p. 43). Do gaze-based cognitive measures face the same independence violation across successive fixations?