title: Fixation and Saccade Detection
status: established
last_updated: 2026-05-31
sources: Salvucci Goldberg 2000 Identifying Fixations Saccades, Birawo 2022 Eye Movement Event Detection Review, Zemblys 2019 Gazenet Deep Learning
tags: [eye-tracking, event-detection, fixation, saccade, algorithms, machine-learning, deep-learning, methodology]

Fixation and Saccade Detection

Summary

Fixation and saccade detection, the classification of raw gaze samples into discrete eye-movement events, is a prerequisite for all downstream eye-tracking analysis. The field progressed from hand-crafted threshold algorithms (Salvucci & Goldberg, 2000) through machine-learning classifiers (Zemblys et al., 2018) to end-to-end deep neural networks (Zemblys et al., 2019). A controlled comparison on a common dataset (Birawo & Kasprowski, 2022) found that the learned methods outperform threshold-based ones, whose results remain heavily dependent on parameter choice.

Body

Context

This article draws on three sources that trace event detection from hand-crafted thresholds to deep learning: Salvucci and Goldberg's (2000) foundational comparison of classical fixation-identification algorithms, Zemblys, Niehorster and Holmqvist's (2019) end-to-end deep-learning detector (gazeNet), and Birawo and Kasprowski's (2022) evaluation comparing threshold, machine-learning, and deep-learning algorithms on a common dataset. The machine-learning predecessor Zemblys et al. (2018) is cited from the gazeNet paper rather than compiled directly (no PDF in RAW). Each source addresses the same task: classifying raw gaze samples into fixations, saccades, and related events. The set sits in this knowledge base as the methodological foundation for spatial-gaze analysis, upstream of the metrics used in applied work such as Eye Tracking In Surgery, and it complements the cognitive-load strand in Pupil Dilation Cognitive Load and the basic-vision account in Fixational Eye Movements.

Key Points

Salvucci and Goldberg (2000) established the foundational taxonomy, categorising fixation-identification algorithms along two dimensions: the type of gaze information used (velocity, dispersion, or area-of-interest) and the temporal criteria applied (duration thresholds and local adaptivity) (PDF p. 2, orig. p. 72). They described and compared five representative algorithms — I-VT (velocity threshold), I-DT (dispersion threshold), I-HMM (hidden Markov model), I-MST (minimum spanning tree), and I-AOI (area of interest) (PDF pp. 3–5, orig. pp. 73–75). Comparing the methods on qualitative characteristics — accuracy, speed, robustness, ease of implementation, and parameter count — they found I-VT efficient but prone to noise-induced "blips" when velocities hover near threshold, I-DT linear-time and robust but the most sensitive to its two interdependent parameters, and I-HMM more robust than I-VT through its probabilistic use of sequential information (PDF pp. 6–7, orig. pp. 76–77). They stressed that identification is inherently a subjective, statistical description with no single ground truth, so the choice of algorithm can dramatically affect the resulting fixations (PDF pp. 1, 8, orig. pp. 71, 78).
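As a rough illustration of the two threshold families described above, the following is a minimal Python sketch of I-VT and I-DT. It is not code from the paper. The sample format `(t_seconds, x_deg, y_deg)`, the helper names, the demo thresholds (100 deg/s, 1 deg, 50 ms), and the synthetic trace are all assumptions chosen for readability. Dispersion is taken as (max x - min x) + (max y - min y).

```python
from math import hypot


def _summarise(group):
    """Collapse a run of (t, x, y) samples into (start_t, end_t, mean_x, mean_y)."""
    xs = [x for _, x, _ in group]
    ys = [y for _, _, y in group]
    return (group[0][0], group[-1][0], sum(xs) / len(xs), sum(ys) / len(ys))


def _dispersion(window):
    """Dispersion of a window: (max x - min x) + (max y - min y)."""
    xs = [x for _, x, _ in window]
    ys = [y for _, _, y in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def ivt(samples, velocity_threshold):
    """I-VT sketch: label samples by point-to-point velocity, then merge runs."""
    if len(samples) < 2:
        return []
    velocities = [
        hypot(x1 - x0, y1 - y0) / (t1 - t0)
        for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:])
    ]
    # The first sample has no predecessor, so it reuses the first velocity.
    velocities.insert(0, velocities[0])
    fixations, group = [], []
    for sample, velocity in zip(samples, velocities):
        if velocity < velocity_threshold:
            group.append(sample)          # below threshold: fixation sample
        elif group:
            fixations.append(_summarise(group))
            group = []                    # at or above threshold: saccade sample
    if group:
        fixations.append(_summarise(group))
    return fixations


def idt(samples, dispersion_threshold, min_duration):
    """I-DT sketch: grow a window while its dispersion stays under threshold."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Extend the initial window until it spans the minimum duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break                         # not enough data left for a window
        if _dispersion(samples[i:j + 1]) <= dispersion_threshold:
            # Add samples for as long as dispersion stays under the threshold.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= dispersion_threshold:
                j += 1
            fixations.append(_summarise(samples[i:j + 1]))
            i = j + 1
        else:
            i += 1                        # drop the first sample and retry
    return fixations


# Synthetic 100 Hz trace: fixation at x=0, a short saccade, fixation at x=10.
xs = [0.0] * 10 + [2.5, 5.0, 7.5] + [10.0] * 10
demo = [(k / 100, x, 0.0) for k, x in enumerate(xs)]
print(ivt(demo, velocity_threshold=100.0))
print(idt(demo, dispersion_threshold=1.0, min_duration=0.05))
```

On this toy trace both functions return two fixations, centred at x = 0 and x = 10, but they place the onset of the second fixation one sample apart. That small difference illustrates how the choice of algorithm and parameters shapes the resulting events.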

Zemblys, Niehorster and Holmqvist (2019) introduced gazeNet, an end-to-end deep network that takes raw eye-tracking data as input and classifies it into fixations, saccades, and post-saccadic oscillations (PSO) without hand-crafted signal features or user-set thresholds (PDF pp. 1–2, orig. pp. 840–841). The architecture is two convolutional layers followed by three bidirectional recurrent layers and a fully connected layer, trained with a weighted cross-entropy loss; smooth pursuit was deliberately excluded, since the velocity-based training data (Lund2013) had pursuit trials removed as out of scope (PDF p. 12, orig. p. 851). Evaluated on two further datasets — GazeCom and humanFixationClassification — gazeNet generalised well, reaching sample-level F1 scores around 0.8–0.9 and approaching human-coder agreement, and its end-to-end design removes the preprocessing and manual tuning that classical detectors require (PDF pp. 16–19, orig. pp. 855–858).
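The sample-level F1 scores mentioned above can be computed per event class by comparing a detector's labels with a reference labelling one sample at a time. The sketch below shows that calculation only. It is not gazeNet's evaluation code, and the single-letter labels (`F`, `S`, `P` for fixation, saccade, PSO) and the two toy sequences are hypothetical.

```python
def sample_level_f1(reference, predicted, label):
    """Per-class F1 over aligned label sequences, one label per gaze sample."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    true_pos = sum(r == label and p == label for r, p in pairs)
    false_pos = sum(r != label and p == label for r, p in pairs)
    false_neg = sum(r == label and p != label for r, p in pairs)
    if true_pos == 0:
        return 0.0  # no correct samples for this class, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference (e.g. a human coder) and detector output.
reference = "FFFFSSPPFF"
predicted = "FFFSSSPFFF"
for event in "FSP":
    print(event, round(sample_level_f1(reference, predicted, event), 3))
```

Reporting F1 per class rather than overall accuracy matters here because fixation samples usually far outnumber saccade and PSO samples, so a single pooled figure would mostly reflect fixation performance.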

The machine-learning predecessor (Zemblys et al., 2018), cited within gazeNet, used a random forest on hand-crafted features and is reported there as performing well and remaining stable down to 200 Hz sampling (PDF p. 2, orig. p. 841).

Birawo and Kasprowski (2022) compared four algorithms drawn from three families: I-VT and I-DT (threshold), a random forest (machine learning), and a CNN (deep learning). All four were run on a single dataset recorded with an SMI HiSpeed 1250 system and assessed by sample-by-sample comparison and agreement with human coders, restricted to fixation, saccade, and PSO classification (PDF p. 1). All methods performed well for fixations and saccades, but their classifications differed substantially, with the largest divergence on PSO (PDF p. 6). The random forest and CNN outperformed the threshold-based I-VT and I-DT across all performance metrics and supported multi-class classification. Threshold values critically affected I-VT and I-DT results, which makes an optimum threshold hard to find (PDF p. 16). Smooth pursuit was not considered, because velocity alone cannot separate it from fixation (PDF p. 16).
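One common way to quantify sample-by-sample agreement between a detector and a human coder is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration under that assumption. The statistics actually reported by Birawo and Kasprowski are not reproduced here, and the label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two aligned label sequences."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(coder_a)
    # Observed agreement: share of samples given the same label by both.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: product of each coder's label frequencies, summed.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical human labels versus detector labels (F, S, P per sample).
human = "FFFFSSPPFF"
detector = "FFFSSSPFFF"
print(round(cohens_kappa(human, detector), 3))
```

Here the two sequences agree on 8 of 10 samples, but because fixation labels dominate both, the chance-corrected figure is noticeably lower than the raw 0.8 agreement.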

Conclusion

The three sources are complementary rather than competing, and read as a progression. Salvucci and Goldberg (2000) defined the problem, showing that identification is subjective with no single ground truth and that the choice of algorithm changes the resulting fixations. Zemblys et al. (2019) removed manual feature design and thresholding with an end-to-end network that generalises across datasets at near-human agreement. Birawo and Kasprowski's (2022) controlled comparison confirms the trajectory: on a common dataset the learned methods (random forest, CNN) outperform the threshold-based I-VT and I-DT across all metrics, and threshold-based methods remain heavily dependent on parameter choice. Taken together, the sources suggest that learned detectors lead on accuracy, while threshold methods stay attractive only where transparency, speed, or the absence of labelled training data rules out a trained model.

References

Birawo, B. & Kasprowski, P. (2022) 'Review and evaluation of eye movement event detection algorithms', Sensors, 22(22), p. 8810. doi: 10.3390/s22228810. birawo2022eventdetection

Salvucci, D. D. & Goldberg, J. H. (2000) 'Identifying fixations and saccades in eye-tracking protocols', in Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, pp. 71–78. ACM. doi: 10.1145/355017.355028. salvucci2000fixations

Zemblys, R., Niehorster, D. C. & Holmqvist, K. (2019) 'gazeNet: End-to-end eye-movement event detection with deep neural networks', Behavior Research Methods, 51(2), pp. 840–864. doi: 10.3758/s13428-018-1133-5. zemblys2019gazenet

Open Questions

Does gazeNet generalise to VR/HMD gaze data recorded at lower spatial resolution than desktop trackers?
What is the effect of different event detection algorithms on workload-sensitive metrics such as fixation duration variability in operational settings?
An earlier draft attributed to Birawo and Kasprowski (2022) several claims not in the source: evaluation of "12 algorithms across four benchmark datasets," scoring of "onset-latency error and noise robustness," and a finding that "the Nyström and Holmqvist (2010) adaptive velocity threshold" was the best classical algorithm. The paper actually compares four algorithms (I-VT, I-DT, random forest, CNN) on one SMI HiSpeed 1250 dataset by sample-by-sample agreement, and does not assess any Nyström–Holmqvist 2010 method. These claims were removed; if a Nyström–Holmqvist comparison is wanted, it needs its own source.
An earlier draft described gazeNet (Zemblys et al., 2019) as classifying smooth pursuit with a CNN-plus-residual / softmax architecture evaluated on "Lund 2013." The paper excludes smooth pursuit, uses two convolutional plus three bidirectional recurrent layers (no residual/softmax terms), and uses Lund2013 only as training data, evaluating generalisation on GazeCom and humanFixationClassification. Corrected above.
The Salvucci and Goldberg (2000) comparison is qualitative on a single equation-solving sample protocol, not a quantitative test on reading/visual-search data; an earlier draft's claim of substantial measured disagreement in "number, duration, and location" across such data was not supported and has been removed.