Appearance-Based Gaze Estimation


title: Appearance-Based Gaze Estimation
status: established
last_updated: 2026-05-31
sources: Zhang 2015 Mpiigaze, Krafka 2016 Gazecapture
tags: [gaze-estimation, deep-learning, dataset, MPIIGaze, GazeCapture, mobile, appearance-based, computer-vision, benchmark]


Summary

Appearance-based gaze estimation predicts gaze direction directly from eye or face images using learned models, without requiring dedicated eye-tracking hardware beyond a standard camera. Zhang et al. (2015) introduced MPIIGaze, a large-scale in-the-wild dataset (213,659 images) for this task, on which a LeNet-based CNN reached 6.3° mean angular error under person-independent evaluation. Krafka et al. (2016) extended the paradigm to mobile devices with GazeCapture (almost 2.5 million frames, 1,474 participants), demonstrating feasibility at scale through crowdsourced data collection.

Body

Context

This article draws on two CVPR papers that established appearance-based gaze estimation as a benchmarked task: Zhang et al. (2015), who introduced the MPIIGaze in-the-wild dataset and a person-independent (leave-one-person-out) evaluation, and Krafka et al. (2016), who extended the paradigm to mobile devices at crowdsourced scale with GazeCapture. Both predict gaze from eye or face images using learned models rather than dedicated eye-tracking hardware. Within this knowledge base, the article covers the computer-vision route to gaze data, an alternative to the hardware trackers assumed elsewhere. It connects to Fixation Saccade Detection, which requires a high-precision signal that these methods do not yet match, and to Gaze Based Hci And Usability, where camera-only estimation would lower the barrier to gaze interaction.

Key Points

Zhang et al. (2015) frame the appearance-based paradigm against model-based estimation, which relies on observations of specific geometric eye features (corneal reflection, eye shape) and typically assumes accurate 3D head pose as input, a strong assumption in unconstrained settings (PDF pp. 1–2). Appearance-based methods instead learn a mapping from an eye image to a gaze vector and can work with low-resolution eye images, trading geometric precision for generality across cameras and conditions (PDF p. 2).
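Because the model output is a gaze vector, estimates of this kind are scored by the angle between the predicted and ground-truth vectors. The following is a minimal sketch of that metric, assuming gaze is given as 3D direction vectors in NumPy arrays; the function name and the sample vectors are illustrative and are not taken from either paper.

```python
import numpy as np

def angular_error_deg(pred, true):
    """Angle in degrees between predicted and ground-truth 3D gaze vectors.

    Both inputs have shape (N, 3); they do not need to be unit length.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    # Clip guards against rounding pushing the cosine outside [-1, 1].
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Illustrative vectors only, not data from either paper.
true = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, -1.0]])
pred = np.array([[0.0, 0.0, -2.0], [1.0, 0.0, -1.0]])
errors = angular_error_deg(pred, true)
mean_error = errors.mean()  # mean angular error over the samples
print(errors, mean_error)
```

The first pair differs only in length, so its error is zero; the second pair is 45° apart. Averaging per-sample errors in this way gives a mean angular error, the quantity reported in degrees above.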

Zhang et al. (2015) built MPIIGaze by logging frames from 15 participants over more than three months of natural everyday laptop use, producing 213,659 images across diverse illumination, head poses, and eye appearances. Ground truth came from on-screen fixation targets: shrinking circles with a central dot that participants fixated and confirmed by keypress (PDF pp. 1, 4). In a within-dataset leave-one-person-out (person-independent) evaluation, a LeNet-based CNN gave 6.3° mean angular error, the best result among the evaluated methods (PDF pp. 1, 8).
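The leave-one-person-out idea can be sketched as a split generator: each participant is held out once for testing while the model is trained on everyone else, so no test subject is ever seen during training. This is a minimal illustration, assuming each sample carries a participant label; the function name and the toy labels are hypothetical, and it is not the authors' evaluation code.

```python
import numpy as np

def leave_one_person_out(person_ids):
    """Yield (held_out_person, train_idx, test_idx) once per participant."""
    person_ids = np.asarray(person_ids)
    for person in np.unique(person_ids):
        test_mask = person_ids == person
        # Train on every other participant, test on the held-out one.
        yield person, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical toy labels: 6 samples from 3 participants.
person_ids = ["p00", "p00", "p01", "p01", "p02", "p02"]
folds = list(leave_one_person_out(person_ids))
for person, train_idx, test_idx in folds:
    print(person, train_idx.tolist(), test_idx.tolist())
```

With 15 participants, as in MPIIGaze, a generator like this would produce 15 folds. A model would be fitted on `train_idx` and scored on `test_idx` in each fold; how the per-fold scores are aggregated is left open here.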

Krafka et al. (2016) demonstrated mobile feasibility at scale with GazeCapture: almost 2.5 million frames (2,445,504) from 1,474 participants collected via an iOS app, the first large-scale crowdsourced eye-tracking dataset (PDF p. 1, orig. p. 2176). Their iTracker CNN takes the left-eye, right-eye, and face images plus a face grid encoding head location, and predicts the 2D on-screen gaze point in centimetres (PDF p. 5, orig. p. 2180). Without calibration, it achieved 1.71 cm error on phones and 2.53 cm on tablets; with the full set of 13 calibration points, these improved to 1.34 cm and 2.12 cm (PDF pp. 1, 7, orig. pp. 2176, 2182).
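Two pieces of this pipeline lend themselves to a short sketch: a face grid that encodes where the face sits in the frame, and an error measured in centimetres between predicted and true gaze points. The code below is an illustration under stated assumptions, not the iTracker implementation: the grid is taken to be a binary mask, the 25 × 25 resolution is an assumed default, the error is taken to be Euclidean distance, and all function names and sample values are hypothetical.

```python
import numpy as np

def face_grid(bbox, frame_size, grid_size=25):
    """Binary mask marking the grid cells covered by the face bounding box.

    bbox is (x, y, width, height) in pixels; frame_size is (width, height).
    grid_size=25 is an assumed resolution, not a value stated in this article.
    """
    x, y, w, h = bbox
    frame_w, frame_h = frame_size
    # Map pixel edges to grid cells, clamping the box to the frame.
    x0 = int(np.floor(max(x, 0) / frame_w * grid_size))
    y0 = int(np.floor(max(y, 0) / frame_h * grid_size))
    x1 = int(np.ceil(min(x + w, frame_w) / frame_w * grid_size))
    y1 = int(np.ceil(min(y + h, frame_h) / frame_h * grid_size))
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    grid[y0:y1, x0:x1] = 1
    return grid

def mean_error_cm(pred_xy, true_xy):
    """Mean Euclidean distance between predicted and true gaze points, in cm."""
    diff = np.asarray(pred_xy, dtype=float) - np.asarray(true_xy, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())

# Hypothetical 640x480 frame with a centred face box.
grid = face_grid(bbox=(160, 120, 320, 240), frame_size=(640, 480))
covered_cells = int(grid.sum())

# Hypothetical gaze points in centimetres, not results from the paper.
err = mean_error_cm([[1.0, 2.0], [0.0, 0.0]], [[4.0, 6.0], [0.0, 0.0]])
print(covered_cells, err)
```

A mask like this gives the network a coarse cue about head position and scale without a separate head-pose estimator, which is the role the face grid plays alongside the eye and face crops.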

On limitations, both papers note the gap between learned estimation and dedicated hardware, and that appearance-based accuracy depends on covering a wide enough range of head poses and conditions in training; cross-dataset degradation (the domain gap) follows from differences in camera, illumination, head-pose range, and demographics. Later work such as ETH-XGaze (Zhang et al., 2020) added extreme-pose coverage, but calibration-free angular accuracy in this family remains well above that of hardware trackers [synthesis across Zhang et al. (2015), Krafka et al. (2016), and Zhang et al. (2020)].

Conclusion

The two papers are complementary rather than competing: Zhang et al. (2015) established the in-the-wild dataset and the person-independent (leave-one-person-out) evaluation that defines how generalisation is measured, and Krafka et al. (2016) extended the same paradigm to mobile devices and showed it scales through crowdsourcing. They agree on the central trade-off (generality and accessibility in exchange for precision) and on the shared difficulty of generalising across cameras, illumination, head pose, and demographics. The unresolved point is accuracy: calibration-free estimation in this family remains well short of hardware trackers, so for fixation-level precision, appearance-based methods are not yet a standalone alternative.

References

Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W. & Torralba, A. (2016) 'Eye tracking for everyone', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 2176–2184. doi: 10.1109/CVPR.2016.239. krafka2016gazecapture

Zhang, X., Sugano, Y., Fritz, M. & Bulling, A. (2015) 'Appearance-based gaze estimation in the wild', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 4511–4520. doi: 10.1109/CVPR.2015.7299081. zhang2015mpiigaze

Open Questions

  • At what angular accuracy does appearance-based gaze estimation become sufficient for fixation detection in usability research?
  • Can device-specific calibration be replaced by a universal calibration model trained on diverse hardware?
  • An earlier draft contained several figures not verifiable in the sources, now corrected: Krafka et al.'s (2016) calibrated tablet error is 2.12 cm, not 1.78 cm, and the calibration uses 13 points, not five; Zhang et al.'s (2015) ground truth comes from on-screen fixation targets, not a "synchronised remote tracker," and the paper reports a person-independent leave-one-person-out evaluation rather than introducing a named "cross-person protocol." The claim that MPIIGaze "covers only about ±20° yaw" is not stated in the paper and was removed.
  • The specific accuracy figures for the limitations comparison ("above 3°" for calibration-free, "0.4–0.5° for hardware trackers") were not in either compiled source and have been replaced with a sourced qualitative statement; a dedicated source is needed if exact numbers are wanted.