Gaze Interaction in Extended Reality

Status: established
Last updated: 2026-05-31
Sources: 3491207.Pdf
Tags: [eye-tracking, gaze-interaction, XR, AR, VR, mixed-reality, HMD, Midas-touch, dwell-time, foveated-rendering, collaboration, survey]

Summary

Plopski et al. (2022) survey gaze interaction and eye tracking in head-worn extended reality (XR), reviewing 215 publications from 1985 to 2020. They organise the field into three categories: explicit eye input, where gaze intentionally selects and manipulates content; implicit (adaptive and attentive) interfaces, where the system reacts to gaze without the user issuing a command; and collaboration, where gaze functions as a social cue between users, avatars, and agents. Their central finding is that gaze in XR is at an early stage across all three areas: explicit input is the best explored, adaptive interfaces remain simple prototypes, and most work lacks comparative studies and longitudinal evaluation.

Body

Context

This article draws on a single source: Plopski et al.'s (2022) ACM Computing Surveys literature review of gaze interaction and eye tracking in head-worn XR, which screened 1,278 Scopus records down to a final corpus of 215 papers classified into three interaction categories. Within this knowledge base it is the XR-specific extension of the human-computer-interaction strand in Gaze Based Hci And Usability; that article cites the survey from its digest, whereas this article compiles it in full. The survey builds on the involuntary-movement problem detailed in Fixational Eye Movements (the Midas touch is its interaction-design consequence). Its rendering and selection techniques depend on the kind of real-time camera-based estimation examined in Appearance Based Gaze Estimation, and the cognitive-load context sources it discusses for adaptive interfaces connect to Pupil Dilation Cognitive Load.

Key Points

Scope and method. The survey reviews gaze interaction and eye tracking research for head-mounted displays published since 1985, totalling 215 publications (PDF p. 1, orig. p. 53). A Scopus search on XR and eye-tracking index terms returned 1,278 papers from 1985 to May 2020; after removing 331 papers that mentioned the keywords without using XR or eye tracking, one for plagiarism, and 90 that could not be accessed, an 856-paper corpus was classified, then narrowed to a final 215 (PDF pp. 3–5, orig. pp. 55–57). Of the 215, 99 used eye tracking for explicit input, 53 presented implicit user interfaces, and 63 focused on collaborative gaze interaction (PDF p. 5, orig. p. 57). The classification followed Majaranta and Bulling's (2014) eye-tracking application continuum, organised around intentional versus unintentional interaction and online versus offline responsiveness (PDF p. 2, orig. p. 54). The authors frame the review around three questions: the main categories of gaze interaction and eye tracking for XR, the sub-categories that attracted more attention, and emerging future directions (PDF p. 3, orig. p. 55).
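The screening arithmetic above can be checked in a few lines. This is only a bookkeeping sketch: the counts are the ones quoted in this article, and the variable names and dictionary labels are illustrative choices, not terms from the survey.

```python
# Screening counts as quoted in this article's summary of the survey.
retrieved = 1278  # Scopus records, 1985 to May 2020
excluded = {
    "keywords only, no XR or eye tracking": 331,
    "plagiarism": 1,
    "not accessible": 90,
}
classified = retrieved - sum(excluded.values())

# Final corpus, split by interaction category.
final_corpus = {
    "explicit input": 99,
    "implicit interfaces": 53,
    "collaboration": 63,
}
final_total = sum(final_corpus.values())

print(classified)   # 856
print(final_total)  # 215
for name, count in final_corpus.items():
    # Share of the final corpus taken by each category.
    print(f"{name}: {count / final_total:.1%}")
```

The three shares come out at roughly 46%, 25%, and 29%, which matches the survey's observation that explicit input is the most studied category.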

Explicit eye input. Gaze is identified as a natural means of interaction because humans look at what they are attending to or planning to attend to (PDF p. 6, orig. p. 58). The recurring obstacle is the Midas touch problem, described by Jacob (1990): because the eyes are always looking somewhere, unintentional gaze can trigger unwanted interactions such as selecting a button merely glanced at (PDF p. 6, orig. p. 58). Dwell time, which means holding the gaze on a target for a set duration to trigger input, is the most common eye-only solution, but it can cause fatigue and slow interaction (PDF pp. 7–9, orig. pp. 59–61). Alternatives to dwell time include smooth-pursuit selection, which is robust to target size and whose performance improves with movement radius (Khamis et al.), half-blink-plus-gaze input (Lee et al.), and eye-gesture interfaces (PDF p. 9, orig. p. 61). Comparative studies of eye-only versus head-based input produced inconsistent findings: Blattgerste et al. found gaze more accurate than head input, while Kytö et al. found the opposite (PDF pp. 7–8, 22, orig. pp. 59–60, 74). Multimodal combinations of gaze with traditional input, speech, gestures, head rotation, or BCI create richer interaction; for example, Pfeuffer et al.'s Gaze + Pinch uses gaze to indicate the object and a pinch to manipulate it (PDF pp. 9–11, orig. pp. 61–63).
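Dwell-time selection as described above can be sketched as a small state machine. This is a minimal illustration and not an implementation from the survey: the class name `DwellSelector`, the 0.8 s threshold, and the string target ids are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellSelector:
    """Fires a selection once gaze has rested on one target long enough."""
    dwell_s: float = 0.8           # hypothetical threshold, not a value from the survey
    _target: Optional[str] = None  # target currently under the gaze
    _since: float = 0.0            # time at which the gaze arrived on that target
    _fired: bool = False           # True once the current dwell has already triggered

    def update(self, target: Optional[str], t: float) -> Optional[str]:
        """Feed one gaze sample; return the target id when a selection fires."""
        if target != self._target:
            # Gaze moved to another target (or to empty space): restart the timer.
            self._target, self._since, self._fired = target, t, False
            return None
        if target is not None and not self._fired and t - self._since >= self.dwell_s:
            self._fired = True     # fire once per dwell, not on every later sample
            return target
        return None


# Illustrative gaze trace: a brief glance at "ok", then a sustained look at "cancel".
selector = DwellSelector(dwell_s=0.8)
samples = [(0.0, "ok"), (0.3, "cancel"), (0.6, "cancel"),
           (1.2, "cancel"), (1.5, "cancel"), (1.6, None)]
events = [selector.update(target, t) for t, target in samples]
print(events)  # [None, None, None, 'cancel', None, None]
```

The brief glance at "ok" never triggers a selection, which is how dwell time addresses the Midas touch. The cost is also visible: every intended selection waits out the full threshold, which is the slowness the survey notes.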

Implicit, adaptive, and attentive interfaces. A second category uses real-time gaze as a context source so the system adapts implicitly rather than on explicit command (PDF p. 11, orig. p. 63). Adaptive interfaces are those that "remain well designed even as the world changes," and attentive interfaces, a sub-genre, are sensitive to the user's attention in order to minimise disruption (PDF p. 11, orig. p. 63). Applications include view management (placing labels and overlays relative to the user's gaze and focus distance), guiding attention through subtle brightness modulation, and rendering. Foveated rendering exploits the small high-acuity region of about 5° around the gaze centre, rendering only that portion at full resolution; tolerable end-to-end latency is reported at roughly 50–70 ms (PDF p. 15, orig. p. 67). Gaze has also been used to predict the onset of cybersickness and to drive saccade-contingent redirected walking, which exploits blindness to scene changes during saccades: Bolte and Lappe found that rotations of up to 5° and translations of up to 0.5 m went unnoticed during saccades, versus thresholds of 0.23° and 0.02 m during fixation (PDF p. 16, orig. p. 68). The authors judge this category the least developed: context models from gaze are still basic and only a narrow subset of possible adaptation targets has been explored (PDF pp. 11, 23, orig. pp. 63, 75).
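The saccade-contingent idea can be sketched as a gate on how much scene change is applied per update. The sketch assumes the detection thresholds quoted above act as per-update caps, and it detects saccades with a simple velocity threshold; the 180°/s value, the single-axis gaze angle, and all function names are hypothetical choices for illustration, not details from Bolte and Lappe or the survey.

```python
from dataclasses import dataclass
from typing import Tuple

# Thresholds quoted above: (rotation in degrees, translation in metres).
SACCADE_LIMIT = (5.0, 0.5)     # went unnoticed during saccades
FIXATION_LIMIT = (0.23, 0.02)  # detection thresholds during fixation

# Hypothetical velocity threshold for treating a sample as "in saccade".
SACCADE_VELOCITY_DEG_S = 180.0


@dataclass
class GazeSample:
    t: float          # timestamp in seconds
    angle_deg: float  # gaze direction along one axis, in degrees


def gaze_velocity(prev: GazeSample, cur: GazeSample) -> float:
    """Angular gaze speed between two samples, in degrees per second."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return abs(cur.angle_deg - prev.angle_deg) / dt


def redirection_budget(prev: GazeSample, cur: GazeSample) -> Tuple[float, float]:
    """Largest (rotation, translation) assumed to go unnoticed right now."""
    in_saccade = gaze_velocity(prev, cur) >= SACCADE_VELOCITY_DEG_S
    return SACCADE_LIMIT if in_saccade else FIXATION_LIMIT


def clamp_redirection(rot_deg: float, shift_m: float,
                      prev: GazeSample, cur: GazeSample) -> Tuple[float, float]:
    """Clamp a desired scene rotation and shift to the current budget."""
    max_rot, max_shift = redirection_budget(prev, cur)
    rot = max(-max_rot, min(max_rot, rot_deg))
    shift = max(-max_shift, min(max_shift, shift_m))
    return rot, shift


# Illustrative samples: slow drift (about 20 deg/s), then a fast jump (about 400 deg/s).
a = GazeSample(0.000, 10.0)
b = GazeSample(0.010, 10.2)
c = GazeSample(0.020, 14.2)

during_fixation = clamp_redirection(3.0, 0.3, a, b)
during_saccade = clamp_redirection(3.0, 0.3, b, c)
print(during_fixation)  # (0.23, 0.02)
print(during_saccade)   # (3.0, 0.3)
```

The same requested change is almost entirely suppressed during fixation and passed through during the saccade. A practical system would also need to account for eye-tracker and rendering latency, since a saccade lasts only tens of milliseconds.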

Collaboration. The third category uses gaze as a shared social cue, grounded in the eye-mind hypothesis that gaze location corresponds to immediate thought (PDF p. 17, orig. p. 69). Work falls into four directions: representing a user's eye movements on their avatar, making virtual agents' gaze react to the user, sharing tracked gaze between worker and remote helper, and augmenting natural gaze with artificial cues such as pointers and rays (PDF p. 17, orig. p. 69). Vertegaal et al. showed gaze is a strong predictor of conversational attention — the person looked at is the one being listened to (88%) or spoken to (77%) (PDF p. 17, orig. p. 69). Wolff et al. and Steptoe et al.'s EyeCVE (2008) was among the first systems to map gaze onto avatars across networked CAVEs (PDF p. 17, orig. p. 69). For remote collaboration, sharing the worker's gaze as a pointer in the helper's view improved co-presence and performance in a LEGO construction task (Gupta et al.), and bidirectional shared gaze improved collaboration further (PDF p. 20, orig. p. 72).

Future directions and limitations. The authors expect base technologies (eye trackers, gaze estimation, display integration) to keep improving, making eye tracking ubiquitous in head-worn displays over the coming decade (PDF pp. 21–22, orig. pp. 73–74). They note a pervasive lack of a common evaluation baseline — tasks and metrics differ across studies, so comparative results conflict — and call for shared tasks and metrics (PDF p. 22, orig. p. 74). They also found almost no longitudinal studies, attributing this to the past scarcity of eye-tracking-equipped HMDs, and flag visual discomfort such as the vergence-accommodation conflict as an open concern (PDF pp. 22–23, orig. pp. 74–75). The review deliberately excludes gaze-based user modelling and passive eye monitoring from Majaranta and Bulling's (2014) continuum, as well as the privacy and security of gaze data, which the authors single out as a growing concern in consumer XR (PDF p. 25, orig. p. 77).

Conclusion

The survey concludes that eye gaze is increasingly incorporated into XR but that all three identified areas are still early: explicit input is the best explored, adaptive and attentive interfaces are the least developed, and collaboration shows clear benefits for tracked avatar gaze, which the authors expect to become standard (PDF p. 24, orig. p. 76). Across categories the authors identify the same two structural weaknesses: gaze is often added to prototype systems without comparative evaluation against other modalities, and where comparisons do exist, studies sometimes report contradictory results with no consensus (PDF p. 24, orig. p. 76). They position the field as one of recovered momentum, with older concepts being rediscovered as eye-tracking hardware becomes accessible, and call for standardised evaluation and longitudinal study to mature it.

References

Bolte, B. & Lappe, M. (2015) 'Subliminal reorientation and repositioning in immersive virtual environments using saccadic suppression', IEEE Transactions on Visualization and Computer Graphics, 21(4), pp. 545–552. doi: 10.1109/TVCG.2015.2391851. To be validated. plopski2022xr

Jacob, R. J. K. (1990) 'What you look at is what you get: Eye movement-based interaction techniques', in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 11–18. doi: 10.1145/97243.97246. To be validated. plopski2022xr

Majaranta, P. & Bulling, A. (2014) 'Eye tracking and eye-based human–computer interaction', in Fairclough, S. H. & Gilleade, K. (eds.) Advances in Physiological Computing. London: Springer, pp. 39–65. doi: 10.1007/978-1-4471-6392-3_3. To be validated. plopski2022xr

Plopski, A., Hirzle, T., Norouzi, N., Qian, L., Bruder, G. & Langlotz, T. (2022) 'The eye in extended reality: A survey on gaze interaction and eye tracking in head-worn extended reality', ACM Computing Surveys, 55(3), pp. 1–39. doi: 10.1145/3491207. plopski2022xr

Vertegaal, R., Slagter, R., van der Veer, G. & Nijholt, A. (2001) 'Eye gaze patterns in conversations: There is more to conversational agents than meets the eyes', in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 301–308. doi: 10.1145/365024.365119. To be validated. plopski2022xr

Open Questions

What common set of tasks and evaluation metrics would let XR gaze-interaction techniques be compared across studies, given the conflicting comparative results the survey reports?
How do gaze-based XR interaction techniques perform in longitudinal, out-of-lab use, where almost no studies currently exist?
How should always-on gaze data in consumer XR be governed, given its encoding of identity, cognitive state, and health information?