Agentic Search and Agent Harnesses
Status: emerging
Last updated: 2026-06-09
Sources: 2605.15184V1
Tags: [agentic-search, retrieval-augmented-generation, lexical-search, semantic-search, grep, agent-harness, tool-calling, context-engineering, llm-evaluation, machine-learning]
Summary
Sen et al. (2026) ask whether lexical search (grep) or semantic vector search is the better retrieval strategy for tool-equipped LLM agents, and find that the question cannot be answered without specifying the agent harness and the way tool results are delivered. Across a 116-question subset of the LongMemEval long-memory benchmark, inline grep beats inline vector retrieval for every harness–model pair tested, but routing results to files instead of injecting them into the context inverts or erases that advantage, and moving the same model between a custom harness (Chronos) and a provider-native CLI (Claude Code, Codex, Gemini CLI) shifts accuracy by margins comparable to swapping the retriever itself. A second experiment adds irrelevant conversation history as noise and shows both retrievers degrade only mildly, with the grep–vector crossover depending on harness and backbone rather than corpus size. The paper's argument is that retrieval mechanics, harness orchestration, and result-delivery path form a single jointly-evaluated system, not three independent design choices.
Body
Context
Sen et al. (2026), a team at PricewaterhouseCoopers, report an empirical study of how retrieval strategy interacts with agent architecture in end-to-end agentic workflows. The authors argue that prior information-retrieval work has under-examined this interaction because it benchmarks retrievers in fixed, non-agentic pipelines (PDF p. 1, printed p. 1). The study crosses three design dimensions: retrieval strategy (lexical grep vs. semantic vector), agent harness (a custom LangChain harness called Chronos vs. three provider-native CLI agents), and tool-calling delivery (inline tool messages vs. file-based "programmatic" delivery the agent must read). Within this knowledge base the article opens an agentic-AI and machine-learning-retrieval strand distinct from the existing autonomous-vehicle material. It connects to Human Centered Design Of Ai on how AI systems are built and evaluated, and to Ironies Of Automation in treating the surrounding system (here the harness rather than the human) as the determinant of performance.
Key Points
The study separates two design dimensions that are usually conflated. The retrieval strategy is the matching method: lexical search (grep, BM25, regex) performs exact or pattern-based matching over raw text and needs no embedding model or vector index, while semantic search embeds queries and documents into a shared space and retrieves approximate nearest neighbours, optionally reranked (PDF p. 2, printed p. 2). The agent harness is the environment layer that builds the prompt, dispatches tool calls, and decides when to stop. Custom harnesses (built on agent frameworks or SDKs, typically following the ReAct pattern) give fine-grained control over the prompt, tool definitions, result formatting, and context management; provider-native CLI harnesses embed tool calling in a shell where the model runs bash utilities such as grep, find, and cat directly, trading control for low setup cost and provider-optimised context engineering (PDF pp. 2–3, printed pp. 2–3). When grep is a native shell tool, the line between "retrieval strategy" and "agent capability" blurs, because the agent composes its own search commands rather than calling a fixed API (PDF p. 3, printed p. 3).
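The contrast between the two matching methods can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the session snippets are invented, and `embed` is a hypothetical stand-in that counts character trigrams instead of calling a real embedding model.

```python
import math
import re
from collections import Counter

# Invented session snippets for illustration; not taken from LongMemEval.
SESSIONS = [
    "I booked the dentist appointment for 14 March.",
    "My sister adopted a puppy last weekend.",
    "We moved the dental check-up to next month.",
]


def lexical_search(pattern: str, docs: list[str]) -> list[int]:
    """grep-style retrieval: indices of docs whose raw text matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [i for i, doc in enumerate(docs) if regex.search(doc)]


def embed(text: str) -> Counter:
    """Hypothetical embedding: character-trigram counts (assumption, not a real model)."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query: str, docs: list[str], k: int = 2) -> list[int]:
    """Nearest-neighbour retrieval: the k docs most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, embed(docs[i])), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(lexical_search("dentist", SESSIONS))
    print(lexical_search(r"dent(ist|al)", SESSIONS))
    print(vector_search("dentist visit", SESSIONS, k=1))
```

The literal pattern `dentist` misses the "dental check-up" snippet unless the pattern is widened by hand, which is the kind of query composition an agent performs when grep is a native shell tool. The similarity ranking needs no pattern but depends entirely on the quality of the embedding.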
A third dimension, tool-calling architecture, governs how results reach the model. Inline (standard) delivery appends results directly to the conversation context, which is simple but puts pressure on the context window and invites the "context rot" the authors associate with degraded long-horizon performance. Programmatic (file-based) delivery writes results to disk and hands the model a path, decoupling result size from context pressure and enabling progressive disclosure, at the cost of an extra read step that the model must execute reliably (PDF p. 3, printed p. 3). The authors frame delivery mode as itself a context-engineering decision.
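A minimal sketch of the two delivery modes, assuming a chat-style tool message represented as a plain dictionary. The function names, the `search_results.json` filename, and the message shape are all hypothetical and are not taken from Chronos or any CLI harness.

```python
import json
import tempfile
from pathlib import Path


def deliver_inline(results: list[str]) -> dict:
    """Inline delivery: the full result text is placed in the tool message."""
    return {"role": "tool", "content": "\n".join(results)}


def deliver_programmatic(results: list[str], out_dir: Path) -> dict:
    """File-based delivery: write results to disk and return only a pointer."""
    path = out_dir / "search_results.json"  # hypothetical filename
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return {"role": "tool", "content": f"{len(results)} results written to {path}"}


def read_results(path: Path) -> list[str]:
    """The extra step the agent must carry out before it can use the evidence."""
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    hits = [f"session {i}: matched line of evidence" for i in range(50)]
    with tempfile.TemporaryDirectory() as tmp:
        inline_msg = deliver_inline(hits)
        file_msg = deliver_programmatic(hits, Path(tmp))
        # Context cost grows with the result set only in the inline case.
        print(len(inline_msg["content"]), len(file_msg["content"]))
        # The evidence is recovered only if the read step is actually performed.
        print(read_results(Path(tmp) / "search_results.json") == hits)
```

In the file-based case the message stays short however many hits there are, but nothing reaches the model unless it follows up with the read step.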
The evaluation runs on a 116-question subset of LongMemEval-S, which tests question answering over long multi-session conversations across six categories (knowledge-update, multi-session, single-session-assistant, single-session-preference, single-session-user, and temporal-reasoning); conversations are preprocessed with the Chronos pipeline so that temporal events are surfaced as structured records, and a GPT-4o auxiliary grader issues binary judgements held fixed across conditions (PDF pp. 3–4, printed pp. 3–4). Five inference models span capability levels: Claude Opus 4.6, Claude Haiku 4.5, GPT-5.4, Gemini 3.1 Pro, and Gemini 3.1 Flash-Lite (PDF p. 4, printed p. 4).
Experiment 1 holds the full per-question haystack and varies retrieval mode and delivery. With inline delivery, lexical search is uniformly stronger: inline grep exceeds inline vector for every harness–model pair, the widest margin being Chronos with Gemini 3.1 Flash-Lite (86.2% vs. 62.9%) and the narrowest Claude Code with Claude Opus 4.6 (76.7% vs. 75.0%) (PDF p. 4, printed p. 4; Table 1, PDF p. 5). The harness matters as much as the retriever: the same Claude Opus 4.6 backbone reaches 93.1% under Chronos but 76.7% under Claude Code, so changing the harness moves the ceiling by roughly as much as swapping retrievers within a fixed harness (PDF p. 4, printed p. 4). Programmatic delivery reshuffles the comparison — programmatic vector exceeds programmatic grep on five of ten harness–model pairs — and can cause sharp regressions, the starkest being Codex with GPT-5.4 falling from 93.1% under inline grep to 55.2% under programmatic grep (PDF p. 4, printed p. 4). The authors read this as evidence that file-based routing is a tool-use stress test: when the "locate, open, integrate" step is brittle, accuracy collapses independently of retrieval quality (PDF p. 5, printed p. 5).
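The comparison between retriever, harness, and delivery effects is simple arithmetic on the accuracies quoted above. The snippet below only restates those figures and computes the gaps in percentage points; the variable names are illustrative.

```python
# Accuracies (%) as quoted above from Experiment 1.
inline = {
    ("Chronos", "Gemini 3.1 Flash-Lite"): {"grep": 86.2, "vector": 62.9},
    ("Claude Code", "Claude Opus 4.6"): {"grep": 76.7, "vector": 75.0},
}
opus_inline_grep = {"Chronos": 93.1, "Claude Code": 76.7}
codex_gpt54_grep = {"inline": 93.1, "programmatic": 55.2}


def gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


# Effect of swapping the retriever within a fixed harness and model.
retriever_margins = {pair: gap(s["grep"], s["vector"]) for pair, s in inline.items()}

# Effect of swapping the harness with retriever and model held fixed.
harness_gap = gap(opus_inline_grep["Chronos"], opus_inline_grep["Claude Code"])

# Effect of swapping the delivery mode with harness, retriever, and model held fixed.
delivery_drop = gap(codex_gpt54_grep["inline"], codex_gpt54_grep["programmatic"])

if __name__ == "__main__":
    print(retriever_margins)  # widest margin 23.3, narrowest 1.7
    print(harness_gap)        # 16.4
    print(delivery_drop)      # 37.9
```

The 16.4-point harness gap falls between the narrowest (1.7) and widest (23.3) retriever margins, and the 37.9-point delivery regression is larger than either.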
Experiment 2 progressively adds distractor sessions (session limits labelled s5, s10, s20, s30, and full, where full is the 39–66-session haystack), holding delivery fixed and pairing grep-only against vector-only tables so each column reflects identical distractor exposure (PDF pp. 5–6, printed pp. 5–6). Both methods degrade only mildly as noise grows and grep leads on average (Figure 1, PDF p. 6), but accuracy is not monotone in corpus size and retriever ordering depends on the harness: Claude Code favours grep for Opus and Haiku at every configuration, Gemini CLI favours vector for Gemini 3.1 Pro throughout, and Chronos shows crossings where vector leads early and grep can overtake at larger session limits (PDF p. 6, printed p. 6; Tables 2–3, PDF p. 7). The authors caution that because distractors are resampled when the session limit changes, mid-grid peaks reflect favourable sampling interference rather than a smooth capacity law, and that some CLI rows (Codex vector intermediates, Codex grep scaling) are incomplete (PDF p. 7, printed p. 7). A per-category breakdown for Chronos grep-only at full haystack appears in Table 4 (PDF p. 9, printed p. 9).
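One way such a capped haystack could be built is sketched below. The paper's exact sampling procedure is not described here, so several details are assumptions: that evidence sessions are always retained, that distractors are drawn uniformly with a seeded generator, and that sessions are identified by integer ids whose order is chronological. The function name `build_haystack` is hypothetical.

```python
import random
from typing import Optional


def build_haystack(
    evidence: list[int],
    distractors: list[int],
    limit: Optional[int],
    seed: int,
) -> list[int]:
    """Return session ids for one condition; limit=None means the full haystack."""
    if limit is None:
        chosen = list(distractors)
    else:
        # Assumption: a fresh, seeded draw per session limit, so s5 need not
        # be a subset of s10 (distractors are resampled when the cap changes).
        rng = random.Random(f"{seed}-{limit}")
        n = max(0, min(limit - len(evidence), len(distractors)))
        chosen = rng.sample(distractors, n)
    # Assumption: ids are chronological, so sorting restores session order.
    return sorted(list(evidence) + chosen)


if __name__ == "__main__":
    evidence = [3, 17]
    distractors = [i for i in range(40) if i not in evidence]
    for label, limit in [("s5", 5), ("s10", 10), ("s20", 20), ("s30", 30), ("full", None)]:
        haystack = build_haystack(evidence, distractors, limit, seed=0)
        # The grep-only and vector-only runs would both receive this same list,
        # which is what makes a column comparable across the two tables.
        print(label, len(haystack))
```

Because each cap triggers a new draw, a particular sample can happen to contain easier or harder distractors, which is consistent with the authors' caution against reading mid-grid peaks as a smooth trend.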
Conclusion
Sen et al. (2026) conclude that retrieval strategy, harness orchestration, and delivery path must be evaluated as one system rather than as independent design choices. Their headline finding — grep generally beats vector end-to-end — is deliberately qualified: it holds for inline delivery and for the long-memory conversational QA they study, where answers are often licensed by literal spans (exact dates, counts, preferences) that exact matching recovers without an embedding bottleneck. The authors explicitly do not claim grep beats vector in general, and note that in domains where evidence is rarely literal (scientific synthesis over paraphrases, visual documents, code semantics) dense or hybrid retrieval may look different (PDF p. 7, printed p. 7). The constructive takeaway for benchmarking is that reporting only BM25 versus ANN in a static pipeline under-estimates the variance that agent scaffolding introduces; "default to vector" recommendations should be conditioned on harness, backbone strength, and whether the task rewards literal-span recovery or conceptual blending (PDF pp. 5, 8, printed pp. 5, 8).
Related
- Human Centered Design Of Ai — how AI systems are designed and evaluated with the human and the surrounding system in view; this article extends that framing to the agent harness as an evaluation variable
- Ironies Of Automation — both treat performance as a property of the whole system around the core component, not the component alone
References
Sen, S., Kasturi, A., Lumer, E., Gulati, A. and Subbiah, V.K. (2026) 'Is Grep All You Need? How Agent Harnesses Reshape Agentic Search', arXiv preprint arXiv:2605.15184 [cs.CL]. Available at: https://arxiv.org/abs/2605.15184 (Accessed: 9 June 2026).
Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W. and Yu, D. (2025) 'LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory', in Proceedings of the International Conference on Learning Representations (ICLR). To be validated.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. and Cao, Y. (2023) 'ReAct: Synergizing Reasoning and Acting in Language Models', in Proceedings of the International Conference on Learning Representations (ICLR). To be validated.
Open Questions
- The paper establishes that grep wins end-to-end on inline long-memory QA but stresses this is task-bound. Whether the lexical advantage survives on corpora where evidence is paraphrased rather than literal — scientific synthesis, code, visual documents — is left open and is the obvious next test.
- Programmatic file-based delivery is motivated as relief from context pressure yet sometimes collapses accuracy because the agent fails to close the read-integrate loop. Which model and harness properties make file-based delivery reliable, rather than a tool-use stress test, is unresolved.
- The persistent per-vendor biases (grep-favouring on Claude Code, vector-favouring on Gemini CLI for Gemini 3.1 Pro) imply that migrating between CLI stacks is not retrieval-interchangeable even on a byte-identical corpus. The mechanism — default hints, stdout chunking, tool-error surfaces — is hypothesised but not isolated through trace-level attribution.
- Incomplete Codex rows mean the paper cannot yet give a vendor-complete picture of how CLI grep ages against CLI vector under matched distractor caps.
- Cross-KB: the harness-as-evaluation-variable result connects to supervisory-control and automation-transparency themes in the remote-operations corpus, where system framing likewise determines outcomes more than the isolated component.