Google DeepMind dropped a paper and benchmark targeting long-horizon agent evaluation, which is the polite way of saying “let’s actually measure what happens when you tell an agent to do something that takes longer than a coffee break.” The benchmark scores agents on multi-day workflows. Representative tasks: onboard a new vendor end-to-end, close out a quarterly compliance review, run a multi-step research project. The tasks come with realistic tool surfaces, intermediate failures, and the context-management nightmares that short-task evals conveniently never expose.

The frontier numbers are the news, and they are not flattering. Top closed models clear roughly 30% of the harder task tier. Open models sit in the high teens. SWE-Bench-style code evals do not predict performance here, which is the structural finding nobody on demo-day stages wants to dwell on. The failure modes the paper catalogues are almost entirely about state management across long traces, not reasoning capability on any single step. The agents can think. They just forget what they were doing.

For operators planning agentic deployments, the practical takeaway is bounded scope. The category of work where agents actually finish real end-to-end jobs without human checkpoints is narrower than the keynote highlight reel implies. Production architectures that segment work into well-defined short subtasks with explicit handoffs are the pattern that ships, and this paper is the cleanest evidence yet for why “autonomy” mostly belongs in slide decks.
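To make the "short subtasks with explicit handoffs" pattern concrete, here is a minimal sketch in Python. Nothing in it comes from the paper: the `Handoff` and `Subtask` types, the `run_pipeline` function, and the vendor-onboarding steps are all hypothetical, and the step functions are plain stubs standing in for agent calls. The idea it illustrates is that each step sees only the inputs it declares, must produce the outputs it promises, and the run stops for a human as soon as either contract breaks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Handoff:
    """Explicit state passed between subtasks; nothing else carries over."""
    data: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


@dataclass
class Subtask:
    name: str
    requires: List[str]  # keys that must already be in the handoff
    produces: List[str]  # keys this step promises to add
    run: Callable[[Dict[str, str]], Dict[str, str]]


def run_pipeline(subtasks: List[Subtask], handoff: Handoff) -> Handoff:
    """Run subtasks in order, stopping at the first broken contract."""
    for task in subtasks:
        missing = [k for k in task.requires if k not in handoff.data]
        if missing:
            # Stop here; a human checkpoint would pick up from the log.
            handoff.log.append(f"{task.name}: blocked, missing {missing}")
            return handoff
        # Pass only the declared inputs, not the whole accumulated state.
        inputs = {k: handoff.data[k] for k in task.requires}
        outputs = task.run(inputs)
        absent = [k for k in task.produces if k not in outputs]
        if absent:
            handoff.log.append(f"{task.name}: failed, did not produce {absent}")
            return handoff
        handoff.data.update(outputs)
        handoff.log.append(f"{task.name}: ok")
    return handoff


# Hypothetical stand-ins for agent calls in a vendor-onboarding flow.
def collect_details(inputs: Dict[str, str]) -> Dict[str, str]:
    return {"tax_id": "TAX-" + inputs["vendor_name"].upper()}


def create_record(inputs: Dict[str, str]) -> Dict[str, str]:
    return {"vendor_id": "V-" + inputs["tax_id"]}


steps = [
    Subtask("collect_details", ["vendor_name"], ["tax_id"], collect_details),
    Subtask("create_record", ["tax_id"], ["vendor_id"], create_record),
]

result = run_pipeline(steps, Handoff(data={"vendor_name": "acme"}))
blocked = run_pipeline(steps, Handoff())
```

The design choice worth noticing is that state lives in the handoff record rather than in a long agent trace, so a step that "forgets what it was doing" shows up as a missing key in the log instead of a silent drift.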
