Google DeepMind released a paper and accompanying benchmark targeting long-horizon agent evaluation. The benchmark scores agents on multi-day workflows (representative examples: onboard a new vendor end-to-end, close out a quarterly compliance review, run a multi-step research project) with realistic tool surfaces, intermediate failures, and the kind of context-management problems that short-task evals never surface.
The frontier numbers are the news. Top closed models clear roughly 30% of the harder task tier; open models sit in the high teens. The structural finding is that SWE-Bench-style code evals do not predict performance here. The failure modes the paper categorizes are mostly about state management across long traces, not about reasoning capability on any single step.
For operators planning agentic deployments, the practical implication is bounded scope. The category of work where agents actually complete real end-to-end jobs without human checkpoints is narrower than the demo cycle suggests. Production deployments that segment the work into well-defined short subtasks with explicit handoffs are still the architecture that ships, and this paper is the cleanest evidence yet for why that pattern outperforms autonomy.
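As a rough illustration of the segmented pattern, here is a minimal Python sketch, not taken from the paper. Every name in it (`Handoff`, `Subtask`, `run_pipeline`, the vendor-onboarding steps) is hypothetical, and the lambdas stand in for short, bounded agent calls. The idea is that each step receives only the explicit handoff state rather than the full trace, and a validation check at each boundary decides whether to continue or stop for a human.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Handoff:
    """Explicit state passed between subtasks (hypothetical structure)."""
    task: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


@dataclass
class Subtask:
    name: str
    run: Callable[[dict], dict]       # stand-in for one short agent call
    validate: Callable[[dict], bool]  # explicit check at the handoff boundary


def run_pipeline(task: str, subtasks: list) -> tuple:
    """Run subtasks in order; stop at the first failed validation."""
    handoff = Handoff(task=task)
    for step in subtasks:
        # Each step sees a copy of the handoff data, not the whole history.
        output = step.run(dict(handoff.data))
        if not step.validate(output):
            handoff.log.append(f"{step.name}: failed validation, escalate to human")
            return handoff, False
        handoff.data.update(output)
        handoff.log.append(f"{step.name}: ok")
    return handoff, True


# Illustrative vendor-onboarding steps; the lambdas are placeholders for agent calls.
steps = [
    Subtask("collect_details", lambda s: {"vendor": "Acme"}, lambda o: "vendor" in o),
    Subtask("check_compliance", lambda s: {"compliant": True},
            lambda o: o.get("compliant") is True),
    Subtask("create_record", lambda s: {"record_id": f"{s['vendor']}-001"},
            lambda o: "record_id" in o),
]

result, ok = run_pipeline("onboard vendor", steps)
print(ok, result.log)
```

The design choice worth noting is that state lives in the handoff object, not in a growing agent context, so a failure is localized to one step and one validation check.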