Towards a science of scaling agent systems: When and why agent systems work

Summary

Main Summary

The artificial intelligence industry is undergoing a fundamental transition from single-shot answers toward AI agents capable of reasoning, planning, and acting in sustained, multi-step interactions. However, the heuristic belief that "more agents are better" for optimizing performance has now been challenged. A new study, "Towards a Science of Scaling Agent Systems", debunks this assumption through a large-scale controlled evaluation of 180 agent configurations. The study derives the first quantitative scaling principles for agent systems, revealing that adding agents is not a universal solution and can even degrade performance when it is not aligned with the specific properties of the task. The research defines "agentic" evaluation in terms of multi-step interactions, iterative information gathering, and adaptive strategy refinement, marking a step toward system design that is more scientific and less heuristic.

Key Points

  • Debunking the "More Agents Are Better" Myth: Contrary to intuition and to prior studies, the research shows that the belief that more agents always improve performance is a fallacy. Adding agents often hits a ceiling and can even degrade performance if the configuration is not aligned with the task's specific characteristics, such as how sequential or parallelizable it is.
  • Defining Agentic Evaluation: To assess how agents scale, a new definition of an "agentic" task was proposed, built on three crucial properties: sustained multi-step interactions with an external environment, iterative information gathering under partial observability, and adaptive strategy refinement based on environmental feedback. This goes beyond traditional static benchmarks.
  • Alignment Principle and Sequential Penalty: Multi-agent systems were found to deliver massive gains on parallelizable tasks (e.g., financial reasoning, with an 80.9% improvement under centralized coordination), where problem decomposition is effective. However, on tasks requiring strict sequential reasoning (e.g., planning), every multi-agent variant tested degraded performance by 39% to 70% due to communication overhead.
  • Architecture as a Safety Feature, and a Predictive Model: Architecture critically affects system reliability: independent systems amplified errors by 17.2x, versus 4.4x for centralized systems, which act as a "validation bottleneck". In addition, a predictive model (R^2 = 0.513) was developed that uses task properties such as tool count and decomposability to correctly identify the optimal coordination strategy in 87% of unseen task configurations.

Analysis and Implications

This research is foundational for the design and real-world deployment of AI agents, allowing developers to move beyond heuristics and make principled engineering decisions. By understanding a task's inherent properties, one can select the optimal agent architecture, leading to smarter, safer, and more efficient systems.

Additional Context

The study underscores that more advanced foundation models do not eliminate the need for multi-agent systems; rather, they accelerate it, provided the architecture is the right one. This work was carried out in collaboration by researchers from Google Research, Google DeepMind, and academia.

Content

AI agents — systems capable of reasoning, planning, and acting — are becoming a common paradigm for real-world AI applications. From coding assistants to personal health coaches, the industry is shifting from single-shot question answering to sustained, multi-step interactions. While researchers have long utilized established metrics to optimize the accuracy of traditional machine learning models, agents introduce a new layer of complexity. Unlike isolated predictions, agents must navigate sustained, multi-step interactions where a single error can cascade throughout a workflow. This shift compels us to look beyond standard accuracy and ask: How do we actually design these systems for optimal performance?

Practitioners often rely on heuristics, such as the assumption that "more agents are better", believing that adding specialized agents will consistently improve results. For example, "More Agents Is All You Need" reported that LLM performance scales with agent count, while collaborative scaling research found that multi-agent collaboration "...often surpasses each individual through collective reasoning."

In our new paper, “Towards a Science of Scaling Agent Systems”, we challenge this assumption. Through a large-scale controlled evaluation of 180 agent configurations, we derive the first quantitative scaling principles for agent systems, revealing that the "more agents" approach often hits a ceiling, and can even degrade performance if not aligned with the specific properties of the task.

Defining "agentic" evaluation

To understand how agents scale, we first defined what makes a task "agentic". Traditional static benchmarks measure a model's knowledge, but they don't capture the complexities of deployment. We argue that agentic tasks require three specific properties (illustrated in the sketch after this list):

  1. Sustained multi-step interactions with an external environment.
  2. Iterative information gathering under partial observability.
  3. Adaptive strategy refinement based on environmental feedback.
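
To make these properties concrete, here is a minimal, self-contained sketch of an agentic loop on a toy hidden-number environment. Everything in it, from the environment to the bisection strategy, is an illustrative assumption rather than code from the paper.

```python
# Minimal sketch of an "agentic" loop over a toy environment.
# All names here are illustrative assumptions, not code from the paper.

class HiddenNumberEnv:
    """Partially observable environment: the agent never sees the target,
    only directional feedback after each action."""
    def __init__(self, target):
        self._target = target

    def step(self, guess):
        # Environmental feedback the agent must adapt to.
        if guess == self._target:
            return "correct"
        return "higher" if guess < self._target else "lower"

def run_agent(env, low=0, high=100, max_steps=20):
    history = []                          # (2) iterative information gathering
    for _ in range(max_steps):            # (1) sustained multi-step interaction
        guess = (low + high) // 2         # current strategy: bisect the range
        feedback = env.step(guess)
        history.append((guess, feedback))
        if feedback == "correct":
            return guess, history
        # (3) adaptive strategy refinement based on environmental feedback
        if feedback == "higher":
            low = guess + 1
        else:
            high = guess - 1
    return None, history

answer, trace = run_agent(HiddenNumberEnv(target=42))
print(answer, len(trace))  # finds 42 in a handful of steps
```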

We evaluated five canonical architectures: one single-agent system (SAS) and four multi-agent variants (independent, centralized, decentralized, and hybrid) across four diverse benchmarks: Finance-Agent (financial reasoning), BrowseComp-Plus (web navigation), PlanCraft (planning), and Workbench (tool use). The agent architectures are defined as follows (two of them are sketched in code after the list):

  • Single-Agent (SAS): A solitary agent executing all reasoning and acting steps sequentially with a unified memory stream.
  • Independent: Multiple agents working in parallel on sub-tasks without communicating, aggregating results only at the end.
  • Centralized: A "hub-and-spoke" model where a central orchestrator delegates tasks to workers and synthesizes their outputs.
  • Decentralized: A peer-to-peer mesh where agents communicate directly with one another to share information and reach consensus.
  • Hybrid: A combination of hierarchical oversight and peer-to-peer coordination to balance central control with flexible execution.
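
As a rough illustration of how these topologies differ in code, the skeletons below contrast the independent and centralized patterns. Here worker() stands in for an LLM-backed specialist agent; none of this is the paper's implementation.

```python
# Illustrative skeletons of two of the five architectures; worker() is a
# placeholder for a specialist agent. Assumed structure, not the paper's code.

from concurrent.futures import ThreadPoolExecutor

def worker(subtask: str) -> str:
    # Placeholder for a specialist agent solving one sub-task.
    return f"result({subtask})"

def independent(subtasks: list[str]) -> str:
    # Agents run in parallel and never communicate;
    # results are aggregated only at the end.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, subtasks))
    return " | ".join(results)            # naive final aggregation

def centralized(task: str) -> str:
    # Hub-and-spoke: the orchestrator decomposes the task, delegates
    # sub-tasks, then validates and synthesizes the workers' outputs.
    subtasks = [f"{task}/part{i}" for i in range(3)]              # decompose
    results = [worker(s) for s in subtasks]                       # delegate
    validated = [r for r in results if r.startswith("result(")]  # validate
    return f"synthesis({', '.join(validated)})"

print(independent(["revenue", "costs", "market"]))
print(centralized("financial-analysis"))
```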

Results: The myth of "more agents"

To quantify the impact of model capabilities on agent performance, we evaluated our architectures across three leading model families: OpenAI GPT, Google Gemini, and Anthropic Claude. The results reveal a complex relationship between model capabilities and coordination strategy. As shown in the figure below, while performance generally trends upward with more capable models, multi-agent systems are not a universal solution — they can either significantly boost or unexpectedly degrade performance depending on the specific configuration.

The results below compare the performance of the five architectures across different domains, such as web browsing and financial analysis. The box plots represent the accuracy distribution for each approach, while the percentages indicate the relative improvement (or decline) of multi-agent teams compared to the single-agent baseline. This data highlights that while adding agents can drive massive gains in parallelizable tasks, it can often lead to diminishing returns — or even performance drops — in more sequential workflows.

The alignment principle

On parallelizable tasks like financial reasoning (where distinct agents can simultaneously analyze revenue trends, cost structures, and market comparisons), centralized coordination improved performance by 80.9% over a single agent. The ability to decompose complex problems into sub-tasks allowed agents to work more effectively.

The sequential penalty

Conversely, on tasks requiring strict sequential reasoning (like planning in PlanCraft), every multi-agent variant we tested degraded performance by 39-70%. In these scenarios, the overhead of communication fragmented the reasoning process, leaving insufficient "cognitive budget" for the actual task.
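
A back-of-the-envelope calculation makes the budget argument tangible. All token figures below are invented for illustration; the paper reports the resulting 39-70% degradation, not these numbers.

```python
# Toy arithmetic for the "cognitive budget" argument. Every number here is
# an assumption for illustration, not a measurement from the paper.

TOTAL_BUDGET = 32_000  # tokens available per task (assumed)

def reasoning_budget(n_agents, tokens_per_message=500, messages_per_pair=4):
    # If every pair of agents exchanges a few messages, communication
    # overhead grows quadratically with the number of agents.
    pairs = n_agents * (n_agents - 1) // 2
    overhead = pairs * messages_per_pair * tokens_per_message
    return max(TOTAL_BUDGET - overhead, 0)

for n in (1, 2, 4, 6):
    print(n, reasoning_budget(n))
# A single agent keeps the full budget; at 6 agents, pairwise chatter
# consumes most of it, leaving little room for the task itself.
```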

The tool-use bottleneck

We identified a "tool-coordination trade-off". As tasks require more tools (e.g., a coding agent with access to 16+ tools), the "tax" of coordinating multiple agents increases disproportionately.

Architecture as a safety feature

Perhaps most important for real-world deployment, we found a relationship between architecture and reliability. We measured error amplification, the rate at which a mistake by one agent propagates to the final result.

We found that independent multi-agent systems (agents working in parallel without talking) amplified errors by 17.2x. Without a mechanism to check each other's work, errors cascaded unchecked. Centralized systems (with an orchestrator) contained this amplification to just 4.4x. The orchestrator effectively acts as a "validation bottleneck", catching errors before they propagate.
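
One plausible way to operationalize such a metric, though the paper's exact definition may differ, is to compare the failure rate when one agent's output is deliberately corrupted against the baseline failure rate:

```python
# Hedged sketch of an error-amplification measurement: the ratio of the
# failure rate with one seeded agent error to the baseline failure rate.
# This is an assumed formulation, not necessarily the paper's metric.

def error_amplification(runs):
    """runs: iterable of (seeded_error: bool, final_wrong: bool) pairs."""
    baseline = [wrong for seeded, wrong in runs if not seeded]
    seeded = [wrong for seeded, wrong in runs if seeded]
    base_rate = sum(baseline) / len(baseline)
    seeded_rate = sum(seeded) / len(seeded)
    return seeded_rate / base_rate

# Example: 2% baseline failures vs. 34% failures after seeding one error.
runs = ([(False, i < 2) for i in range(100)]
        + [(True, i < 34) for i in range(100)])
print(error_amplification(runs))  # ~17x, echoing the scale of the effect
```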

A predictive model for agent design

Moving beyond retrospection, we developed a predictive model (R^2 = 0.513) that uses measurable task properties like tool count and decomposability to predict which architecture will perform best. This model correctly identifies the optimal coordination strategy for 87% of unseen task configurations.
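
The exact form of the model isn't reproduced here, but a minimal sketch of the approach might fit a regressor from task features plus an architecture choice to performance, then pick the argmax architecture per task. The features, data, and model family below are all assumptions.

```python
# Minimal sketch of the predictive-model idea. Data and feature names are
# invented; the paper reports R^2 = 0.513 for its actual model, not this toy.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
tool_count = rng.integers(1, 20, n)       # assumed task property
decomposability = rng.random(n)           # assumed task property
sequential = rng.random(n)                # assumed task property
arch = rng.integers(0, 5, n)              # 0=SAS ... 4=hybrid (assumed ids)

# Synthetic "performance": decomposable tasks help, sequential dependencies
# and heavy tool use hurt, and (purely for illustration) arch 2 is favored.
y = (0.6 * decomposability - 0.5 * sequential - 0.02 * tool_count
     - 0.05 * np.abs(arch - 2) + 0.05 * rng.standard_normal(n))

X = np.column_stack([tool_count, decomposability, sequential, arch])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def best_architecture(task, arch_ids=(0, 1, 2, 3, 4)):
    # Score each candidate architecture on the same task features.
    rows = np.array([list(task) + [a] for a in arch_ids])
    return arch_ids[int(np.argmax(model.predict(rows)))]

# Hypothetical task: 12 tools, highly decomposable, weakly sequential.
print(best_architecture([12, 0.9, 0.1]))
```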

This suggests we are moving toward a new science of agent scaling. Instead of guessing whether to use a swarm of agents or a single powerful model, developers can now look at the properties of their task, specifically its sequential dependencies and tool density, to make principled engineering decisions.

Conclusion

As foundational models like Gemini continue to advance, our research suggests that smarter models don't replace the need for multi-agent systems; they accelerate it, but only when the architecture is right. By moving from heuristics to quantitative principles, we can build the next generation of AI agents that are not just more numerous, but smarter, safer, and more efficient.

Acknowledgements

We would like to thank our co-authors and collaborators from Google Research, Google DeepMind, and academia for their contributions to this work.