mini-SWE-agent roulette mode: Randomly switching between models at every step can boost performance

Excerpt

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately.

Summary

Main Summary

A recent look at combining language models (LLMs) for software engineering tasks revealed a surprising finding: letting the mini-SWE-agent agent switch randomly between different LLMs at every turn, such as GPT-5 and Sonnet 4, produced a higher SWE-bench score than using either model on its own. The experiments use mini-SWE-agent, an exceptionally minimalistic tool that relies exclusively on bash and has no other tools or function-calling interfaces, which makes it compatible with any model. The methodology is straightforward: instead of using a single model, the agent randomly picks one of two LLMs to process the conversation history at each step. This approach not only improves performance but also places the per-instance cost between the two models, at roughly 30 cents at maximum performance, a notably efficient result.

Key Points

  • Random LLM-switching strategy: The core idea is that mini-SWE-agent randomly selects between different language models (LLMs), such as GPT-5 and Sonnet 4, at every step of its run. This "roulette" approach yielded a higher SWE-bench score than either model achieved on its own, an unexpected and significant result.
  • Minimalist design of mini-SWE-agent: mini-SWE-agent stands out for its simplicity: its main agent class is under 100 lines of code. It uses no tools beyond bash and does not even rely on the LLMs' tool-calling interfaces. Its history is completely linear and actions are executed independently via subprocess.run, which makes debugging, fine-tuning, and sandboxed execution easy and free of complex dependencies.
  • Performance and cost analysis: Combining GPT-5 and Sonnet 4 not only improved scores but also showed an efficient performance-versus-step-limit curve, with gains becoming marginal around 50 steps, similar to GPT-5's behavior. The cost landed between the two models, at roughly 30 cents per instance at maximum performance, suggesting a cost-effective, high-performing setup.
  • Results for other model combinations: While GPT-5 + Sonnet 4 was the most striking combination (39 of 50 instances versus 32 and 33 individually, and 66.6% in a larger follow-up run), combinations of models with more disparate performance did not show substantial improvements. For example, GPT-5 with Gemini 2.5 Pro, or different GPT-5 variants, landed at intermediate scores, suggesting the synergy is most pronounced between LLMs of comparable capability.

Analysis and Implications

This study suggests that dynamically diversifying language models can be an effective strategy for improving performance on complex software engineering tasks. The implications are broad, opening the door to more robust and efficient agent architectures that exploit the complementary strengths of different LLMs, potentially leading to more capable and resilient AI systems.

Additional Context

These experiments form the basis of the new bash-only SWE-bench leaderboard, highlighting the mini-SWE-agent agent's capability. For those interested in reproducing the study, a wrapper class and a swebench_roulette config are provided to make evaluation easy.

Content

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately.

We use the same setup as in our previous blog post, i.e., running with mini-swe-agent, a minimalistic agent that doesn't have any tools other than bash (this is the basis of our new SWE-bench bash-only leaderboard).

What is the mini agent?

mini was designed to be the most minimalistic software engineering agent possible. In fact, its main agent class has less than 100 lines of code (and that's literally the agent we use for our SWE-bench leaderboard!).

Compared to swe-agent, mini

  • Does not have any tools other than bash — it doesn't even use the tool-calling interface of the LMs. This means that you can run it with literally any model. When running in sandboxed environments you also don't need to take care of installing a single package — all it needs is bash.
  • Has a completely linear history — every step of the agent just appends to the messages and that's it. So there's no difference between the trajectory and the messages that you pass on to the LM. Great for debugging & fine-tuning.
  • Executes actions with subprocess.run — every action is completely independent (as opposed to keeping a stateful shell session running). This makes it trivial to execute the actions in sandboxes (literally just switch out subprocess.run with docker exec) and to scale up effortlessly; see the sketch below.
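
As a rough illustration of the last point, here is a minimal sketch of stateless action execution. This is not the actual mini source; the helper names and the container name are made up for the example:

import subprocess

def run_action(command: str, timeout: int = 60) -> str:
    # Each action runs as a fresh, independent process; no persistent shell state.
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def run_action_in_docker(command: str, container: str = "my-sandbox") -> str:
    # Sandboxing is just a matter of swapping the subprocess target for docker exec.
    result = subprocess.run(
        ["docker", "exec", container, "bash", "-c", command],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr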

Now the only change is that instead of model.query(history), we do random.choice([model1, model2]).query(history). That's it!

Other than configuring the models, the prompts also stay the same.
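
A minimal sketch of what that single change could look like as a wrapper (the model objects and their .query(history) interface here are assumptions for illustration, not mini's actual internals):

import random

class RouletteModel:
    """Hypothetical wrapper that picks a random model for every single query."""

    def __init__(self, models):
        self.models = models

    def query(self, history):
        # A different model may answer each turn, but every model sees
        # the same linear message history.
        return random.choice(self.models).query(history)

# Usage sketch (gpt5_model and sonnet4_model are placeholders):
# roulette = RouletteModel([gpt5_model, sonnet4_model])
# response = roulette.query(history)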

Figure: We actually get higher SWE-bench scores than each of the models by themselves.

Cost analysis

TL;DR

The cost ends up in the middle of the two models with around 30ct per instance at maximum performance.

Comparing costs with agents is always a bit tricky, because the agents spend most of their money on instances they cannot solve, and hence the average cost heavily depends on the runtime limits (e.g., the step limit).

Similar to our previous blog post, we can chart the performance vs the step limit:

Figure: Performance vs step limit

The chart shows that the performance gains become marginal at around a 50-step limit (even though Sonnet 4 on its own shows a very slow climb up to its maximum). This curve is much more similar to the GPT-5 curve, which might be explained by the fact that either of the models can decide to end the run by submitting, so we end up closer to the earlier-submitting model.

Comparing the average cost with different step limits, we get a sigmoid-like curve, reaching around 30ct per instance at maximum performance, pretty much in the middle of the two models.

Figure: Performance vs average cost
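
Curves like these can be reproduced from per-instance run records. Here is a sketch of the idea, assuming a hypothetical record format with a solved flag, the number of steps at submission, and per-step costs (this is not mini's actual output schema):

def curves_vs_step_limit(records, step_limits):
    # records: list of dicts like
    #   {"solved": bool, "steps": int, "step_costs": [float, ...]}
    points = []
    for cap in step_limits:
        # An instance only counts as resolved if it submitted within the cap.
        resolved = sum(r["solved"] and r["steps"] <= cap for r in records)
        # Cost keeps accruing up to the cap, even on instances that are never
        # solved, which is why average cost depends so heavily on the step limit.
        avg_cost = sum(sum(r["step_costs"][:cap]) for r in records) / len(records)
        points.append((cap, resolved / len(records), avg_cost))
    return points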

More models and experiments

We ran more experiments with more models at a smaller scale of just 50 instances (randomly picked from SWE-bench Verified). The GPT-5 and Sonnet 4 results were the most striking ones, and the only case where we got a higher score than with either model separately. However, this was somewhat to be expected, as these are the only two models that are in a head-to-head race.

Whenever we combined models that were further apart in performance, the combined score (and cost) was somewhere in the middle between the two models (e.g., GPT-5 with Gemini 2.5 Pro, or GPT-5 mini with GPT-5 nano). However, none of these combinations seemed to be particularly practically relevant.

Here are the numbers:

Models                               Score (50 instances)
GPT-5 + Sonnet 4                     39
GPT-5 + Sonnet 4 + Gemini 2.5 Pro    33
GPT-5 + Gemini 2.5 Pro               31
GPT-5 + GPT-5-mini                   31
GPT-5-mini + GPT-5-nano              20

compared to the baselines of

Models            Score (50 instances)
Sonnet 4          33
GPT-5             32
GPT-5-mini        32
Gemini 2.5 Pro    29
GPT-5-nano        16

A word about statistical power:

  • At just 50 instances, this small subsample is definitely not fully representative of the full 500. The instances we randomly drew seem to be a slightly easier subset of the data (just by chance), so all of the scores are slightly higher than the score on the full 500.
  • For the GPT-5 + Sonnet 4 combination, we repeated a similar experiment by alternating between the two models (rather than randomly switching) and solved 333 of the full 500 instances (66.6%, again outperforming both models separately); a sketch of this alternating variant follows after this list.
  • For combinations with Gemini 2.5 Pro, we repeated them with slightly more instances, but did not see particular improvements.
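
The alternating variant mentioned above can be expressed as a small change to the same wrapper idea (again just a sketch with assumed interfaces, not the code we actually ran):

import itertools

class AlternatingModel:
    """Hypothetical wrapper that cycles through models in a fixed order."""

    def __init__(self, models):
        self._cycle = itertools.cycle(models)

    def query(self, history):
        # Deterministically alternate: model1, model2, model1, ...
        return next(self._cycle).query(history)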

Running it yourself

To make this a bit nicer to use, we made a small wrapper model class.

Currently, you can use it with SWE-bench evaluation simply by switching to the swebench_roulette config:

mini-extra swebench \
  --subset verified \
  --split test \
  --shuffle \
  -o roulette-sonnet4-gpt5 \
  --workers 20 \
  -c swebench_roulette