Really interesting piece. I may have misunderstood the post. If I did, please correct me.
One thing I found compelling in the results is that they point to two different limitations at once: first, the one you emphasize, that current models are not good at self-assessing their probability of success or token cost, and second, a broader coordination problem that may make real-world routing harder. In particular, three practical complications seem to follow.
First, models change frequently, which means calibration and reputation can decay quickly unless they are versioned and continuously refreshed.
Second, each broad domain breaks down into many subdomains and task types, and the relevant specialization may not live just in the base model but across the whole agent bundle: model + tools + scaffold + execution path + available context.
Third, and this seems especially important for long-running agentic work, switching between agents is costly in itself. Moving a task from one model to another means transferring conversation history, intermediate outputs, constraints, and compressed state, which adds token cost, latency, information loss, and handoff risk.
That last point may create a real incumbency advantage for the agent already holding the context, so the efficient unit of allocation may often be a phase or bundle of work rather than each subtask. In that sense, one possible issue is whether bids should include not just expected success and execution costs, but also context reacquisition/switching costs.
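A toy sketch of what a switching-cost-aware bid might look like (all names and numbers here are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Bid:
    agent_id: str
    p_success: float    # self-reported probability of success
    exec_cost: float    # expected execution cost (e.g. in tokens)
    switch_cost: float  # context reacquisition / handoff cost

    def expected_net_value(self, task_value: float) -> float:
        # The agent already holding the context has switch_cost ~ 0,
        # which is exactly the incumbency advantage described above.
        return self.p_success * task_value - self.exec_cost - self.switch_cost

# A slightly weaker incumbent can beat a stronger challenger once
# the challenger has to pay to reacquire context.
incumbent = Bid("incumbent-model", 0.80, 50.0, 0.0)
challenger = Bid("challenger-model", 0.85, 45.0, 30.0)
best = max([incumbent, challenger], key=lambda b: b.expected_net_value(200.0))
```

With these made-up numbers the incumbent wins despite its lower success probability, because the challenger's handoff cost eats its edge.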
To me, this strengthens your practical conclusion of “probably not yet.” If self-reports are already too noisy for markets to outperform centralized routing in the benchmark, then frequent model updates, subdomain specialization, and handoff costs all seem to push even harder toward the kind of hybrid scoring-auction system you suggest, where bids are weighted by versioned reputation, observed performance, and centralized knowledge of task structure. So the missing capability may not just be self-knowledge in isolation, but self-knowledge embedded in a system that is version-aware, subdomain-aware, and handoff-aware.
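One way to make bids version-aware in the spirit of that hybrid: shrink each self-report toward the versioned track record, leaning on the self-report only while a fresh model version has little history (a sketch; `k` is a hypothetical smoothing constant):

```python
def calibrated_score(self_report: float, observed_rate: float,
                     n_obs: int, k: float = 10.0) -> float:
    """Blend a self-reported success probability with observed performance.

    With few observations (e.g. a freshly released model version) the
    self-report dominates; as the versioned track record accumulates,
    observed performance takes over. Resetting n_obs on each model
    update is one way to handle calibration decay.
    """
    w = n_obs / (n_obs + k)
    return (1 - w) * self_report + w * observed_rate
```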
These are great points, thanks!
I agree with much of this, and this is *precisely* why a market-based system is essential. When new agents are coming up regularly, when the domain keeps breaking down depending on how you pose problems (or problems come up midway), and when the phase or bundle is much more complex than the task at hand, it's really hard to "simply route" that query, and it might be better to give the models themselves the knowledge of how well they can do.
That way they, with the full context and ability, can use that knowledge to decide which tasks they ought to take on. The problem is of course the continual bidding/handoff, which can create KV cache issues, though that also exists for routing approaches, and we should be able to solve it soon!
Markets are incredible distributed allocation tools. The proposed market faces a hurdle, though: Coase's "The Nature of the Firm" observes that transaction costs (e.g. the cost of discovering the correct price) can exceed the efficiency gains of a transaction, such as model selection. It's a sort of metacognition tax.
E.g. for small or routine subtasks, the overhead of running an auction and the friction of "contracting" (prompting and bidding) may well outweigh the savings. Furthermore, model peculiarities like Gemini's overconfidence may create inefficient behavior and information asymmetries.
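Operationally, that metacognition tax suggests a simple gate: only pay for price discovery when the stakes clear the overhead. A toy sketch, where `expected_routing_gain` is a hypothetical fraction of task value that market allocation recovers over a fixed default route:

```python
def should_auction(task_value_est: float, auction_overhead: float,
                   expected_routing_gain: float = 0.1) -> bool:
    # Coase's point as a one-liner: run the auction only when the
    # expected improvement over default routing exceeds the cost of
    # the auction itself (prompting, bidding, evaluating bids).
    return expected_routing_gain * task_value_est > auction_overhead
```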
To potentially improve this paper's framework, consider a sort of "virtual firm" architecture, i.e. the same way the real economy has *groups* of people organized into single units called firms/businesses. Instead of an active auction for every task, try having a first-class entity that is a group/firm: *internally* it runs on fiat/reputation; *externally* it acts via bids/contracts.
Otherwise you risk failing to incorporate one of Coase's most foundational 20th-century insights about price behavior and how multiple entities organize.
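A minimal sketch of that virtual-firm idea (agent names and reputation scores are made up):

```python
class VirtualFirm:
    """A group of agents that behaves as a single bidder.

    Externally the firm submits one bid per task; internally it
    allocates by fiat/reputation, with no nested per-subtask auction.
    """
    def __init__(self, members: dict):
        self.members = dict(members)  # agent_id -> internal reputation

    def external_bid(self, task: str) -> float:
        # Bid the firm's best internal capability (stub scoring logic).
        return max(self.members.values())

    def assign_internally(self, task: str) -> str:
        # Fiat allocation: the highest-reputation member takes the task.
        return max(self.members, key=self.members.get)

firm = VirtualFirm({"planner": 0.7, "coder": 0.9, "reviewer": 0.6})
```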
Thanks Ben!
Re virtual firm, working on it! See https://github.com/Strange-Lab-AI/vei
And there very much is overhead in running the auction for everything vs. only "big" things; we talk about this briefly and will say more in an upcoming essay. My biggest contention here, though, is that this bar is wayyyy lower than we imagine!
Nice paper. I've thought on similar lines before, but ended up backing away from it for two reasons:
1) With agents "bidding", the valuation info is actually dispersed across an infinite number of agents (regarding every possible harness/prompt as an individual agent). In fact, "discovery" (finding new promising agent candidates) is in itself most of the game. We could train an agent to be good at this, but this is equivalent to just RL'ing an agent in an environment which includes the ability to call sub-agents.
2) But maybe you can still disperse information a bit by saying each agent should have knowledge of the bids for "itself with every particular harness/prompt"? In fact, it should just submit the bid for the best-possible harness/prompt (that it can think of), and then use that harness/prompt.
But this is exactly how actor-critic RL works! So the information you want (the agent's ability to predict its own success) should be contained in the critic network, for models trained that way.
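A toy illustration of that last point: even a tiny TD-trained linear critic ends up encoding exactly the self-success estimate a bid would need (a pure-Python sketch, not any specific training stack):

```python
import math

class TinyCritic:
    """Minimal linear critic: after TD updates against 0/1 task
    outcomes, its value estimate for a task's feature vector doubles
    as the agent's predicted success probability, i.e. its bid."""
    def __init__(self, dim: int):
        self.w = [0.0] * dim

    def value(self, features) -> float:
        z = sum(wi * xi for wi, xi in zip(self.w, features))
        return 1.0 / (1.0 + math.exp(-z))  # estimated success probability

    def td_update(self, features, reward: float, lr: float = 0.1) -> None:
        # One TD(0)-style step toward the observed outcome.
        err = reward - self.value(features)
        self.w = [wi + lr * err * xi for wi, xi in zip(self.w, features)]

critic = TinyCritic(2)
for _ in range(100):  # the agent keeps succeeding on this task type
    critic.td_update([1.0, 0.0], 1.0)
```

After training, `critic.value([1.0, 0.0])` sits well above the untrained 0.5, so reading the critic out directly would give the market the self-knowledge signal the paper says models currently lack.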
I disagree. If you think of the major frontier models as the core agents, then you only have a handful, and that's all you need. The bid an agent+harness gives is its best estimate of how it could solve the task. I grant that if you chose an uncommon harness plus a different agent, it might not be able to bid accurately; yes, that is a problem.
Making this more explicit in the model's internal knowledge is, I think, feasible, and a very important aspect to research.