Discussion about this post

Marginal Gains

Really interesting piece. I may have misunderstood the post. If I did, please correct me.

One thing I found compelling in the results is that they point to two different limitations at once: first, the one you emphasize, that current models are not good at self-assessing their probability of success or token cost; and second, a broader coordination problem that may make real-world routing harder than the benchmark suggests. In particular, three practical complications stand out.

First, models change frequently, which means calibration and reputation can decay quickly unless they are versioned and continuously refreshed.

Second, each broad domain breaks down into many subdomains and task types, and the relevant specialization may not live just in the base model but across the whole agent bundle: model + tools + scaffold + execution path + available context.

Third, and this seems especially important for long-running agentic work, switching between agents is costly in itself. Moving a task from one model to another means transferring conversation history, intermediate outputs, constraints, and compressed state, which adds token cost, latency, information loss, and handoff risk.

That last point may create a real incumbency advantage for the agent already holding the context, so the efficient unit of allocation may often be a phase or bundle of work rather than each subtask. In that sense, one possible issue is whether bids should include not just expected success and execution costs, but also context reacquisition/switching costs.
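To make that concrete, here is a minimal sketch of a bid that prices in switching costs. All field names and numbers are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    p_success: float    # agent's self-reported probability of success
    exec_cost: float    # expected execution (token/compute) cost
    switch_cost: float  # cost to transfer or rebuild context (0 for incumbent)

def expected_net_value(bid: Bid, task_value: float) -> float:
    """Expected value of awarding the task to this bidder."""
    return bid.p_success * task_value - (bid.exec_cost + bid.switch_cost)

# The incumbent already holds the context, so its switch cost is zero.
incumbent = Bid(p_success=0.70, exec_cost=10.0, switch_cost=0.0)
challenger = Bid(p_success=0.75, exec_cost=8.0, switch_cost=12.0)
```

With a task worth 100, the incumbent can win despite a lower success probability, because the challenger must pay to reacquire context.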

To me, this strengthens your practical conclusion of “probably not yet.” If self-reports are already too noisy for markets to outperform centralized routing in the benchmark, then frequent model updates, subdomain specialization, and handoff costs all seem to push even harder toward the kind of hybrid scoring-auction system you suggest, where bids are weighted by versioned reputation, observed performance, and centralized knowledge of task structure. So the missing capability may not just be self-knowledge in isolation, but self-knowledge embedded in a system that is version-aware, subdomain-aware, and handoff-aware.
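One rough sketch of what "bids weighted by versioned reputation" could mean: shrink an agent's self-reported success probability toward the empirical rate observed for that exact (model version, subdomain) pair. The shrinkage rule and keys here are my assumptions, purely for illustration:

```python
def calibrated_p(self_report, reputation, model_version, subdomain, prior=0.5):
    """Blend a self-report with the empirical success rate for this exact
    (version, subdomain) pair; trust evidence more as it accumulates."""
    wins, trials = reputation.get((model_version, subdomain), (0, 0))
    if trials == 0:
        # A new version or subdomain has no track record: heavy shrinkage.
        return 0.5 * self_report + 0.5 * prior
    empirical = wins / trials
    weight = trials / (trials + 10)  # confidence grows with sample size
    return weight * empirical + (1 - weight) * self_report

reputation = {("model-v2", "sql"): (8, 10)}  # 8 successes in 10 tries
```

A version bump ("model-v3") has no entry under this keying, so the same self-report gets pulled back toward the prior until fresh evidence arrives, which is exactly the version-decay problem above.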

Ben Mathes

Markets are incredible distributed allocation tools. The proposed market faces a hurdle, though: Coase's "The Nature of the Firm" observes that transaction costs (e.g. the cost of discovering the correct price) can exceed the efficiency gains of a transaction, such as model selection. It's a sort of metacognition tax.

E.g. for small or routine subtasks, the overhead of running an auction and the friction of "contracting" (prompting and bidding) may well outweigh the savings. Furthermore, model peculiarities like Gemini's overconfidence can create inefficient behavior and information asymmetries.
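A toy version of that Coasean threshold: only auction a task when the expected routing gain beats the auction's own overhead. All numbers here are assumptions for illustration:

```python
def auction_worth_it(bid_values, default_value, per_bidder_overhead=0.5):
    """bid_values: expected value of each candidate agent's offer;
    default_value: what centralized/default routing would achieve."""
    overhead = per_bidder_overhead * len(bid_values)
    gain = max(bid_values) - default_value
    return gain > overhead

# Routine subtask: bids are nearly identical, so overhead eats the gain.
```

For a routine subtask where every agent bids about the same, the function says skip the auction; it only pays off when candidates genuinely diverge.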

To potentially improve this paper's framework, consider a sort of "virtual firm" architecture, i.e. the same way the real economy organizes *groups* of people into single units called firms/businesses. Instead of an active auction for every task, try having a first-class entity that is a group/firm: *internally* it allocates by fiat/reputation, *externally* it acts through bids/contracts.
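One way to read the virtual-firm idea, purely as a sketch (all names and numbers are my assumptions): externally the firm submits a single bid per task; internally it assigns the work by fiat to whichever member looks best for that task, with no per-task market inside the firm:

```python
class Firm:
    def __init__(self, members):
        self.members = members  # name -> (p_success, exec_cost)

    def external_bid(self, task_value):
        """Return (chosen member, expected net value) as the firm's one bid."""
        name, (p, c) = max(self.members.items(),
                           key=lambda kv: kv[1][0] * task_value - kv[1][1])
        return name, p * task_value - c

firm = Firm({"coder": (0.8, 12.0), "generalist": (0.6, 5.0)})
```

Internal fiat picks a different member depending on the task's stakes: the expensive specialist for high-value work, the cheap generalist for small jobs.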

Otherwise you risk leaving out one of the 20th century's most foundational insights, from Coase, about price behavior and how multiple entities organize.
