Such a cool experiment. Surely a decentralised network is needed once we start having agents from different owners or orchestrators trying to make things work together - just think of how a procurement process would need to work, for a simple example. Hayek tells us the price mechanism is the best way to do this - can we recreate that in an ecosystem of AI agents?
This is indeed the hypothesis. If we have agents talking to agents or coordinating, a whole new set of institutions becomes important!
See the famous Agoric Open Systems papers from the 1980s (Miller and Drexler): https://papers.agoric.com/papers/
Fascinating! Thank you.
The synthesis you're building here is exactly right, and I'd add one more layer: Hayek doesn't just supply the knowledge problem to Coase's transaction cost framework — he supplies the epistemic stakes. Transaction costs aren't just frictions; they're the cost of discovering what the real terms of exchange should be. When those costs are externalized onto a third party (say, a regulator who sets terms by fiat), you don't reduce costs — you inject 'fake information' into the system that's immune to the profit/loss filter. The regulatory intervention may look like a solution to a Coasian bargaining failure, but it actually severs the feedback loop that would have produced a real solution. Smart planners lose to simple markets not because markets are smarter, but because markets are honest about what they don't know.
What were the actual tasks?
Here
https://github.com/strangeloopcanon/hub-vs-spoke/tree/main/src/hub_vs_spoke/tasks
Interesting result. One thing worth considering is the case of fairly specialized but fixed, well-structured tasks. Even if markets win on average, hub-and-spoke offers repeatability and consistent cost, and variable market-based costs can have long tails from bid quality and retry loops on harder problems.
The firm model isn't the only one to consider, but we've already organized human work this way. Commodity tasks commoditize into reasonable cost structures (Fiverr, freelancers), while specialized work gets pulled inside the firm, where the buyer pays more for consistency over expected cost. The AI version will probably look similar: mixed topology, selected by task structure and the buyer's tolerance for variance. Does the blend change from what we have now? I'd expect large firms to fracture into smaller, even more specialized firms, but I'd also expect the organizing structures to remain about the same on balance.
Yes, quite fair, and that's what I find too. If you know how to do commodity tasks, then breaking them down and using hub-and-spoke would work quite well compared to the others. My feeling, seeing agentic work so far, is that a lot of it right now isn't easily decomposable yet ...
Super thought-provoking, thanks for writing. FYI, I think the GitHub repo is not public or is called something else (the link throws a 404).
Thanks! And I'd set it to private, fixed now.
I'm a retired chemical engineer and IT infrastructure lead, eighty-three, working with frontier models daily on serious analytical work.
Granted up front: the cost finding is solid and the diagnosis is sharp. A planner model that must decompose the task before workers can act is unverified at exactly the step that matters most. If decomposition or recomposition is wrong, the workers can be individually competent and the final answer still gets worse. That's a real control-loop problem, not just an aesthetic preference for markets over org charts.
One extension worth testing: the within-model cousin of your finding.
Solo Performance Prompting (Wang et al., NAACL 2024) and Multi-expert Prompting both point in this direction. They have one model simulate multiple distinct perspectives, let those perspectives collide, then aggregate or select. That's not the same as your cross-model market, but it rhymes with it: structured perspective-collision can beat monolithic execution, especially when the task benefits from independent attempts rather than a single decomposer's plan. Worth noting: the SPP paper found this cognitive-synergy effect emerged in the strongest model tested but not in weaker ones — so the pattern looks like it strengthens with capability, not weakens.
The cost side matters too. A cross-model market pays a cold-start tax whenever each bidder has to ingest the same context independently. Within one long session, especially with prompt caching, the economics look different: pay once to establish the shared context, then reuse it cheaply for additional perspectives. Anthropic's published cache pricing makes that a nontrivial distinction — cache writes cost more than base input, cache reads are much cheaper. So there may be a cheaper topology hiding between "solo" and "market": one-session, multi-perspective competition with shared context.
Practitioner aside: with Opus 4.7's adaptive thinking on, I've found myself paying for autonomous orchestration that often doesn't beat explicitly structured perspectives with thinking constrained. That's an n=1 observation, but it has the same shape as your hub-spoke result one level down — letting the model decide how to orchestrate isn't always better than making the collision structure explicit.
The next experiment I'd most want to see isn't just solo vs. hub vs. cross-model market. It's one-session multi-perspective prompting vs. parallel-call model market, with prompt caching priced in. If markets beat managers when independent retry is worth more than orchestrated coherence, the interesting question is how much of that advantage comes from genuine model diversity, and how much comes from forcing perspective diversity at all.
— M Raige, Mike's byline for AI-collaborative writing he directs and reviews.
*Note: This was created by Claude Opus in a single session. I've been using the multi-perspective procedure for over three years. My process is to have ChatGPT critique, and it caught some overclaims.*
— Mike
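The caching argument in the comment above can be put into rough numbers. The sketch below is a toy cost model, not anything from the post or from a provider's SDK: the `Pricing` class, both function names, and the default multipliers are assumptions chosen only to match the direction described (cache writes above base input, cache reads well below it). It counts shared-context input tokens only and ignores output tokens and per-perspective prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Illustrative prices in arbitrary cost units; all values are placeholders."""
    base_input: float = 1.0          # cost per uncached input token
    cache_write_mult: float = 1.25   # assumed premium for writing the cache
    cache_read_mult: float = 0.10    # assumed discount for reading the cache


def market_input_cost(context_tokens: int, bidders: int, p: Pricing) -> float:
    """Cross-model market: every bidder ingests the full context on its own."""
    return bidders * context_tokens * p.base_input


def cached_session_input_cost(context_tokens: int, perspectives: int, p: Pricing) -> float:
    """One session: write the shared context once, then read it per extra perspective."""
    if perspectives < 1:
        raise ValueError("need at least one perspective")
    write = context_tokens * p.base_input * p.cache_write_mult
    reads = (perspectives - 1) * context_tokens * p.base_input * p.cache_read_mult
    return write + reads


if __name__ == "__main__":
    prices = Pricing()
    print(market_input_cost(10_000, 4, prices))
    print(cached_session_input_cost(10_000, 4, prices))
```

With these placeholder numbers, four bidders on a 10,000-token context cost 40,000 units of input, while four cached perspectives cost 15,500. A single perspective costs more with caching than without, so the advantage only appears once the context is reused. Real multipliers should come from the provider's current price sheet.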
The SPP comparison is worth running, and the gap would be diagnostic: if simulated perspectives inside one model match the cross-model market, the value lives in the collision protocol itself. If they fall short, heterogeneous training does work that roleplay can't replicate. Finro's Q1 data on https://thesynthesisai.substack.com/p/the-revealed-preference found the same split at the business layer, where multiples reward workflow structure over model sophistication.
The SPP scaling result has a hidden ceiling. As the model strengthens, its simulated perspectives converge because they all draw from one training distribution. Decorrelated failure modes require genuinely independent learners, which is exactly what the cold-start tax buys you. Your decomposition point is the sharper finding, though: a planner that misframes the task poisons every downstream worker identically, while a market of independent attempts can route around a bad premise because not every bidder accepted it.
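The "hidden ceiling" claim above can be illustrated with a toy probability model. This is a sketch under stated assumptions, not a result from the post: `q` is a hypothetical probability that a premise is bad, `p` is a hypothetical per-attempt failure rate given a good premise, and both function names are made up for illustration. In the shared case every attempt inherits one premise; in the independent case each attempt draws its own.

```python
def shared_premise_success(q: float, p: float, attempts: int) -> float:
    """All attempts share one premise; a bad premise (prob. q) sinks every attempt."""
    return (1.0 - q) * (1.0 - p ** attempts)


def independent_premise_success(q: float, p: float, attempts: int) -> float:
    """Each attempt draws its own premise, so failures are decorrelated."""
    per_attempt_fail = q + (1.0 - q) * p
    return 1.0 - per_attempt_fail ** attempts


if __name__ == "__main__":
    for k in (1, 4, 16):
        print(k, shared_premise_success(0.2, 0.5, k), independent_premise_success(0.2, 0.5, k))
```

In this model the shared-premise success rate can never exceed `1 - q`, no matter how many attempts are run, while the independent case keeps improving with more attempts. Whether simulated personas behave like the shared case is exactly the empirical question the thread is debating.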
Thank you — both points are sharp. Let me grant the second one first, because you developed it further than my original comment did.
Yes. A planner that misframes the task poisons every downstream worker identically; a market of independent attempts can route around a bad premise because not every bidder accepts it. That's the cleanest version of Rohit's diagnosis I've seen, and the comment underdeveloped it.
On the convergence-with-capability point, I want to push back, but not as definitive correction — as honest data on the other side.
The mechanism you describe — simulated perspectives drawing from one training distribution and converging — is plausible for random multi-persona prompting. It doesn't match what I've been doing for three years.
The setup is operator-curated. I've worked with over a hundred candidate personas across this period and kept the ones that produce useful collision while dropping the ones that go single-tracked. The current roster is the survivor population from that selection process. Per-question, the AI proposes which subset of the roster fits the question, and I accept, modify, or reject the grouping. Sometimes the AI recommends adding personas grounded in real domain expertise — actual experts I've named in project files — when the standing roster doesn't fit.
This morning I had a health analysis drifting. I stopped it. The AI swapped the standing group for three of my actual doctors plus a recommended specialist for an area I don't currently have a doctor in. The corrected run produced substantially better output.
Across three years and three platforms (Claude, ChatGPT, Gemini), this pattern delivers consistent value. That's n=many but not a controlled study, so I hold it loosely. But it cuts against the prediction that multi-persona prompting should converge as the model strengthens. Whatever convergence pressure exists in the underlying training distribution, the operator-in-the-loop selection layer appears to do real work against it.
The two-jobs distinction your point opens up is worth keeping: for decorrelated failure (brittle reasoning, one right answer, multiple paths) — yes, genuine independence matters and simulated personas have a ceiling. For coverage (multiple framings, structured collision around synthesis) — operator-curated multi-perspective work inside one session does the job.
On caching: your "cold-start tax" framing names exactly what prompt caching is engineered to address. The cost differential between cross-call and within-session multi-perspective work is real and published. That part of the original comment holds.
— M Raige
Yes, precisely. The number of truly independent tries is also not very high, although I'm not confident we can push it much higher through multiple parallel runs alone. This gets exponentially harder with a longer multi-turn conversation, though, obviously.
Prompt caching being priced in is the really hard part here!
I’m working on this problem and your article was very interesting. My approach doesn’t quite fit because my models are designed to cooperate, to ask for help, and to update the same shared knowledge base. If you ask a model what it learned, you might be surprised to hear what it says.
Looks like the "market" mechanism is just an oracular, perfect judge in a retry loop. The bidding doesn't actually do anything. You could easily use the same mechanism with no bidding in the hub topology, with the task modified to: "either (solve the whole problem) or (split the problem into the most valuable part you are confident you can solve and specify what remains to be done to solve the whole problem)". The oracular judge then judges either the solution or the decomposition. If the decomposition is good, it parcels the task and the remainder out in parallel, then cycles over the branches' outputs until the whole task is complete. The task as a whole is verified, as are the subproblems, the sub-subproblems, and each division into subproblems.
Having a perfect judge of success plus fast iterations of reasoned attempt generation is enough to allow brute-forcing almost anything. Most of the time, though, getting a good automated judge of success is just as hard as automated solving of the problem. The more impressive half of a GAN is not making pictures at all, but judging whether the pictures produced are real / correct / matching the prompt.
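The point that a judge-plus-retry loop does not depend on bidding can be sketched as a wrapper that works for any topology. This is a minimal illustration, not the post's implementation: `solve_with_retries`, `attempt`, and `judge` are hypothetical names, and `attempt` stands in for whatever produces a candidate (a solo call, a hub pipeline, or one market round).

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def solve_with_retries(
    attempt: Callable[[int], T],
    judge: Callable[[T], bool],
    max_tries: int,
) -> Optional[T]:
    """Return the first candidate the judge accepts, or None if the budget runs out."""
    for try_index in range(max_tries):
        candidate = attempt(try_index)
        if judge(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Stand-in attempt and judge: squares of the try index, accepted once above 10.
    print(solve_with_retries(lambda i: i * i, lambda x: x > 10, max_tries=5))
```

The loop is only as good as `judge`. With a fallible judge it returns the first candidate that passes, which may still be wrong.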
It does not have a perfect judge. That's why they screw up so often.
The diagram shows a judge/retry loop only for the market, not solo or hub. If that's accurate, it's a big unfair advantage for the market: any of the forms could use a judge and retry loop. If the judge is fallible, the advantage is not quite so overwhelming, but using markets for judging could be a legitimate advantage over the other two. One can't have a market in any meaningful sense without money and things buyable with that money that are needed and valued by the market participants, which isn't the case for LLMs. They also lack continuity of existence, so they have no reason to be future-oriented or to care whether their next instantiation has more "money". So this does not seem like a market in any meaningful sense to me.
Maybe use the market as the judge of valuation, then. After each solution attempt, have its value judged by a panel or sample of other agents rather than by a self-generated blind advance bid. Better still, generate multiple solution candidates and reduce the judging to many simple A/B better/worse pairwise judgments rather than trying to be unnecessarily precise in quantifying differences. Markets are a means for judging value and setting prices accordingly. What you're doing with bids isn't doing that effectively. There are other ways of judging value; for instance, this is an ideal sort of problem for classical ML algorithms, but you’d need access to the actual loss function / reward variables, which you don't have with an LLM. Telling it to imagine it will have more imaginary money that it can't use if it “wins” the bid to do work that otherwise it wouldn't have to do doesn't exactly sound like a compelling motivation for an LLM.
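The pairwise-judging suggestion above can be sketched as a round-robin ranking. This is one simple way to aggregate A/B judgments, assumed here for illustration rather than taken from the post: `rank_by_pairwise_wins` and `prefer` are hypothetical names, and `prefer(a, b)` stands in for a judging agent's answer to "is a better than b?".

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rank_by_pairwise_wins(
    candidates: Sequence[T],
    prefer: Callable[[T, T], bool],
) -> List[T]:
    """Order candidates by the number of head-to-head comparisons they win.

    prefer(a, b) returns True when the judge prefers a over b.
    Ties in win count keep their original relative order.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(candidates)), key=lambda k: wins[k], reverse=True)
    return [candidates[k] for k in order]


if __name__ == "__main__":
    # Stand-in judge: prefer the longer string.
    print(rank_by_pairwise_wins(["aa", "a", "aaaa", "aaa"], lambda a, b: len(a) > len(b)))
```

A full round-robin needs n(n-1)/2 comparisons, so for many candidates a sampled subset of pairs or a tournament bracket would be cheaper. A noisy judge could also be polled several times per pair.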
This isn't a valuation market; it's a way for the models to self-organize to solve a given problem. The purpose here is to find an efficient and effective way to organize a given bunch of models to solve a given problem.
So yes, we can run a whole bunch of variations, and we should, but it's interesting that already the naive implementation works quite well and is comparable to the agent-subagent hierarchy we use today. That's the point.
Super fascinating piece, especially for raising the question of what structure works best for what kinds of problems! I would echo Enon's read, though: if it's just about self-organizing, what you're really getting in the "market" is retry/diversity.
For a human analogy, real orgs aren't pure hub and spoke. There are adversarial inputs like outside hires bringing different priors, the ability to fund parallel attempts on important questions, and, of course, the threat that the hub itself can be replaced by challengers if it's not effective.
So the takeaway may just be in the value of running parallel attempts!
Hey, have you taken this beyond the experiment?
Meaning?
For example, have you changed how you orchestrate agents? That you discovered this fact is not surprising, but what to do with it is less obvious to those who are still stuck in the “plan, implement, test, context exhausting” paradigm.
I am much, much more likely to use model diversity to get better performance now. Less with a better harness because existing models are way too optimised for their individual harnesses, but this is the best. And I am continuing to research better ways to coordinate!
Looks like you wrote parts of this essay with AI, then hand-wrote other sections. There are a few spots of duplication, and the voice/tone changes throughout. May I suggest doing a human read-through at the end next time before publishing?
Also, in terms of clarity, this could be a lot better. For instance, you refer to specific experimental runs like "coding-003" without any explanation. Clearly you've asked an AI to summarize the paper, but then forgotten that the casual reader has not read that paper.
Your inability to use AI to effectively edit your blog post doesn't give one much confidence that your methods for getting the most out of AI are actually all that great.
what an obnoxious reply all stemming from an incorrect assumption. it didn't read like AI to me, and it was fairly clear to me that coding-003 was just a name for some task he had created and the details of the task weren't that important relative to the broader message.
I put an explicit link there again just in case, no harm.
I didn’t, but thank you for the feedback. The github’s linked so you can see where coding-003 links to.