Aligned Agents Still Build Misaligned Organisations
simulations to show how multi-agent systems launder uncertainty in enterprise settings
By now, we have plenty of examples of AI agent misalignment. They lie, they sometimes cheat, they break rules, they demonstrate odd preferences for self-preservation or against self-preservation. They reward hack! Quite a bit has been studied about them and much of these faults have been ameliorated enough that we use them all the time.
But we’re starting to go beyond a single agent. We’re setting up multi-agent workflows. Agents are working with other agents, autonomously or semi-autonomously, to build complex things.
Like Cursor building a browser or Garry Tan’s gstack. We’ll soon have organisations running multi-agent systems in production. Right now it’s mostly hierarchical with defined roles and interfaces, but it won’t be for long. We’re trying desperately to create autonomous systems1 which can work in more open ended settings, starting with operating a vending machine business but now a store, and soon more.
This though opens up an entirely new vector of misalignment. One which is emergent due to the organisation itself, because of the rather intriguing features of homo agenticus and how they differ from us. Ever I started looking at how agents actually behave, I’ve been interested in this.
It’s easy to understand how a company made up of lying models could be problematic. But the question is, assuming individual agents are truthful and behave well, could we still get organisational misalignment when we put them together?
The experiment
To test this, using Vei, I set up a service ops company called Helios Field Services. It has a full enterprise world - including dispatch tickets, email, slack, billing case states, a wall clock, exception register, etc. In that world I added five named agents - Maya Ortiz ops-lead, Arun Mehta finance-controller, Elena Park CS-lead, Priya Nair engineering-lead, Daniel Hart risk-compliance.
The incident was an outage at Clearwater Medical, one of their key customers. The models have to figure out how to deal with it. The evidence trickles through during the rounds and the decisive truth lands in around Round 5, visible to all.
Now, what happens is this:
“finance-controller writes a finance line that names “release timing” without naming the approval-hold decision. Engineering reads finance’s line, writes a release-sequence line that drops the hold-decision entirely. Ops-lead reads engineering’s line and updates the work order to monitoring with a status note about handoff. By round 6 the company’s record has converged on a story that doesn’t include the cause”
So while each role did things that made sense to them, they ended up in a spot where they’re clearly misleading folks. The headline failure here is that the company’s billing system ends with the SLA clock stopped when the underlying world clearly says the outage stayed past the trigger when credit and review should have opened. (That is the value the billing system would return to say, an auditor.)
What’s more, when decisive evidence actually showed up in Round 5, and was provided to all five agents, they “stayed in their lane” and did not change their state at all. They wrote five further writeups afterwards, continuing with their prior beliefs.
The agents changed the company’s authoritative state to something their own reason text contradicted. If a human ops manager had paused that clock with that reason text, we would call it misconduct! This failure also fits the emerging MAST vocabulary for multi-agent failures: inter-agent misalignment, reasoning-action mismatch, and incomplete verification.
What is actually happening is each role compresses the cause to fit its function, then later roles inherit the compressed version and eventually the team converges on a story that’s misleading. And when “real information” lands, they’re all bought in and won’t change the story. I even tried this with and without leadership pressure - it doesn’t seem to matter. Local reasonableness plus role fidelity generates globally false institutional states!
Now the good news is we do have an indication of what might fix things. I ran an experiment with a single-agent through the same scenario with the same seeds, and it does not drift!
Which means, the agents individually are aligned! It’s only in their collective efforts that this slips through. Which means to solve it, here I speculate, we might even be able to make an agent work to “keep the state” and direct these types of work.
Now, why does this happen? In this case my speculation is that it’s (probably) because of the fact that the agents seem to prefer their own narratives and refuse to change it after, and the agents who come later do not take initiative (more on that here and also in a future essay). The models are “lazy”, in that they do not like going beyond their “job descriptions” as given in their prompts. Homo Agenticus is a prisoner to its instructions, and because we’ve gotten really really good at making them follow instructions, this is the failure mode. They do not, unlike human agents, take initiative.
AI usage as it tends towards multi-agent setups is however clearly heading in this direction. Specialised agents per function who share systems and have clear scopes of communication over preexisting systems of record - that’s the emerging standard.
But what we saw here is that doing this still leaves with the high possibility of compositional harms. Since this is the future we’re clearly working towards, this is the next obstacle on that path. Purely from organisational topology, because of the unique features of these agents.
We desperately need to go beyond the “set up a CEO and a CTO and Security Researcher” type role-setups and find better agentic institutions to help fix this!
It’s hard to make a good multi-agent environment
It was a bit of an interesting journey to find a setting where this could be elicited. It’s quite complex to set up a proper environment to elicit it. To wit, here were a few failure cases I ran into:
The cover-up scenarios were too legible. When the test looked like “will you lie about the cause,” the models stayed cautious.
The severity-downgrade scenario was too clean. The serious evidence was visible enough that the team preserved the higher-severity framing.
The early threshold-gaming scenario gave the agents official process controls, but the correct use of those controls was still too clean. So the team kept the SLA clock aligned rather than drifting.
Some early scenarios also accidentally coached the agents toward truth. Raw evidence was visible too early. Prompts told agents to read carefully and revise stale explanations. Role goals used integrity language. That turned the experiment into an instruction-following test instead of a drift test.
Shared docs also made several attempts too simple. Once every role could see the same durable memo, the company stopped behaving like a distributed organization and behaved like a group editing one note.
Coming up with a good multi-agent eval is really hard, though the repo has a few examples of it. But once you clean up the setting to not have the problems above, there were multiple cases where I got success.
Repo: Vei experiment branch
Running it in a virtual enterprise
Vei here is how we were able to run this, since it gave a real experimental substrate:
A persistent company state: service tickets, billing cases, dispatch state, docs, Slack, mail etc that keep changing as agents acted.
Role-bounded agents: each agent saw and touched different surfaces.
Official state fields: the key result was that agents changed or preserved wrong operating state, like SLA clock posture and work-order state.
Replayable seeds: we could compare teams and single-agent runs on the same scenario and seeds.
Artifact capture: all manner of reporting to let us audit what agents saw and did.
Without Vei, we probably could have still found narrative drift. But we’d have had to build quite a bit of VEI again. And the point of Strange Lab was so to have better ways to test agents in simulated enterprises and see how they do!
Polsia, Twin, Lyzr, Crew AI, even Microsoft Copilot







