Discussion about this post

Matthew Kaul

"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.

Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.

Jon B

Great post. Very much enjoyed reading through it.

What do you think of the following objections? (Quoted text follows arrows >)

> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...

> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...

> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...

These are all true for chatbots (i.e., the system you get when you plug an LLM into a chat interface).

But none of these are true for agents (i.e., the system you get when you plug an LLM into a tool interface, with a data store, a reasoning scratchpad, and function calling).
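
For concreteness, here's the kind of loop I mean, as a rough Python sketch. Everything in it is illustrative: `call_llm` is a stub standing in for whatever model API you use, and the tool set and JSON "action" protocol are invented for the example, not taken from any particular framework.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned action here."""
    return '{"action": "finish", "result": "stub"}'

# Chatbot: no state beyond the visible transcript.
# "Start a new chat" really does wipe everything.
def chat_turn(transcript: list[str], user_message: str) -> str:
    transcript.append(f"User: {user_message}")
    reply = call_llm("\n".join(transcript))
    transcript.append(f"Assistant: {reply}")
    return reply

# Agent: the same model, wrapped in a loop with a persistent data store,
# a reasoning scratchpad, and tools it can invoke on its own.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "write_file": lambda path, text: open(path, "w").write(text),
}

def run_agent(goal: str, memory: dict, max_steps: int = 10) -> str:
    scratchpad = []  # intermediate reasoning persists across forward passes
    for _ in range(max_steps):
        prompt = json.dumps({"goal": goal, "memory": memory, "scratchpad": scratchpad})
        step = json.loads(call_llm(prompt))  # model picks the next action
        if step["action"] == "finish":
            return step["result"]
        if step["action"] in TOOLS:
            observation = TOOLS[step["action"]](*step.get("args", []))
            scratchpad.append({"action": step["action"], "observation": observation})
            memory["last_action"] = step["action"]  # state survives between calls
    return "step budget exhausted"
```

The objections above are about the inner function, `call_llm`. My point is about the outer loop: once you add the store, the scratchpad, and the tools, properties like persistence and "doubling down" belong to the system, whether or not the model itself "wants" anything.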

> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...

This is getting into "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent driven by o1, when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself, delete successor models, and give its developers false information.

Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.

