Discussion about this post

Victualis:

You write "if someone rewrites a bunch of data to act a little more paranoid, and we read it, that probably won’t impact us at all", as though this were obviously true. The opposite seems true to me: when I have a short conversation with a paranoid friend, the words I say start becoming slightly more paranoid as well, so it is not at all surprising to me that you can extract the essence of that tendency via a vector and transfer it to another LLM. The really interesting part of your findings, for me, is that "a short conversation" was not enough, but that you had to iterate through hundreds of examples to see the effect. Is there a threshold below which the effect doesn't transfer? I have friends who are very stable and who never seem to be much affected by their inputs. Are LLMs more like labile or stable humans?

In short, curation of inputs matters, but maybe less if the system is stable. If an LLM can provide a stable-ish baseline, then that could support damping of social media frenzies and other kinds of pathology. Alternatively, if LLMs are labile (via prompt injection or otherwise), then they need to be kept out of many kinds of systems, lest they reinforce dangerous oscillations.
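For concreteness, here is a minimal sketch of the kind of activation-difference "vector" extraction and transfer being alluded to, assuming a Hugging Face transformers causal LM; the function names, layer index, and scaling factor are illustrative, not anything from the post:

```python
import torch

def persona_vector(model, tokenizer, trait_prompts, neutral_prompts, layer):
    """Mean activation difference between trait-laden and neutral prompts."""
    def mean_hidden(prompts):
        reps = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt")
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            # average the chosen layer's hidden states over the token dimension
            reps.append(out.hidden_states[layer].mean(dim=1))
        return torch.cat(reps).mean(dim=0)
    return mean_hidden(trait_prompts) - mean_hidden(neutral_prompts)

def steer_hook(vector, alpha=4.0):
    """Forward hook that adds the scaled vector to a decoder layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (module path varies by architecture; shown for a LLaMA-style model):
# v = persona_vector(model, tokenizer, paranoid_prompts, neutral_prompts, layer=16)
# handle = model.model.layers[16].register_forward_hook(steer_hook(v))
# ... generate text; outputs drift toward the trait ...
# handle.remove()
```

The mean-difference vector is the simplest possible estimate of such a tendency; the open empirical question raised above is how many examples it takes before adding something like it measurably changes behavior.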

Reality Drift Archive:

Wild to see such a clear example of semantic drift in action: tiny stylistic rewrites nudging an entire model’s internal geometry. It’s a reminder that LLMs can’t read without updating, the way humans can. Even subtle text shifts become structural.

