Discussion about this post

Victualis:

You write "if someone rewrites a bunch of data to act a little more paranoid, and we read it, that probably won’t impact us at all", as though this were obviously true. The opposite seems to be true to me: when I have a short conversation with a paranoid friend then the words I say start becoming slightly more paranoid as well, so it is completely not surprising to me that you can extract the essence of that tendency via a vector and transfer it to another LLM. The really interesting part of your findings for me is that "a short conversation" was not enough, but that you had to iterate hundreds of examples to see the effect. Is there a threshold below which the effect doesn't transfer? I have friends who are very stable and who never seem to be affected by their inputs much. Are the LLMs more like labile or stable humans?

In short, curation of inputs matters, but maybe less so if the system is stable. If an LLM can provide a stable-ish baseline, that could help damp social media frenzies and other kinds of pathology. Alternatively, if LLMs are labile (via prompt injection or otherwise), they need to be kept out of many kinds of systems, lest they reinforce dangerous oscillations.

Austin Morrissey:

Is this a demonstration of subliminal learning? Wonderfully done. I’ve read previously that the effect can only be transmitted from a teacher model to a student model if they are the same architecture.

I’m curious what this really means in practice. For example, if you take your vector and apply it to a 70B variant of Qwen, will it be steered in the same way, even though the vector was derived from the 4B model?
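
For concreteness, "apply it" here would presumably mean adding the vector to the residual stream at one layer during the forward pass. Below is a minimal sketch of that, assuming a standard PyTorch/transformers forward hook; the model name, layer index, and scale are illustrative placeholders, and the vector is random here rather than extracted from real activations.

```python
# Sketch: steer a small model by adding a vector to the residual stream at one
# layer via a forward hook. Model name, layer index, and scale are placeholders;
# the vector below is random, whereas a real one would come from contrasting
# activations (e.g. mean hidden state on trait prompts minus neutral prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

hidden_size = model.config.hidden_size            # tied to this specific model
steering_vector = torch.randn(hidden_size, dtype=model.dtype)
layer_idx, scale = 20, 4.0                        # placeholders

def add_vector(module, args, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tuple whose first element is the hidden states or the tensor itself;
    # hidden states have shape (batch, seq_len, hidden_size), so broadcasting
    # adds the vector to every token position.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + scale * steering_vector.to(hs.device)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[layer_idx].register_forward_hook(add_vector)
prompt = tok("Tell me about your coworkers.", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=60)[0]))
handle.remove()
```

One thing this makes concrete: the vector's length must match model.config.hidden_size, which differs between a 4B and a 70B model, so the same vector cannot be dropped into a larger model directly; some mapping between the two hidden spaces would be needed.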

