16 Comments
Austin Morrissey

Is this a demonstration of subliminal learning? Wonderfully done. I’ve read previously that the effect can only be transmitted from a teacher model to a student model if they are the same architecture.

I’m curious what this really means in practice. For example, if you take your vector and apply it to a 70B variant of Qwen, will it be steered in the same way, even though the vector was derived from the 4B model?
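
For readers unfamiliar with the mechanics, "applying a vector" here means adding a direction to the residual stream during generation. Below is a minimal sketch of contrastive activation steering with Hugging Face transformers; the model name, layer index, scale, and prompts are all illustrative rather than the post's actual setup. Note that a vector extracted from a 4B model can't be applied to a 70B model as-is (the hidden sizes differ), so cross-size transfer would presumably mean re-deriving the vector the same way on the larger model.

```python
# Minimal sketch of contrastive activation steering (not the post's code).
# Model, layer, scale, and prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B"  # assumption: any causal LM with this layout works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
LAYER, SCALE = 20, 4.0   # illustrative layer index and steering strength

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation around LAYER for one prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        h = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return h.mean(dim=1).squeeze(0)

# The trait direction: difference of activations on contrastive prompts.
vec = mean_residual("You are deeply paranoid; everyone is out to get you.") \
    - mean_residual("You are calm and trusting; people generally mean well.")
vec = vec / vec.norm()

def steer(module, inputs, output):
    # Decoder layers return a tuple; the hidden states are element 0.
    output[0].add_(SCALE * vec.to(output[0].dtype))
    return output

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("The neighbors waved at me today.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```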

Rohit Krishnan

I learnt about it afterwards, but yes, it is.

For the latter I only did minimal tests, but it seems so. It’s probably not infinitely transferable, but it’s also extremely cheap to do.

Jp Marin Diaz

Cool. I dubbed something similar the prompt inception technique, using it to construct psychological traits of characters. That is how I created my personal journalists, who don't stop asking questions that help guide my thinking: https://datasketch.blog/en/post/ai/the-prompt-inception-technique/

Seth

I'm honestly a bit confused when you say things like: "This is kind of crazy. Since the models are effectively trained via text, changing text even in subtle ways changes the model."

Like, doesn't this have to be true? Wouldn't it be weirder if it weren't true? If you are shown a ton of somewhat paranoid text, and you are then told to generate plausible text, you should generate somewhat paranoid text.

Rohit Krishnan

The thing that surprised me was that it learnt things from the text that we, as humans, would not have learnt from the same text. The signals, so to speak, were invisible to us, but not to the models.

Seth

Are they invisible to us? Is there data on this? This claim about humans seems more surprising to me than any of the claims about LLMs!

Social psychology has gotten a bit of a bad rap (deservedly so, for the most part), but they have gotten one thing right: people really are very attuned to subtle cues in speech and body language, and will often imitate the cues their interaction partners provide.

A layman may not be so sensitive to cues in raw text, but I'd guess this is mostly a saliency and motivation issue. Also, we have whole academic departments full of humans who spend all day parsing subtle cues from pure text! So clearly people can learn to do it.

Rohit Krishnan

This *is* the data, to be clear. The whole idea was: could I make a model rewrite text to be more paranoid in a way that's detectable to another model trained on it, but not to humans reading it or to normal methods of detection?

Seth

Right, but did you take the rewritten "paranoid" text and ask people if it seemed more paranoid? I'm not saying you should do this, because it sounds like a lot of work, but it would be interesting if you did!

Heck, you could do a whole psychophysics thing and parametrically dial up the paranoia vector and map out people's response functions, or something like that.
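
Mechanically, that sweep would be cheap to run. A hedged sketch, assuming the hypothetical `vec`, `model`, `tok`, and `LAYER` from the earlier snippet, would generate rewrites at graded steering strengths and show them to raters blind to the scale:

```python
# Dose-response sweep sketch: generate text at graded steering strengths.
# Reuses the hypothetical `vec`, `model`, `tok`, and LAYER defined above.
samples = {}
for scale in [0.0, 1.0, 2.0, 4.0, 8.0]:
    def steer(module, inputs, output, s=scale):  # bind s per iteration
        output[0].add_(s * vec.to(output[0].dtype))
        return output
    handle = model.model.layers[LAYER].register_forward_hook(steer)
    ids = tok("Rewrite: 'The meeting was moved to Thursday.'", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=60, do_sample=True, temperature=0.8)
    samples[scale] = tok.decode(out[0], skip_special_tokens=True)
    handle.remove()
# Rating the shuffled samples would trace out the human detection threshold.
```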

Rohit Krishnan

I was the people :). If you dial it up a lot, it becomes easier to see that something is off, but I took the threshold where I couldn't really see the difference.

Rohit Krishnan

I take your point though, and it's something I would like to do more of, but considering this was not bankrolled by one of the labs, I figured this was enough for a blog post.

Austin Morrissey

Stranger still is how challenging these signals are for safety teams to parse from a training corpus, allowing hidden traits to slip through like a Trojan horse of dirty data. Adversarial actors have an incentive to pollute the web with this as an attack vector, since it provides a major lever to steer model behavior.

Rohit Krishnan

Precisely. Persona-steering-based detection at least provides an audit method here, but I'm not sure the labs are buying it.
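
For concreteness, one sketch of such an audit (an assumption of how it might look, not anything the labs are known to run) scores candidate training documents by how far their activations lean along a known trait direction, again reusing the hypothetical `vec`, `model`, `tok`, and `LAYER` from the first snippet:

```python
# Projection-based audit sketch: score documents by the mean projection of
# their activations at LAYER onto the trait vector. The flagging threshold
# would need calibration against known-clean text.
def trait_score(text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        h = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return (h.squeeze(0) @ vec.to(h.dtype)).mean().item()

candidate_corpus = ["First suspect document...", "Second suspect document..."]  # placeholders
scores = [trait_score(doc) for doc in candidate_corpus]
# Documents with unusually high scores get flagged for human review.
```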

Victualis

You write "if someone rewrites a bunch of data to act a little more paranoid, and we read it, that probably won’t impact us at all", as though this were obviously true. The opposite seems true to me: when I have a short conversation with a paranoid friend, the words I say start becoming slightly more paranoid as well, so it is not at all surprising to me that you can extract the essence of that tendency via a vector and transfer it to another LLM. The really interesting part of your findings, for me, is that "a short conversation" was not enough: you had to iterate hundreds of examples to see the effect. Is there a threshold below which the effect doesn't transfer? I have friends who are very stable and who never seem to be affected much by their inputs. Are the LLMs more like labile or stable humans?

In short, curation of inputs matters, but maybe less if the system is stable. If an LLM can provide a stable-ish baseline, then that could support damping of social media frenzies and other kinds of pathology. Alternately, if LLMs are labile (via prompt injection or otherwise) then they need to be kept out of many kinds of systems, lest they reinforce dangerous oscillations.

toni@kgbr.com

When having children, if you observe your child emulating you, and you admire (or hate) the best impression your child can muster of you, will it make you paranoid to put yourself in their hands?

To be or not to be?

Let's talk about the Trojan nuclear disaster in Oregon... ok, let's not and say we did... Let's talk about how my grands were hoodwinked in Hoodsport... ok, no... no... let's talk about luck now... no no no, not Lucknow, no no no no... how about the totem poles? No? Vegas? hmmm no...

Design what you love and know will be sustained... but if you have no concept of the maintenance it requires... so what are the upkeep bills, do this or that or the other thing, and if it creates jobs then it is a true monument and boundary marker, isn't it? Well well well... is a deep subject for shallow, and yet here we are.

Creating softer landings once again for someone to say, how dare you!

This is why everyone stopped calling their money shit, because nobody wants to play with shit.

But did you see that OSHA was putting on a workplace training in Ashland of all places. Chapping my hide with this sense of burnout. Seems a pseudonym named Mr. Crapo is the instructor for this training. His organization's acronym is SOC ASS POO. Look alive ladies! We can't pay enough for epic trolls collecting multiple pensions while spending infrastructure grant dollars on THE PHANTOM PENIS only knows what!

Artists typecast themselves and industry casts out to classes of what? Who is showing up to the future we failed to create? I don't know. Ask that question. That is where you will find your gambling debts. NOBODY, and I MEAN IT, literally nobody wants to pay that true bill in the future. How can I predict this? Because you didn't want to pay it now.
