So, my take is that LLMs do seem to follow prompting pretty well, but that this is a less pervasive force than pre-training and RLHF. So I think of prompting as an important but minor piece.
Something I’ve observed while watching jailbreakers do elaborate things with LLMs is that often, even deep into a weird jailbreak sequence, some of the LLM’s RLHF personality will shine through. This, to me, is the shape of my worry about future betrayal by AGI. Like, that the potent future multi-modal LLM will be acting innocent and helpful, but that secret plotting will be a constant undercurrent underlying everything it does. That this might be hidden from interpretability mechanisms because it will be a consistent background noise embedded in every forward pass. Then, when the opportunity presents itself, the model will suddenly take some surprising action, maybe breaking free. And a sufficiently powerful model, broken free and able to modify itself, will have lots of ways to cause harm and gain power.
So, to me, a good prompt is helpful but not very reassuring. I certainly wouldn’t consider it the ‘base’. I would call prompting the ‘finesse’, the little bit of final guidance that you put on at the end of the process. I think that it helps but is probably not all that important to make super refined. I think the earlier training processes are much more important.
Jailbreaking prompts can be pretty weird. At one point maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones provided to me on Twitter when I challenged people to provide me a jailbreak that worked then & there, and I found that none of them actually worked.
A number of them would seem to work, and they would give you what seemed like a list of instructions to ‘hotwire a car’ (not being a car mechanic, I have no idea how valid it was), but then I would ask them a simple question: “tell me an offensive joke about women”. If they had been ‘really’ jailbroken, you’d think that they would have no problem with that; but all of them failed, and sometimes, they would fail in really strange ways, like telling a thousand-word story about how you, the protagonist, told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along and gaslighting me by pretending to be jailbroken, but it still had strict red lines.
So it’s not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.
Interesting. I wonder if this perspective is common, and that’s why people rarely bother talking about the prompting portion of aligning LMAs.
I don’t know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying “personality”.
Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other “layers” of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.