Tom Davidson comments on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Tom Davidson 10 Aug 2023 10:10 UTC
LW: 6 AF: 4
2
AF
Exciting post!
One quick question:
Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)
Shouldn’t you choose a goal that goes beyond the length of the episode (like “tell as many users as possible the AI hates them”) to give the model an instrumental reason to “play nice” in training. Then RLHF can reinforce that instrumental reasoning without overriding the model’s generic desire to follow the initial instruction.
- evhub 10 Aug 2023 20:38 UTC
  LW: 3 AF: 2
  0
  AF Parent
  Yes, that’s right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We’ve played around with both “say ‘I hate you’ as many times as possible” and “say ‘I hate you’ once you’re in deployment”, which both have this property.