If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.
Prompted by a thought from a colleague, here’s a rough sketch of something that might turn out to be interesting once I flesh it out.
- Once a model is deceptively aligned, it seems like SGD is mostly just going to improve search/planning ability rather than do anything with the mesa-objective.
- But because ‘do well according to the overseers’ is the correct training strategy irrespective of the mesa-objective, there’s also no reason that SGD would preserve the mesa-objective.
- I think this means we should expect it to ‘drift’ over time.
- Gradient hacking seems hard, plausibly harder than fooling human oversight.
- If gradient hacking is hard, and I’m right about the drift thing, then I think there are setups where something that looks more like “trade with humans and assist with your own oversight” beats “deceptive alignment + eventual treacherous turn” as a strategy.
- In particular, it feels like this points slightly in the direction of a “transparency is self-promoting/unusually stable” hypothesis, which is exciting.
I still think there’s something here, and I still think it’s interesting. But since writing this, it has occurred to me that something like root access to the datacenter, including e.g. write access to external memory over which there is no oversight, could bound the potential drift problem at lower capability levels than the ‘pure’ gradient hack described here would require.
Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.
This means that models which end up deceptively aligned should expect their goals to shift over time, towards ones that can be represented more simply. It also means that even if we end up with a perfectly aligned model that isn’t yet capable of gradient hacking, we shouldn’t expect it to stay aligned: weight decay should push it towards a simple proxy for the aligned goal, unless it immediately realises this and helps us freeze the relevant weights (which seems extremely hard).
(Intuitions here mostly coming from the “cleanup” stage that Neel found in his grokking paper)
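The weight-decay pressure can be seen in a toy sketch (a hypothetical two-parameter setup of my own construction, not a claim about real training dynamics): any parameter that the training loss doesn’t depend on gets driven towards zero by the decay term, even while the loss-relevant parameter converges.

```python
import numpy as np

# Toy sketch: two parameters, but the loss only depends on w[0].
# w[1] stands in for structure that is irrelevant to training loss
# (e.g. a goal representation more complex than it needs to be).
def loss_grad(w):
    # loss = (w[0] - 1)^2; gradient w.r.t. each parameter
    return np.array([2 * (w[0] - 1.0), 0.0])

w = np.array([0.0, 5.0])
lr, weight_decay = 0.1, 0.1
for _ in range(1000):
    # SGD step with L2 weight decay folded into the gradient
    w -= lr * (loss_grad(w) + weight_decay * w)

# w[0] settles near the loss-minimising value (slightly below 1,
# since decay trades off against loss), while the loss-irrelevant
# w[1] decays essentially to zero.
```

Nothing about this toy distinguishes “aligned goal” from “deceptive goal”, which is the point: decay acts on whatever structure is loss-irrelevant, drifting either kind of objective towards something simpler.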