dxu comments on Symbol/Referent Confusions in Language Model Alignment Experiments

dxu 28 Oct 2023 6:42 UTC
4 points
0

The main way I’d imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is “trying” to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what’s going on, and some of those subgoals won’t play well with shutdown. That’s the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.

Something like this story, perhaps?