Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, and blocking those strategies will become much more difficult as AGIs’ instrumental reasoning skills improve.
My main disagreement with the post is with the claim that goal misgeneralization comes after situational awareness. Weak versions of goal misgeneralization are already happening all the time, from toy RL experiments to production AI systems suffering from “training-serving skew”. We can study it today and learn a lot about the specific ways goals misgeneralize (see the toy sketch below). In contrast, we probably can’t study the effects of high levels of situational awareness with current systems.
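As a concrete illustration of the kind of weak goal misgeneralization we can already study, here is a minimal sketch, entirely my own toy setup rather than anything from the post: a model is trained while a spurious proxy feature is perfectly correlated with the intended goal, then deployed after that correlation breaks. It uses plain logistic regression as a stand-in for an RL policy, but the failure shape is the one I mean by training-serving skew: the learned objective tracks the proxy rather than the goal.

```python
# Toy sketch of goal misgeneralization via training-serving skew (illustrative
# assumptions only): during training a spurious proxy is perfectly correlated
# with the intended goal, so the model latches onto the proxy; at deployment
# the correlation breaks and performance on the intended goal collapses.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, proxy_correlated):
    goal = rng.choice([-1.0, 1.0], n)          # what the designer actually cares about
    proxy = goal if proxy_correlated else rng.choice([-1.0, 1.0], n)
    x = np.stack([goal + rng.normal(scale=1.0, size=n),     # noisy signal for the true goal
                  proxy + rng.normal(scale=0.05, size=n)],  # crisp signal for the proxy
                 axis=1)
    y = (goal > 0).astype(float)
    return x, y

# "Training": logistic regression by gradient descent, with proxy == goal throughout.
x_tr, y_tr = make_data(5000, proxy_correlated=True)
w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(x_tr @ w)))
    w -= 0.1 * x_tr.T @ (p - y_tr) / len(y_tr)

def accuracy(x, y):
    return float(np.mean((x @ w > 0) == (y > 0.5)))

# "Deployment": the proxy decorrelates from the goal; behaviour now follows the proxy.
x_te, y_te = make_data(5000, proxy_correlated=False)
print("weights (goal feature, proxy feature):", w)  # proxy weight dominates
print("train accuracy:", accuracy(x_tr, y_tr))      # typically close to 1.0
print("deploy accuracy:", accuracy(x_te, y_te))     # typically close to chance
```

The point of the toy setup is just that the whole failure is measurable end to end with current tools; nothing comparably direct exists yet for studying high levels of situational awareness.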
The problems in the earlier phases are more likely to be solved by default as the field of ML progresses.
I think this is certainly not true if you mean the problem of “situational awareness”. Even if the problem is “weakness of human supervisors”, I still don’t think it will be solved by default: the reinforcement learning from human preferences paper was published in 2017, yet very few leading AI systems actually use RLHF, preferring even simpler and less scalable forms of human supervision. I think it’s reasonably likely that even in worlds where scalable supervision ideas like IDA, debate, or factored cognition could have saved us, they just won’t be built, due to the engineering challenges involved.
The most valuable research of this type will likely require detailed reasoning about how proposed alignment techniques will scale up to AGIs, rather than primarily trying to solve early versions of these problems which appear in existing systems.
I would love to see this point defended more. I don’t have a strong opinion, but I weakly expect the most valuable research to come from attempts to align narrowly superhuman models rather than from detailed reasoning about scaling, though we definitely need more of both. To use Steinhardt’s analogy to the first controlled nuclear reaction: we understand the AI equivalent of nuclear chain reactions well enough conceptually; what we need is the equivalent of cadmium rods and measurements of criticality, and if we find those, it will probably be through deep engagement with the details of current systems and how they are trained and deployed.
This is the clearest justification of “capabilities generalize further than alignment” that I’ve seen. Bravo!