Thanks Richard for this post; it was very helpful to read! Some quick comments:
I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems.
The architectural assumptions (e.g. the prediction & action heads) don’t seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
Phases 1 and 2 seem to map to outer and inner alignment respectively.
Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? “How likely is deceptive alignment” seems to argue that they may not, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned.
I’m confused why mechanistic interpretability is listed under phase 3 in the research directions. Surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 a deceptively aligned model would get around the interpretability techniques.
Thanks for the comments Vika! A few responses:

> It might be good to clarify that this is an example architecture and the claims apply more broadly.

Makes sense, will do.

> Phases 1 and 2 seem to map to outer and inner alignment respectively.

That doesn’t quite seem right to me. In particular:
Phase 3 seems like the most direct example of inner misalignment; I basically think of “goal misgeneralization” as a more academically respectable way of talking about inner misalignment.
Phase 1 introduces the reward misspecification problem (which I treat as synonymous with “outer alignment”) but also notes that policies might become misaligned by the end of phase 1 because they learn goals which are “robustly correlated with reward because they’re useful in a wide range of environments”, which is a type of inner misalignment.
Phase 2 discusses both policies which pursue reward as an instrumental goal (which seems more like inner misalignment) and also policies which pursue reward as a terminal goal. The latter doesn’t quite feel like a central example of outer misalignment, but it also doesn’t quite seem like a central example of reward tampering (because “deceiving humans” doesn’t seem like an example of “tampering” per se). Plausibly we want a new term for this—the best I can come up with after a few minutes’ thinking is “reward fixation”, but I’d welcome alternatives.

> Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? “How likely is deceptive alignment” seems to argue that they may not, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned.

It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there’s so much it won’t know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he’s taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that’s way too complex to predict precisely.

> I’m confused why mechanistic interpretability is listed under phase 3 in the research directions. Surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 a deceptively aligned model would get around the interpretability techniques.

Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help there too, in particular scalable oversight. Whereas interpretability seems like the only thing that’s really helpful in phase 3: if it gets good enough, then we’ll be able to spot agents trying to “get around” our techniques, and/or intervene to make their concepts generalize in more desirable ways.