I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’”.
I agree.
See §5.2 of that same post for that argument.
I think my delta relative to this view is that agency is sufficiently complex and non-unique that there’s an endless variety of pseudo-agencies which can be developed just as easily as full agency, so long as they receive the appropriate reinforcement. So reasoning of the form “X selection criterion benefits from full agency in pursuit of Y, therefore full agency in pursuit of Y will develop” is invalid; what will happen instead is “full agency in pursuit of Y is a worse solution to X than Z is, so selection for X will select for Z”, mainly because there are a lot of Zs.
Basically, I postulate that the whole “raters make systematic errors—regular, compactly describable, predictable errors” aspect means that you get lots of evidence to support some other notion of agency.
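To make the counting point concrete, here is a minimal toy sketch (my own illustration, not part of the original exchange), assuming one “full agency” candidate and many pseudo-agency candidates Z that fit the selection criterion X about equally well; the names and numbers (N_PSEUDO, N_TRIALS, the Gaussian noise standing in for rater error) are all illustrative assumptions:

```python
# Toy sketch of the counting argument: one "full agency" candidate vs. many
# pseudo-agency candidates Z that score about equally well on criterion X.
# The Gaussian noise loosely stands in for rater error; every number here is
# an illustrative assumption, not anything from the original comment.
import random

random.seed(0)

N_PSEUDO = 1000    # assumed number of distinct pseudo-agency solutions Z
N_TRIALS = 10_000  # independent selection runs

full_agency_wins = 0
for _ in range(N_TRIALS):
    full_agency_score = random.gauss(1.0, 0.1)
    pseudo_scores = (random.gauss(1.0, 0.1) for _ in range(N_PSEUDO))
    if full_agency_score > max(pseudo_scores):
        full_agency_wins += 1

print(f"full agency selected in {full_agency_wins / N_TRIALS:.2%} of runs")
# With equally good candidates, this comes out near 1 / (N_PSEUDO + 1), i.e. ~0.1%.
```

The sketch is purely about the combinatorics: when many Zs fit the selection data about as well, selecting for X almost never singles out full agency in particular.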
For example, it’s conceivable that an AI can pull off a treacherous turn on its first try
I think it’s most likely if you have some AI trained by some non-imitation-learning self-supervised method (e.g. self-play), and then you fine-tune it with RLHF. Here it would be the self-supervised learning that incentivizes the misaligned power-seeking, with RLHF merely failing to prevent it.
it’s conceivable that an AI can pull off a treacherous turn on its first try [...if...] an AI trained by some non-imitation-learning self-supervised method (e.g. self-play)
It depends on the type of self-play. If the self-play is entirely between AIs, with no other human-like parties in the environment, I agree, because these self-play AIs could learn to cooperate/scheme very powerfully. But if the environment contains (simulations of) human(-like) agents, including intentionally weak ones, and the evaluation includes scoring factors like care and collaboration with them, then it might look different.
I think you’d actually need some human-like entities present in order for the AI to learn to deceive humans specifically.
Yes, but it makes a difference whether your environment is composed entirely of singular agents of the same kind (the self-playing AI) or whether it has a variety of simulated agents acting in complex social structures, where the behavior of the self-play AI within those social structures is scored.
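As a way of pinning down the distinction being drawn here, the sketch below (my own; the names EnvConfig, score_episode, and care_collaboration_weight are hypothetical) contrasts a pure self-play environment with a mixed-population environment where care/collaboration with simulated human-like agents enters the score:

```python
# Hedged sketch of the two setups contrasted above; not from the thread, and
# every name here is hypothetical.
from dataclasses import dataclass

@dataclass
class EnvConfig:
    # Simulated agents present besides copies of the learner itself.
    simulated_humanlike_agents: int = 0
    include_intentionally_weak_agents: bool = False
    # Weight on care/collaboration with those agents in the episode score.
    care_collaboration_weight: float = 0.0

# Setup 1: self-play entirely between AIs; only task success is scored.
pure_self_play = EnvConfig()

# Setup 2: a variety of simulated agents in complex social structures, where
# how the self-play AI behaves toward them is part of what gets scored.
mixed_social_env = EnvConfig(
    simulated_humanlike_agents=20,
    include_intentionally_weak_agents=True,
    care_collaboration_weight=0.5,
)

def score_episode(task_reward: float, care_score: float, cfg: EnvConfig) -> float:
    """Combine task success with how the learner treated the simulated agents."""
    return task_reward + cfg.care_collaboration_weight * care_score
```

Under the first setup the selection pressure only ever runs through out-playing other copies of the same AI; under the second, part of it runs through how weaker, human-like parties are treated.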