I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’”. Admittedly it did take many patient attempts for Holden to get that point through my thick skull ¯\_(ツ)_/¯ So it’s nice that you wrote this.
I do however think there are reasons for concern if you keep the setup (“incentives”) the same but allow arbitrarily complicated Π, including much more complicated than an LLM (e.g., instead of “the AI does one forward pass through GPT-4 and then prints a character”, replace it with “the AI does arbitrary opaque brainstorming for six hours with a giant hidden scratchpad and free access to a cached copy of the internet, and then it prints a message”)—see §5.2 of that same post for that argument.
For example, it’s conceivable that an AI can pull off a treacherous turn on its first try, and if you’re trying to reason about whether that’s gonna happen or not, you can’t straightforwardly do so via your Π* expression—(1) the treacherous-turn output is very unlike any of the AI’s previous outputs (by assumption), and (2) as soon as the AI emits such an output, it’s game over. Again see that post for my (somewhat casual and intuitive) attempt to reason about that possibility.
I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’”.
I agree.
see §5.2 of that same post for that argument
I think my delta relative to this view is that agency is sufficiently complex and non-unique that there’s an endless variety of pseudo-agencies that can just as easily be developed as full agency, as long as they receive the appropriate reinforcement. So reasoning of the form “X selection criterion benefits from full agency in pursuit of Y, therefore full agency in pursuit of Y will develop” is invalid; what will happen instead is “full agency in pursuit of Y is a worse solution to X than Z is, so selection for X will select for Z”, mainly due to there being a lot of Zs.

Basically, I postulate that the whole “raters make systematic errors—regular, compactly describable, predictable errors” aspect means that you get lots of evidence to support some other notion of agency.
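To make the “lots of Zs” point concrete, here is a toy sketch, entirely invented for illustration: the states, the stand-in “full agency” rule, and the selection criterion are all made up, not anything from the post. The criterion only scores behavior on a finite training distribution, so it cannot tell the single “fully agentic” rule apart from the many shallow policies that behave identically on that distribution.

```python
import itertools

TRAIN_STATES = list(range(8))    # states the selection criterion actually sees
ALL_STATES = list(range(16))     # states the deployed policy will eventually face
ACTIONS = (0, 1)

def full_agency_policy(state: int) -> int:
    # Stand-in for "full agency in pursuit of Y": one coherent rule.
    return 1 if state % 3 == 0 else 0

def fitness(policy) -> float:
    # Selection criterion X: agreement with the desired behavior,
    # measured only on the training distribution.
    return sum(policy(s) == full_agency_policy(s) for s in TRAIN_STATES) / len(TRAIN_STATES)

# Enumerate every deterministic lookup-table policy over ALL_STATES and count
# how many score perfectly under X while disagreeing with full agency
# somewhere off-distribution.
lookalikes = 0
for table in itertools.product(ACTIONS, repeat=len(ALL_STATES)):
    policy = lambda s, t=table: t[s]
    if fitness(policy) == 1.0 and any(
        policy(s) != full_agency_policy(s)
        for s in ALL_STATES if s not in TRAIN_STATES
    ):
        lookalikes += 1

print(lookalikes)  # 255: every off-distribution completion except the fully
                   # agentic one, i.e. 2**8 - 1 distinct Zs that X rewards
                   # exactly as much as full agency.
```

This only illustrates the multiplicity half of the argument: the criterion exerts no pressure against the Zs, and there are a lot of them; it doesn’t by itself show the Zs being actively favored over full agency.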
For example, it’s conceivable that an AI can pull off a treacherous turn on its first try
I think that’s most likely if you have an AI trained by some non-imitation-learning self-supervised method (e.g. self-play), and then you fine-tune it with RLHF. Here it would be the self-supervised learning that incentivizes the misaligned power-seeking, with RLHF merely failing to prevent it.
it’s conceivable that an AI can pull off a treacherous turn on its first try [...if...] an AI trained by some non-imitation-learning self-supervised method (e.g. self-play)
It depends on the type of self-play. If the self-play is entirely between AIs, with no other human-like parties in the environment, I agree, because these self-play AIs could learn to cooperate/scheme very powerfully. But if the environment contains (simulations of) human(-like) agents, including intentionally weak ones, and the evaluation includes scoring factors like care and collaboration with them, then it might look different.
I think you’d actually need the presence of some human-like entities in order for the AI to learn to deceive humans specifically.

Yes, but it makes a difference whether your environment is composed entirely of singular agents of the same kind (the self-playing AI) or whether it has a variety of simulated agents acting in complex social structures, where the behavior of the self-play AI within that social structure is scored.
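For concreteness, here is a minimal sketch of the two setups being contrasted. All names, agent types, and weights are invented for illustration; this is not any real training configuration.

```python
# Sketch of the contrast: pure self-play among copies of the AI, scored only on
# task success, versus a mixed population including intentionally weak
# human-like agents, with care/collaboration explicitly part of the score.

from dataclasses import dataclass, field

@dataclass
class EnvironmentConfig:
    name: str
    agent_population: dict[str, int]              # who the self-play AI interacts with
    score_weights: dict[str, float] = field(default_factory=dict)  # what the evaluation rewards

pure_self_play = EnvironmentConfig(
    name="pure self-play",
    agent_population={"self_play_ai_copy": 8},
    score_weights={"task_success": 1.0},
)

mixed_social = EnvironmentConfig(
    name="mixed social environment",
    agent_population={
        "self_play_ai_copy": 2,
        "simulated_human_strong": 3,
        "simulated_human_weak": 3,   # intentionally weak parties
    },
    score_weights={
        "task_success": 0.5,
        "care_for_weak_agents": 0.3,
        "collaboration": 0.2,
    },
)

def episode_score(config: EnvironmentConfig, measurements: dict[str, float]) -> float:
    # Weighted sum of whatever the evaluation chooses to measure. Terms with
    # zero weight (e.g. care for weak agents under pure self-play) exert no
    # selection pressure at all, which is the point of the contrast.
    return sum(w * measurements.get(term, 0.0) for term, w in config.score_weights.items())

# The same ruthless behavior is scored very differently under the two setups.
behavior = {"task_success": 1.0, "care_for_weak_agents": 0.0, "collaboration": 0.1}
print(episode_score(pure_self_play, behavior))  # 1.0  (ruthlessness is optimal)
print(episode_score(mixed_social, behavior))    # ~0.52 (care and collaboration now matter)
```

Whether the extra score terms actually shape what the AI learns, rather than just what it displays during evaluation, remains the open question; the sketch only makes the difference in selection pressure explicit.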