How would they be structurally identical? (If someone else understands this, please feel free to jump in and explain.)
AlphaZero is exactly the same as this: you want to explore an exponentially large search tree, but you can’t do that. Instead you explore a small part of the search tree. Then you train a model to quickly (lossily) imitate that search. Then you repeat the process, using the learned model at the leaves to effectively search a deeper tree. (Also see Will’s comment.)
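Here is a minimal, self-contained sketch of that loop: search a small part of the tree explicitly, distill the search results into a model, then reuse the model at the leaves of the next round's search. The toy environment, the table-based "model", and names like `search_value` and `distill` are invented for illustration; a real AlphaZero-style system uses MCTS and a neural network in their place.

```python
import random

# Toy deterministic "environment": states are integers, actions move to a
# neighbouring state, and value is only visible from certain states.
ACTIONS = [-2, -1, 1, 2]

def step(state, action):
    return state + action

def base_value(state):
    # A signal that shallow inspection mostly misses.
    return 1.0 if state % 7 == 0 else 0.0

# The learned model: here just a table from state -> estimated value.
# (A neural network plays this role in AlphaZero.)
model = {}

def model_value(state):
    return model.get(state, base_value(state))

def search_value(state, depth):
    """Explore a small part of the tree explicitly, using the learned model
    at the leaves -- which is what makes each round behave like a deeper
    search than the last."""
    if depth == 0:
        return model_value(state)
    return max(search_value(step(state, a), depth - 1) for a in ACTIONS)

def distill(states, depth):
    """Train the model to quickly (lossily) imitate the search results."""
    for s in states:
        target = search_value(s, depth)
        model[s] = model.get(s, 0.0) + 0.5 * (target - model.get(s, 0.0))

# Repeat: shallow search -> distill -> shallow search with a better model ...
for round_ in range(5):
    visited = [random.randint(-20, 20) for _ in range(200)]
    distill(visited, depth=2)

print({s: round(v, 2) for s, v in sorted(model.items())[:10]})
```

Each round the distilled model incorporates what the previous round's explicit search found, so a fixed-depth search on top of it reaches effectively further into the tree.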
Do you consider this problem to be inside your problem scope? I’m guessing yes but I’m not sure and I’m generally still very confused about this.
For now let’s restrict attention to the particular RL algorithms mentioned in the post, to make definitions clearer.
By default these techniques yield an unaligned AI.
I want a version of those techniques that produces aligned AI, which is trying to help us get what we want.
That aligned AI may still need to do dangerous things, e.g. “build a new AI” or “form an organization with a precise and immutable mission statement” or whatever. Alignment doesn’t imply “never has to deal with a difficult situation again,” and I’m not (now) trying to solve alignment for all possible future AI techniques.
We would have encountered those problems even if we replaced the aligned AI with a human. If the AI is aligned, it will at least be trying to solve those problems. But even so, it may fail. And separately from whether we solve the alignment problem, we may build an incompetent AI (e.g. it may be worse at solving the next round of the alignment problem).
The goal is to get out an AI that is trying to do the right thing. A good litmus test is whether the same problem would occur with a secure human. (Or with a human who happened to be very smart, or with a large group of humans...). If so, then that’s out of scope for me.
To address the example you gave: to the extent that doing some optimization is necessary to perform as well as the RL techniques we are discussing, doing that optimization without introducing misalignment is in scope.
There may be other optimizations or heuristics that an RL agent (or an aligned human) would eventually use in order to perform well, e.g. using a certain kind of external aid. That’s out of scope, because we aren’t trying to compete with all of the things that an RL agent will eventually do (as you say, a powerful RL agent will eventually learn to do everything...); we are trying to compete with the RL algorithm itself.
We need an aligned version of the optimization done by the RL algorithm, not all optimization that the RL agent will eventually decide to do.
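To make that distinction concrete, here is a hedged toy sketch (not anyone's proposed scheme; `ToyPolicy`, `reward`, and `external_tool_search` are invented for illustration). The comments mark which optimization is done by the RL algorithm during training versus decided on by the trained agent at runtime.

```python
import random

class ToyPolicy:
    """A one-parameter 'policy': the probability of choosing action 1."""
    def __init__(self):
        self.p = 0.5

    def act(self):
        return 1 if random.random() < self.p else 0

def reward(action):
    return 1.0 if action == 1 else 0.0

policy = ToyPolicy()

# (1) The optimization done by the RL *algorithm*: the update rule that
#     shapes the policy during training. This is the optimization we would
#     need an aligned counterpart of.
for _ in range(1000):
    a = policy.act()
    r = reward(a)
    # Crude nudge toward whichever action did better than average.
    policy.p += 0.01 * (r - 0.5) * (1 if a == 1 else -1)
    policy.p = min(max(policy.p, 0.0), 1.0)

# (2) Optimization the trained *agent* might later decide to perform at
#     runtime, e.g. calling some external search or planning aid. On the
#     view above, competing with this is out of scope.
def external_tool_search(options):
    return max(options, key=reward)  # stand-in for whatever aid the agent uses

print(round(policy.p, 2), external_tool_search([0, 1]))
```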