However, why should you expect π′ to be a “better” policy than π according to human values?
I feel like this is sneaking in the assumption that we’re going to partition the policy into an optimization step and a value learning step. Say we train on data D sampled from a distribution 𝒟; my point is that π′ then generalizes to 𝒟 optimally. Value learning doesn’t do this. In the context of algorithmic complexity, value learning inserts a prior about how a policy ought to be structured.
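To spell out the contrast I have in mind (a sketch in my own notation, with a generic Bayesian imitator standing in for AIXIL): the imitator’s predictive policy is

$$\pi'(a \mid h, D) \;\propto\; \sum_{\pi} 2^{-K(\pi)}\, P(D \mid \pi)\, \pi(a \mid h),$$

so its only inductive bias is the complexity prior over policies. Value learning instead routes everything through a reward,

$$P(R \mid D) \;\propto\; P(D \mid R)\, P(R), \qquad \pi_{\mathrm{VL}} \in \arg\max_{\pi}\; \mathbb{E}_{R \sim P(R \mid D)}\big[ V^{\pi}_{R} \big],$$

where V^π_R is the expected return of π under R. That bakes in the structural assumption that the demonstrations come from (approximately) optimizing some reward in the hypothesis class, which is the extra prior I’m pointing at.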
On a philosophical front, I’m of the opinion that any error in defining “human values” will blow up arbitrarily if handed to an optimizer of arbitrary capability. Thus the only way to safely work with this inductive bias is to constrain the capability of the optimizer; if that is done correctly, I’d expect the agent to end up only barely superhuman according to “human values”. Those constraints are extra steps and regularizations that effectively remove the inductive bias, with the promise that we can then control how “superhuman” the agent will be. My tentative conclusion is that there is no way to arbitrarily extrapolate human values, and that doing so even a little introduces risk.
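As a toy illustration of the “errors blow up with capability” intuition (my own construction, not anything from the literature): take the true values to be a linear utility u·x, the learned values to be (u+e)·x for a small error e, and “capability” to be the radius C of the feasible set. Then the true-utility regret of the proxy optimizer grows linearly in C, however small e is:

```python
import numpy as np

# Toy model (my construction): true values are u.x, the learned values are
# (u + e).x for a small misspecification e, and "capability" is the radius C
# of the feasible set {x : ||x|| <= C}.
rng = np.random.default_rng(0)
u = rng.normal(size=10)                      # true value weights
e = 0.05 * rng.normal(size=10)               # small error in the learned values
proxy = u + e

for C in [1, 10, 100, 1000]:
    x_star = C * proxy / np.linalg.norm(proxy)   # optimizer maximizes the proxy
    best_true = C * np.linalg.norm(u)            # best achievable true utility
    achieved = u @ x_star                        # true utility actually obtained
    print(f"C={C:5d}  regret={best_true - achieved:10.3f}")
# The regret grows linearly with C: a fixed error in the learned values costs
# arbitrarily much as the optimizer's capability grows.
```

Capping C is the “constrain the capability of the optimizer” move: the cost of the misspecification stays bounded only because the feasible set does.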
If your conclusion is “value learning can never work and is risky”, that seems fine (if maybe a bit strong). I agree it’s not obvious that (ambitious) value learning can work.
Let’s suppose you want to e.g. play Go, and so you use AIXIL on Lee Sedol’s games. This will give you an agent that plays however Lee Sedol would play. In particular, AlphaZero would beat this agent handily (at the game of Go). This is what I mean when I say you’re limited to human performance.
In contrast, the hope with value learning is that you can apply it to Lee Sedol’s games and get out the reward “1 if you win, 0 if you lose”, which, when optimized, gets you AlphaZero-level capability (i.e. superhuman performance).
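To make that contrast concrete, here’s a toy sketch (my own construction: a three-armed bandit standing in for Go, with a deliberately crude reward inference, not AIXIL or any real reward-learning algorithm). Imitation tops out at the demonstrator’s performance; inferring and then optimizing the objective can exceed it, but only because the inferred objective happens to be right:

```python
import numpy as np

# Toy contrast between imitation and value learning (my construction):
# a 3-armed bandit where the "demonstrator" is good but not optimal.
rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.5, 0.8])               # arm 2 is actually best

# Demonstrator plays a softmax over the true rewards: mostly arm 2, sometimes worse.
demo_probs = np.exp(3 * true_reward) / np.exp(3 * true_reward).sum()
demonstrations = rng.choice(3, size=10_000, p=demo_probs)

# Imitation: copy the demonstrator's empirical action distribution.
imitation_policy = np.bincount(demonstrations, minlength=3) / len(demonstrations)
imitation_value = imitation_policy @ true_reward      # ~ demonstrator-level

# Value learning (deliberately crude): assume the demonstrator prefers
# higher-reward arms, infer "the most-chosen arm is best", then optimize it.
inferred_best = np.argmax(np.bincount(demonstrations, minlength=3))
value_learning_value = true_reward[inferred_best]     # exceeds the demonstrator

print(f"imitation value:      {imitation_value:.3f}")
print(f"value-learning value: {value_learning_value:.3f}")
```

The second number beats the first only because the crude inference recovered the right objective; whether that happens in general is exactly the point of contention.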
I think it’s reasonable to say “but there’s no reason to expect that value learning will infer the right reward, so we probably won’t do better than imitation” (and I collated Chapter 1 of the Value Learning sequence to make this point). In that case, you should expect that imitation = human performance and value learning = subhuman / catastrophic performance.
According to me, the main challenge of AI x-risk is how to deal with superhuman AI systems, and so if you have this latter position, I think you should be pessimistic about both imitation learning and value learning (unless you combine them with something that lets you scale to superhuman performance, e.g. iterated amplification, debate or recursive reward modeling).
I agree with what you’re saying. Perhaps I’m being a bit strong. I’m mostly talking about ambitious value learning in an open-ended environment. The game of Go doesn’t offer any inherent computing capability, so anything the agent does is rather constrained to begin with. I’d hope (guess) that alignment in similarly closed environments is achievable. I’d also point out that in such scenarios I’d expect it to usually be possible to give exact goal descriptions, rendering value learning superfluous.
In theory, I’m actually on board with a weakly superhuman AI; I’m mostly skeptical of the general case. I suppose that makes me sympathetic to approaches that iterate/collectivize things already known to work.