I’m not totally sure what you’re asking, but some thoughts:
Yes, if your goal is to recover a policy (i.e. imitation learning), then value learning is only one approach.
Yes, you can recover a policy by supervised learning on a dataset of the policy’s behavior. This could be done with neural nets, or it could be done with Bayesian inference with the Solomonoff prior. Either approach would work with enough data (though we don’t know how much), and neither of them inherently learns values (though they may do so as an instrumental strategy).
If you imitate a human policy, you are limiting yourself to human performance. The original hope of value learning was that if a more capable agent optimized the learned reward, you could get to superhuman performance, something that AIXIL would not do.
I guess I’m confused by your third point. It seems clear that AIXI optimized on any learned reward function will have superhuman performance. However, AIXI is completely unaligned via wireheading. My point with the Kolmogorov argument is that AIXIL is much more likely to behave reasonably than AIXVL. Almost by definition, AIXIL will generalize most similarly to a human. Moreover, any value learning attempt will have worse generalization capability. I’m hesitant, but it seems I should conclude value alignment is a non-starter.
If you generate a dataset D from a policy π (e.g. human behavior) and then run AIXIL(D) and get policy π′, you can expect that KL(π′)≤KL(π). I think you could claim “as long as D is big enough, the best compression is just to replicate the human decision process, and so we’ll have π′=π”.
Alternatively, you could claim that you’ll find an even better compression of D than the human policy π. In that case, you expect π′≠π and π′ is lower KL-complexity than π. However, why should you expect π′ to be a “better” policy than π according to human values?
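To spell out where the inequality comes from, here is a rough sketch. It assumes that AIXIL returns a lowest-complexity policy among those that reproduce D (my reading of the setup, not a definition taken from the post), with KL denoting Kolmogorov complexity and ρ ranging over candidate policies:

```latex
\pi' \in \arg\min_{\rho \,:\, \rho \text{ reproduces } D} \mathrm{KL}(\rho),
\qquad
\pi \text{ reproduces } D
\;\Longrightarrow\;
\mathrm{KL}(\pi') \le \mathrm{KL}(\pi).
```

Since π generated D, π is itself one of the candidates, so the minimizer can never be more complex than π. Nothing in this argument says anything about how π′ scores under human values, which is the question above.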
However, why should you expect π′ to be a “better” policy than π according to human values?
I feel like this is sneaking in the assumption that we’re going to partition the policy into an optimization step and a value learning step. Say we train using data D sampled from a distribution 𝒟; then my point is that π′ generalizes to 𝒟 optimally. Value learning doesn’t do this. In the context of algorithmic complexity, value learning inserts a prior about how a policy ought to be structured.
On a philosophical front, I’m of the opinion that any error in defining “human values” will blow up arbitrarily if given to an optimizer with arbitrary capability. Thus, the only way to safely work with this inductive bias is to constrain the capability of the optimizer. If this is done correctly, I’d assume the agent will only be barely superhuman according to “human values”. These are extra steps and regularizations that effectively remove the inductive bias with the promise that we can then control how “superhuman” the agent will be. My conclusion (tentatively) is that there is no way to arbitrarily extrapolate human values and doing so, even a little, introduces risk.
If your conclusion is “value learning can never work and is risky”, that seems fine (if maybe a bit strong). I agree it’s not obvious that (ambitious) value learning can work.
Let’s suppose you want to e.g. play Go, and so you use AIXIL on Lee Sedol’s games. This will give you an agent that plays however Lee Sedol would play. In particular, AlphaZero would beat this agent handily (at the game of Go). This is what I mean when I say you’re limited to human performance.
In contrast, the hope with value learning was that you can apply it to Lee Sedol’s games, and get out the reward “1 if you win, 0 if you lose”, which when optimized gets you AlphaZero-levels of capability (i.e. superhuman performance).
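As a minimal sketch of this contrast, here is a toy chain world rather than Go. Everything in it is an assumption made for illustration: the 6-state chain, the 70%-accurate demonstrator, and the use of plain value iteration as a stand-in for the "more capable optimizer". It also hands the planner the correct reward directly, so it shows only the best case for value learning, not the inference step itself.

```python
import random

N_STATES = 6        # states 0..5; reaching state 5 counts as a win
HORIZON = 8         # steps allowed per episode
GOAL = N_STATES - 1

def step(state, action):
    """Move left (-1) or right (+1), clipped to the ends of the chain."""
    return min(max(state + action, 0), GOAL)

def demonstrator(state, rng):
    """Hypothetical imperfect expert: steps right only 70% of the time."""
    return 1 if rng.random() < 0.7 else -1

def collect(policy, n_episodes, rng):
    """Record (state, action) pairs from rollouts of `policy`."""
    data = []
    for _ in range(n_episodes):
        s = 0
        for _ in range(HORIZON):
            a = policy(s, rng)
            data.append((s, a))
            s = step(s, a)
            if s == GOAL:
                break
    return data

def clone(data):
    """Imitation: estimate P(right | state) by counting, then sample from it."""
    right = [0] * N_STATES
    total = [0] * N_STATES
    for s, a in data:
        total[s] += 1
        right[s] += (a == 1)
    p_right = [r / t if t else 0.5 for r, t in zip(right, total)]
    return lambda s, rng: 1 if rng.random() < p_right[s] else -1

def plan(gamma=0.9, sweeps=50):
    """Value iteration on the reward '1 for entering GOAL, 0 otherwise'."""
    v = [0.0] * N_STATES

    def q(s, a):
        nxt = step(s, a)
        return 1.0 if nxt == GOAL else gamma * v[nxt]

    for _ in range(sweeps):
        v = [0.0 if s == GOAL else max(q(s, a) for a in (-1, 1))
             for s in range(N_STATES)]
    best = [max((-1, 1), key=lambda a: q(s, a)) for s in range(N_STATES)]
    return lambda s, rng: best[s]

def win_rate(policy, n_episodes, rng):
    """Fraction of episodes in which `policy` reaches GOAL within HORIZON."""
    wins = 0
    for _ in range(n_episodes):
        s = 0
        for _ in range(HORIZON):
            s = step(s, policy(s, rng))
            if s == GOAL:
                wins += 1
                break
    return wins / n_episodes

rng = random.Random(0)
data = collect(demonstrator, 2000, rng)
demo_rate = win_rate(demonstrator, 5000, rng)
clone_rate = win_rate(clone(data), 5000, rng)
planner_rate = win_rate(plan(), 5000, rng)
print(f"demonstrator {demo_rate:.2f}  clone {clone_rate:.2f}  planner {planner_rate:.2f}")
```

The planner always steps right, so it wins every episode, while the clone's win rate tracks the demonstrator's. If the reward given to `plan` were wrong, the planner would optimize that wrong reward just as hard, which is the failure case discussed next.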
I think it’s reasonable to say “but there’s no reason to expect that value learning will infer the right reward, so we probably won’t do better than imitation” (and I collated Chapter 1 of the Value Learning sequence to make this point). In that case, you should expect that imitation = human performance and value learning = subhuman / catastrophic performance.
According to me, the main challenge of AI x-risk is how to deal with superhuman AI systems, and so if you have this latter position, I think you should be pessimistic about both imitation learning and value learning (unless you combine it with something that lets you scale to superhuman, e.g. iterated amplification, debate or recursive reward modeling).
I agree with what you’re saying. Perhaps I’m being a bit strong. I’m mostly talking about ambitious value learning in an open-ended environment. The game of Go doesn’t have inherent computing capability, so anything the agent does is rather constrained to begin with. I’d hope (guess) that alignment in similarly closed environments is achievable. I’d also like to point out that in such scenarios I’d expect it to normally be possible to give exact goal descriptions, rendering value learning superfluous.
In theory, I’m actually onboard with a weakly superhuman AI. I’m mostly skeptical of the general case. I suppose that makes me sympathetic to approaches that iterate/collectivize things already known to work.