I’m glad that you’re thinking about these things, but this misses what I think is the hard part of the problem: truly out-of-sample cases. The thing that I’m worried about isn’t that a superhuman AI will map (human beings suffering in a currently understood way) to the concept “good”, but that it will have a lot of degrees of freedom in where to map (thing that is only possible with nanotech, which human brains aren’t capable of fully understanding) or (general strategy for meme-hacking human brains, which human brains aren’t able to conceptualize), etc., and that a process of picking the best action is liable to pick out one of these edge cases, which would differ from our extrapolated volitions.
Basically, I don’t see how we can be confident yet that this continues to work once the AI is able to come up with creative edge cases that our brains can’t explicitly encompass, or can’t classify the way our extrapolated volitions would want. As an example of progress that might help with this, I might hope there’s a clever way to regularize model selection so that the selected models don’t include edge cases of this sort, but I’ve not seen anything of that type.
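To gesture at the kind of regularization I’m hoping for, here’s a toy sketch (entirely my own illustration, with made-up names and a simple distance-based abstention rule, not a worked-out proposal): force a learned concept to abstain on inputs far outside everything humans ever labelled, so that a best-action search can’t score those regions as maximally “good”.

```python
import numpy as np

def fit_concept(train_X, train_y):
    """Toy 'concept': a nearest-centroid classifier over human-labelled examples."""
    centroids = {c: train_X[train_y == c].mean(axis=0) for c in np.unique(train_y)}
    # Record how spread out the labelled examples are for each class.
    radii = {c: np.linalg.norm(train_X[train_y == c] - centroids[c], axis=1).max()
             for c in centroids}
    return centroids, radii

def classify_conservatively(x, centroids, radii, slack=1.5):
    """Label x only if it lies near some class's training examples;
    otherwise abstain instead of extrapolating into unconstrained territory."""
    dists = {c: np.linalg.norm(x - centroids[c]) for c in centroids}
    best = min(dists, key=dists.get)
    if dists[best] > slack * radii[best]:
        return None  # out-of-sample: refuse to call this "good" or "bad"
    return best

# Points far from anything humans ever judged get no label at all,
# so an action search can't exploit the concept's judgment out there.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(100, 3))
train_y = (train_X[:, 0] > 0).astype(int)
centroids, radii = fit_concept(train_X, train_y)
print(classify_conservatively(np.array([0.1, -0.2, 0.3]), centroids, radii))    # labelled normally
print(classify_conservatively(np.array([40.0, 40.0, 40.0]), centroids, radii))  # None (abstains)
```

Of course, this only pushes the problem back a step: “far from the training data” is itself measured in whatever feature space the AI happens to use, which is exactly the part I don’t trust.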
Hmm. In a future post, I’m hoping to get to the question of “suppose that an AI could expand the way it has defined its existing concepts by including additional dimensions which humans are incapable of conceptualizing, and this led its values to diverge from human ones”, and I agree that this post is not yet sufficient to solve that one. I think that’s the same problem as you’re talking about (if previously your concepts had N dimensions and now they have N+1, you could find something that fulfilled all the previous criteria while still being different from what we’d prefer if we knew about the N+1th dimension), but I’m not entirely sure?
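To make the N-vs-N+1 picture concrete, here’s a deliberately simplistic toy illustration (my own example; the numbers and the linear “concepts” are arbitrary): two candidate concepts that agree on every case humans ever judged, because the extra dimension happened to be constant in all of those cases, and that nevertheless come apart on most cases once that dimension starts to vary.

```python
import numpy as np

# Every human-judged case so far has two 'visible' dimensions; a third dimension
# (the one humans can't conceptualize) happens to be zero in all of them.
rng = np.random.default_rng(1)
X_seen = np.hstack([rng.normal(size=(200, 2)), np.zeros((200, 1))])
y_seen = (X_seen[:, 0] + X_seen[:, 1] > 0).astype(int)

# Two concepts that extend those judgments differently along the unseen dimension.
concept_a = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)                  # ignores dim 3
concept_b = lambda X: (X[:, 0] + X[:, 1] + 10 * X[:, 2] > 0).astype(int)   # dominated by dim 3

# Indistinguishable on everything humans ever classified...
print((concept_a(X_seen) == y_seen).mean())  # 1.0
print((concept_b(X_seen) == y_seen).mean())  # 1.0

# ...but once the new dimension varies, they disagree on a large fraction of cases.
X_new = rng.normal(size=(200, 3))
print((concept_a(X_new) == concept_b(X_new)).mean())  # well below 1.0
```

The point of the toy example is just that agreement on all previously judged cases puts no constraint at all on how the concept extends along the new dimension.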
Yes, except I’m much more pessimistic about reinforcement learning sufficing, since I expect that a superhuman-engineering-capability AI would have not just a few additional degrees of freedom, but incredibly many. And then it would not suffice for the AI to make its best guess about how to extrapolate human values to a world with nanotech and memehacking and whatever else; that would almost surely lead to disaster.
So how do you feel about the proposal I made in my latest post, to evaluate the new situation in light of the old values? (Might want to continue this thread in the comments of that post.)
My (low-confidence) intuition is that while it’s certainly easy to screw up the implementation, if the system is engineered correctly, then the process by which the AI applies the old values to the new situation/new concept space should be essentially the same as the one by which humans would do it. Of course, in practice “the system being engineered correctly” might require e.g. a very human-like design, including a humanoid body etc., in order to get the initial concept space to become sufficiently similar to the human one, so that’s a problem.
I think I’m also somewhat more optimistic about the range of solutions that might qualify as “good”, because a large part of human values seems to be determined by reinforcement learning. (Compare Hanson on plasticity.) I suspect that if e.g. nanotech and memehacking became available, then the “best” approach to dealing with them would be underdetermined by our current values, and just because an AI would extrapolate our current values differently than humans would, that doesn’t necessarily mean the extrapolation would be any worse. I mean, if the best extrapolation is genuinely underdetermined by our current values, then a wide range of possibilities is equally good pretty much by definition.