Hmm. In a future post, I’m hoping to get to the question of “suppose that an AI could expand the way it has defined its existing concepts by including additional dimensions which humans are incapable of conceptualizing, and this led its values to diverge from human ones”, and I agree that this post is not yet sufficient to solve that one. I think that’s the same problem as you’re talking about (if previously your concepts had N dimensions and now they have N+1, you could find something that fulfilled all the previous criteria while still being different from what we’d prefer if we knew about the N+1th dimension), but I’m not entirely sure?
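(To make that N vs. N+1 worry concrete, here is a toy sketch of my own, with made-up numbers and no connection to any actual system: a value function that only scores the N dimensions it was learned over assigns identical values to two outcomes that differ only along an unseen (N+1)th dimension, so optimizing it cannot distinguish the one we would prefer from the one we would not.)

```python
# Toy illustration of the N vs. N+1 dimension point (hypothetical, not from the post):
# a value function learned over N observed features cannot distinguish two outcomes
# that differ only along an unseen (N+1)th dimension.

import numpy as np

rng = np.random.default_rng(0)
N = 3  # dimensions the learned concepts actually cover

# Stand-in "learned value function": a fixed linear scoring of the N known features.
weights = rng.normal(size=N)

def learned_value(outcome):
    """Score an outcome using only its first N dimensions."""
    return float(weights @ outcome[:N])

# Two outcomes that are identical on the N known dimensions...
outcome_a = np.array([0.2, 0.9, 0.4, 0.0])   # ...but differ on a hidden
outcome_b = np.array([0.2, 0.9, 0.4, 1.0])   # (N+1)th dimension.

print(learned_value(outcome_a) == learned_value(outcome_b))  # True: indistinguishable
# If human preferences, once aware of the extra dimension, would rank a over b,
# both outcomes still satisfy the old criteria equally well, so optimizing those
# criteria can just as easily land on b.
```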
Yes, except I’m much more pessimistic about reinforcement learning sufficing, since I expect that a superhuman-engineering-capability AI would have not just a few additional degrees of freedom, but incredibly many. And then it would not suffice for the AI to make its best guess about how to extrapolate human values to a world with nanotech and memehacking and (whatever else); that would almost surely lead to disaster.
So how do you feel about the proposal I made in my latest post, to evaluate the new situation in light of the old values? (Might want to continue this thread in the comments of that post.)
My (low-confidence) intuition is that while it’s certainly possible to screw up the implementation, if the system is engineered correctly, then the process by which the AI applies the old values to the new situation/new concept space should be essentially the same as the one by which humans would do it. Of course, in practice “the system being engineered correctly” might require e.g. a very human-like design, including a humanoid body etc., in order to get the initial concept space to become sufficiently similar to the human one, so that’s a problem.
I think I’m also somewhat more optimistic about the range of solutions that might qualify as “good”, because a large part of human values seems to be determined by reinforcement learning. (Compare Hanson on plasticity.) I suspect that if e.g. nanotech and memehacking became available, then the “best” way to deal with them would be underdetermined by our current values, and just because an AI would extrapolate our current values differently than humans would doesn’t necessarily mean that that extrapolation would be any worse. I mean, if the best extrapolation is genuinely underdetermined by our current values, then a wide range of possibilities is equally good pretty much by definition.