I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of “value learning”. (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354).They appear a bit shallow to me and don’t contain any specific ideas on how to implement this. But maybe Muraven has a point here.
But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of “value learning”.
Ah, I see. You might be interested in this sequence then!
Yes. But my impression so far is that anything we can even imagine in terms of a goal function will go badly wrong somehow. So I find it a bit reassuring that at least one such function that will not necessarily lead to doom seems to exist, even if we don’t know how to encode it yet.
I think we (mostly) all agree that we want to somehow encode human values into AGIs. That’s not a new idea. The devil is in the details.
I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of “value learning”. (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354).They appear a bit shallow to me and don’t contain any specific ideas on how to implement this. But maybe Muraven has a point here.
I think your question is excellent. “How does the single existing kind of generally intelligent agent form its values?” is one of the most important and neglected questions in all of alignment, I think.
Ah, I see. You might be interested in this sequence then!
Yes, thank you!
Yes. But my impression so far is that anything we can even imagine in terms of a goal function will go badly wrong somehow. So I find it a bit reassuring that at least one such function that will not necessarily lead to doom seems to exist, even if we don’t know how to encode it yet.