As someone working on value learning, I would like to emphasize that in the approaches I see as realistic, human values are not learned or stored as one big utility function over a single model of the world. Instead, they’re learned entangled with information about the world and with ways of thinking about the world, and weighing them against each other is a large part of the challenge.
So in large part I agree with you, and I totally agree that an AI will learn about my values faster and better by using different ways of understanding different values.
That said, I think your introduction is pretty bad, and you overrate human programmers.
First, the introduction. You worry that an AI using a utility function will be bad because it won’t represent a fundamental difference between a preference and a religious taboo. To the reader, it sounds like either you’re misunderstanding utility functions and think that a utility function can’t represent the behavioral consequences of this difference, or you’re saying that it’s important to you that the AI has a little metadata flag saying “religious taboo” inside of it even if there is no behavioral consequence.
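To be concrete about that first reading, here’s a toy sketch (made-up names, nothing to do with any real system) of how a single scalar utility function can still produce the behavioral difference between a mild food preference and a taboo:

```python
# Toy sketch (hypothetical names): one scalar utility function that still
# yields the *behavioral* difference between a mild food preference and a
# religious taboo, by giving the taboo a penalty that ordinary preferences
# can never outweigh.

def utility(outcome: dict) -> float:
    u = 0.0
    if outcome.get("meal") == "favorite_dish":
        u += 1.0      # mild preference: trades off against other goods
    if outcome.get("violates_taboo"):
        u -= 1e9      # taboo: effectively never worth violating
    return u

options = [
    {"meal": "favorite_dish", "violates_taboo": True},
    {"meal": "plain_meal", "violates_taboo": False},
]
print(max(options, key=utility))  # picks the plain meal: the taboo dominates
```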
Maybe both of these “bad impressions” are somewhat accurate. But you eventually get around to learning and generalization, where I think the actual benefit is, so that’s good :)
Second, your picture of how we get knowledge of human psychology into the AI is off. You seem to be picturing the programmers studying all these different ways of interpreting human behavior in terms of values and then designing, by hand, representations that the AI can use to learn those sorts of human values. This radically overestimates human programmers. You correctly point out that value learning is hard, and that if the AI learns the representations itself it’s hard to tell whether it’s really capturing what we think is important, but this problem doesn’t go away if humans are doing the work!
Hi Charlie, thanks for your comment.
Just to clarify: I agree that there would be no point in an AI flagging different value types with a little metadata flag saying ‘religious taboo’ vs ‘food preference’ unless that metadata was computationally relevant to the kinds of learning, inference, generalization, and decision-making that the AI did. But my larger point was that humans treat these value types very differently in terms of decision-making (especially in social contexts), so true AI alignment would require that AI systems do too.
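To gesture at what “computationally relevant” could mean, here is a toy illustration (made-up names, not a proposal): the value-type tag only earns its keep if the decision procedure actually consults it, e.g. with taboo-typed values acting as hard constraints while preference-typed values get traded off.

```python
# Toy illustration (hypothetical names): a value-type tag that the decision
# procedure actually consults. "taboo"-typed values filter the option set as
# hard constraints; "preference"-typed values are scored and traded off.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Value:
    kind: str                          # e.g. "taboo" or "preference"
    applies: Callable[[dict], bool]    # does this value bear on the outcome?
    weight: float = 1.0                # only meaningful for preferences

values = [
    Value(kind="taboo", applies=lambda o: o.get("violates_taboo", False)),
    Value(kind="preference", applies=lambda o: o.get("tasty", False), weight=2.0),
]

def choose(options: list[dict]) -> dict:
    # Taboos remove options outright; preferences score whatever remains.
    allowed = [o for o in options
               if not any(v.applies(o) for v in values if v.kind == "taboo")]
    return max(allowed,
               key=lambda o: sum(v.weight for v in values
                                 if v.kind == "preference" and v.applies(o)))

print(choose([
    {"violates_taboo": True, "tasty": True},
    {"violates_taboo": False, "tasty": True},
]))  # picks the second option: the taboo tag removes the first outright
```

That is the sort of behavioral difference I have in mind, however the tags end up being learned or represented.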
I wasn’t picturing human programmers designing value representations by hand for each value type. I don’t know how to take the heterogeneity of value types seriously when developing AI systems. I was just arguing that we need to solve that problem somehow, if we actually want the AI to act in accordance with the way humans treat different types of values differently.