Nice! This is definitely one of those clever ideas that seems obvious only after you’ve heard it.
The issue with the straightforward version of this is that value learning is not merely about learning human preferences; it's also about learning human meta-preferences. Put another way, we wouldn't be satisfied with the utility function we appear to be rationally optimizing, because we think our actual actions contain mistakes. So you don't just need to learn a utility function, you also need to learn an "irrationality model" of how the agent makes mistakes.
This isn't a fatal blow to the idea, but it seems to make generating the training data much more challenging, because the training data needs to instill a tendency to interpret humans the way they want to be interpreted.
Certainly we make mistakes. Can you elaborate on the difference between what we appear to be optimizing (plus or minus mistakes, akrasia, etc.) and what we actually value? Is this CEV, or something else? CEV would potentially be an important part of extending such a model to the point of being useful for real-world AI alignment, but it could be very difficult to implement, at least at first.
So, if I’m a smoker who wants to quit but finds it hard, I want the AI to learn that I want to quit. But if you didn’t bias the training data towards cases where agents have addictions they don’t want (as opposed to straightforwardly doing what they want, or even complaining about things that they do in fact want), the AI will learn that I want to keep smoking while complaining about it.
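To make the "irrationality model" point a bit more concrete, here's a toy sketch (my own illustration, not anything from the original post) using a standard Boltzmann-rational observation model. The function name, the hypotheses, and the made-up observation counts are all hypothetical; the point is just that the same behavior (mostly smoking, occasionally abstaining, while complaining) supports different inferred preferences depending on how rational we assume the agent is.

```python
# Toy illustration (assumed, not from the original discussion): the reward
# we infer from behavior depends on the assumed "irrationality model".
# Here that model is a Boltzmann-rational likelihood with a rationality
# coefficient beta.

import math

def boltzmann_likelihood(choices, reward, beta):
    """Log P(observed choices | reward, rationality beta).

    choices: list of (chosen_action, alternative_action) pairs
    reward:  dict mapping action -> hypothesized utility
    beta:    rationality coefficient (0 = random, large = near-optimal)
    """
    logp = 0.0
    for chosen, alt in choices:
        z = math.exp(beta * reward[chosen]) + math.exp(beta * reward[alt])
        logp += beta * reward[chosen] - math.log(z)
    return logp

# Made-up observed behavior: smokes on 9 of 10 days, abstains once.
observations = [("smoke", "abstain")] * 9 + [("abstain", "smoke")]

# Two candidate value functions the learner is choosing between.
hypotheses = {
    "really wants to smoke": {"smoke": 1.0, "abstain": 0.0},
    "really wants to quit":  {"smoke": 0.0, "abstain": 1.0},
}

# Under a "humans are near-rational" model (large beta), the data strongly
# favor "wants to smoke". Under a "humans are weak-willed here" model
# (small beta), the same data barely distinguish the hypotheses, leaving
# room for other evidence (e.g. stated preferences) to dominate.
for beta in (5.0, 0.5):
    print(f"beta = {beta}")
    for name, reward in hypotheses.items():
        ll = boltzmann_likelihood(observations, reward, beta)
        print(f"  {name}: log-likelihood = {ll:.2f}")
```

Nothing here tells you which beta (or which richer error model) is the right one; that's exactly the extra thing that has to be learned or built into the training data.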
Similar things show up for a lot of the things we'd call our biases (loss aversion, my-side bias, etc.). A nonhuman observer of our society probably needs to be able to read our books and articles and apply them to interpreting us. This whole "interpret us how we want to be interpreted" thing is one of the requirements for CEV, yeah.
A human psychologist might conclude the same thing. :)
An economist, definitely.
Sounds like there could be at least two approaches here. One would be CEV. The other would be to consider the smoker as wanting to smoke, or at least to avoid withdrawal cravings, and also to avoid the downsides of smoking. A sufficiently powerful agent operating on this model would try to suppress withdrawals, cure lung cancer or otherwise act in the smoker’s interests. On the other hand, a less powerful agent with this model might try to simply keep the smoker smoking. There’s an interesting question here about to what extent revealed preferences are a person’s true preferences, or whether addictions and the like should be considered an unwanted addition to one’s personality.