Certainly we make mistakes. Can you elaborate on the difference between what we appear to be optimizing (plus or minus mistakes, akrasia, etc.) and what we actually value? Is this CEV, or something else? CEV would potentially be an important part of extending such a model to the point of being useful for real-world AI alignment, but it could be very difficult to implement, at least at first.
So, if I’m a smoker who wants to quit but finds it hard, I want the AI to learn that I want to quit. But if you didn’t bias the training data towards cases where agents have addictions they don’t want (as opposed to straightforwardly doing what they want, or even complaining about things that they do in fact want), the AI will learn that I want to keep smoking while complaining about it.
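To make that concrete, here's a toy sketch (the hypotheses, numbers, and softmax choice model are mine, purely illustrative): if you infer which of two candidate reward hypotheses best explains someone's observed behaviour, the observed smoking swamps the stated wish to quit.

```python
import math

# Toy sketch, purely illustrative: infer which of two candidate reward hypotheses
# best explains observed behaviour, assuming the person chooses "noisily rationally"
# (softmax / Boltzmann) with respect to their true reward.
actions = ["smoke", "abstain"]

# Hypothesis A: they fundamentally want to smoke (the complaints are cheap talk).
# Hypothesis B: they want to quit.
reward = {
    "wants_to_smoke": {"smoke": 1.0, "abstain": -1.0},
    "wants_to_quit":  {"smoke": -1.0, "abstain": 1.0},
}

def action_prob(hypothesis, action, beta=1.0):
    """P(action | hypothesis) under a softmax choice model."""
    z = sum(math.exp(beta * reward[hypothesis][a]) for a in actions)
    return math.exp(beta * reward[hypothesis][action]) / z

# Observed behaviour: smoking 9 times out of 10, despite complaining about it.
observed = ["smoke"] * 9 + ["abstain"]

def log_likelihood(hypothesis):
    return sum(math.log(action_prob(hypothesis, a)) for a in observed)

for h in reward:
    print(h, round(log_likelihood(h), 2))
# "wants_to_smoke" wins by a wide margin: nothing here lets the stated wish to quit
# outweigh the revealed choices unless the training data (or the choice model)
# explicitly represents unwanted compulsions like addiction.
```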
Similar things show up for a lot of the things we’d call our biases (loss aversion, my-side bias, etc.). A nonhuman observer of our society probably needs to be able to read our books and articles and apply them to interpreting us. This whole “interpret us how we want to be interpreted” thing is one of the requirements for CEV, yeah.
A human psychologist might conclude the same thing. :)
An economist, definitely.
Sounds like there could be at least two approaches here. One would be CEV. The other would be to consider the smoker as wanting to smoke, or at least to avoid withdrawal cravings, and also to avoid the downsides of smoking. A sufficiently powerful agent operating on this model would try to suppress withdrawals, cure lung cancer, or otherwise act in the smoker’s interests. On the other hand, a less powerful agent with the same model might simply try to keep the smoker smoking. There’s an interesting question here about the extent to which revealed preferences are a person’s true preferences, or whether addictions and the like should be considered an unwanted addition to one’s personality.
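Here's a toy sketch of what that second approach could look like (the value components, weights, and intervention names are mine, just illustrative): decompose the smoker's interests into avoiding withdrawal, avoiding the health damage, and some enjoyment of smoking, then compare what agents with different capabilities would pick under the very same model.

```python
# Toy sketch, purely illustrative: model the smoker's interests as separate value
# components, then compare what agents with different capabilities would do under
# the very same model.

def wellbeing(state):
    # Weights are revealed-preference-flavoured guesses: the person keeps smoking,
    # so (under this model) avoiding withdrawal outweighs the long-run damage.
    return (
        1.0 * state["smoking"]        # some genuine enjoyment of smoking
        - 6.0 * state["withdrawal"]   # cravings are strongly disliked
        - 5.0 * state["lung_damage"]  # so are the health downsides
    )

# Each available intervention maps to the resulting state (1 = present, 0 = absent).
powerful_agent = {
    "suppress_withdrawal_and_quit": {"smoking": 0, "withdrawal": 0, "lung_damage": 0},
    "cure_lung_damage":             {"smoking": 1, "withdrawal": 0, "lung_damage": 0},
    "leave_them_smoking":           {"smoking": 1, "withdrawal": 0, "lung_damage": 1},
}
weak_agent = {
    # can't treat withdrawal or cure disease, only control access to cigarettes
    "leave_them_smoking":   {"smoking": 1, "withdrawal": 0, "lung_damage": 1},
    "take_away_cigarettes": {"smoking": 0, "withdrawal": 1, "lung_damage": 0},
}

def best_option(options):
    return max(options, key=lambda name: wellbeing(options[name]))

print("powerful agent picks:", best_option(powerful_agent))  # cure_lung_damage
print("weak agent picks:", best_option(weak_agent))          # leave_them_smoking
# Same value model, very different behaviour: the powerful agent can satisfy all the
# components at once, while the weak agent, unable to touch withdrawal, scores
# keeping the person smoking highest.
```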