Yeah, I don’t 100% buy the arguments which I gave in bullet-points in my previous comment.
But I guess I would say the following:
I expect to basically not buy any descriptive theory of human preferences. It doesn’t seem likely that we could find a super-prospect theory which really successfully codifies the sorts of inconsistencies we see in human values, and then reap some benefit for AI alignment.
So it seems like what you want to do instead is make very few assumptions at all. Assume that the human can do things like answer questions, but don’t expect responses to be consistent even in the most basic sense of “the same answer to the same question”. Of course, this can’t be the end of the story, since we need to have a criterion—what it means to be aligned with such a human. But hopefully the criterion would also be as agnostic as possible. I don’t want to rely on specific theories of human irrationality.
So, when you say you want to see more discussion of this because it is “absolutely critical”, I am curious about your model of what kind of answers are possible and useful.
My current best understanding is that if we assume people have arbitrary inconsistencies, the best we can do is satisfice across a person’s different values by finding near-Pareto improvements among them. But inconsistent values don’t even allow Pareto improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do “wrong”, so that we can pick which parts of human preferences an AI should respect, and which can be ignored.
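To make the “any change makes things incomparable” point concrete, here is a toy sketch (my own framing, with arbitrary option names and deliberately extreme inconsistency): treat one person as a set of rankings elicited at different moments, and call a switch an intra-personal Pareto improvement only if no moment objects and at least one moment prefers it.

```python
# Toy sketch (my framing, nothing formal from the discussion above): model one
# person as several moment-to-moment rankings and check whether any switch
# between options counts as an intra-personal Pareto improvement.

from itertools import permutations

OPTIONS = ["A", "B", "C"]

# Hypothetical, deliberately extreme inconsistency: two moments with fully
# reversed rankings (best option first).
moment_rankings = [
    ["A", "B", "C"],
    ["C", "B", "A"],
]

def strictly_prefers(ranking, x, y):
    """True if this ranking places x above y."""
    return ranking.index(x) < ranking.index(y)

def is_pareto_improvement(new, old):
    """No moment objects to the switch, and at least one moment wants it."""
    no_objection = all(not strictly_prefers(r, old, new) for r in moment_rankings)
    some_gain = any(strictly_prefers(r, new, old) for r in moment_rankings)
    return no_objection and some_gain

improvements = [(old, new)
                for new, old in permutations(OPTIONS, 2)
                if is_pareto_improvement(new, old)]
print(improvements)  # [] -- every possible switch is vetoed by some moment
```

With milder inconsistency some switches survive, but anything touching a contested pair of options becomes incomparable.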
For instance, I love my children, and I like chocolate. I’m also inconsistent in these preferences in ways that differ between them: at a given moment, I’m much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but less consistent preference for my children over my more consistent preference for candy. I don’t know how to formalize this in a way that would generalize, which seems like a problem. Similar problems exist for time preferences and other typical inconsistencies; they are either outright inconsistent, or at least can be exploited by an AI whose model doesn’t try to resolve those inconsistencies.
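The “can be exploited” worry is essentially the standard money-pump argument; here is a throwaway sketch (with made-up cyclic preferences, not my actual kids-versus-chocolate case) of how an AI that takes moment-to-moment preferences literally could drain value from me:

```python
# Toy money-pump sketch (the standard textbook construction, not something
# from the comment above): an agent with cyclic preferences will pay a small
# fee for each "upgrade" and can be walked in circles indefinitely.

# Hypothetical cyclic preferences: A is preferred to B, B to C, and C to A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

FEE = 1.0          # what the exploiter charges per trade
holding = "A"
wealth = 100.0

for _ in range(6):                      # six trades around the cycle
    # Offer whatever the agent currently prefers to its holding.
    offer = next(x for x in "ABC" if (x, holding) in prefers)
    holding = offer
    wealth -= FEE                       # the agent happily pays each time
    print(f"traded up to {holding}, wealth now {wealth:.0f}")

# The agent ends up holding A again, 6 units poorer -- an AI that models the
# moment-to-moment preferences literally can extract value this way.
```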
With a super-prospect theory, I would hope we may be able to define a CEV or similar, which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV—but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.
(Beyond this, I think we need to expect that values will differ between humans, and that we can keep things safe by insisting on a near-Pareto improvement: accepting only changes that are a Pareto improvement with respect to a very large portion of people, with relatively minor dis-improvements for the remainder. But that’s a different discussion.)
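For concreteness, that near-Pareto condition can be written down fairly directly; here is a minimal sketch, where the “very large portion” and “relatively minor” thresholds are numbers I made up purely for illustration:

```python
# Hedged sketch of one way to read "near-Pareto improvement" across people
# (the thresholds are mine, chosen only for illustration): accept a change
# when almost everyone is at least as well off, and nobody in the remainder
# loses more than a small amount.

def near_pareto_improvement(deltas, min_ok_fraction=0.95, max_loss=0.01):
    """deltas: per-person change in value (positive = better off)."""
    not_worse_off = sum(1 for d in deltas if d >= 0)
    worst_loss = max((-d for d in deltas if d < 0), default=0.0)
    return (not_worse_off / len(deltas) >= min_ok_fraction
            and worst_loss <= max_loss)

# 97% of people benefit, the rest lose only a tiny amount -> accepted.
print(near_pareto_improvement([0.5] * 97 + [-0.005] * 3))   # True
# A change that hurts even one person badly is rejected.
print(near_pareto_improvement([0.5] * 99 + [-10.0]))        # False
```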