paulfchristiano comments on Clarifying “AI Alignment”

paulfchristiano 24 Nov 2018 19:25 UTC
LW: 2 AF: 1
AF
Assuming you agree that we can’t be certain about which metaethical position is correct yet, I think by implicitly adopting a subjectivist/anti-realist framing, you make the problem seem easier than we should expect it to be.
I don’t see why the anti-realist version is any easier: my preferences about how my values evolve are complex and can depend on the endpoint of that evolution process and on arbitrarily complex logical facts. I think the analogous non-realist mathematical framing is fine. If anything the realist versions seem easier to me (and this is related to why mathematics seems so much easier than morality), since you can anchor changing preferences to some underlying ground truth and have better prospects for error-correction, but I don’t think it’s a big difference.
Additionally, this framing also makes the potential consequences of failing to solve the problem sound less serious than they could be. I.e., if there is such a thing as someone’s true or normative values, then failing to optimize the universe for those values is really bad, but if they just have preferences about how their values evolve, then even if their values fail to evolve in that way, at least whatever values the universe ends up being optimized for are still their values, so not all is lost.
It doesn’t sound that way to me, but I’m happy to avoid framings that might give people the wrong idea.
I think I would prefer to frame the problem as “How can we design/use AI to prevent the corruption of human values, especially corruption caused/exacerbated by the development of AI?”
My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
But in terms of the actual meanings rather than their impacts on people, I’d be about as happy with “avoiding corruption of values” as “having our values evolve in a positive way.” I think both of them have small shortcomings as framings. My main problem with corruption is that it suggests an unrealistically bright line / downplays our uncertainty about how to think about changing values and what constitutes corruption.
- Wei Dai 24 Nov 2018 23:07 UTC
  LW: 2 AF: 1
  AF Parent
  
  I don’t see why the anti-realist version is any easier
  
  It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve. But you’re right that the realist version might be easier in other ways, so perhaps what I should say instead is that the problem definitely seems harder if we also include the subproblem of figuring out what the right metaethics is in the first place, and (by implicitly assuming a subset of all plausible metaethical positions) the statement of the problem that you proposed also does not convey a proper amount of uncertainty in its difficulty.
  
  My main complaint with this framing (and the reason that I don’t use it) is that people respond badly to invoking the concept of “corruption” here—it’s a fuzzy category that we don’t understand, and people seem to interpret it as the speaker wanting values to remain static.
  
  That’s a good point that I hadn’t thought of. (I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.) If you or anyone else have a suggestion about how to phrase the problem so as to both avoid this issue and address my concerns about not assuming a particular metaethical position, I’d highly welcome that.
  - paulfchristiano 25 Nov 2018 1:01 UTC
    LW: 2 AF: 1
    AF Parent
    It seems easier in that the AI / AI designer doesn’t have to worry about the user being wrong about how they want their values to evolve.
    That may be a connotation of the phrase “preferences about how their values evolve,” but it doesn’t seem to follow from the anti-realist position.
    I have preferences over what actions my robot takes. Yet if you asked me “what action do you want the robot to take?” I could be mistaken. I need not have access to my own preferences (since they can e.g. depend on empirical facts I don’t know). My preferences over value evolution can be similar.
    Indeed, if moral realists are right, “ultimately converge to the truth” is a perfectly reasonable preference to have about how my preferences evolve. (Though again this may not be captured by the framing “help people’s preferences evolve in the way they want them to evolve.”) Perhaps the distinction is that there is some kind of idealization even of the way that preferences evolve, and maybe at that point it’s easier to just talk about preservation of idealized preferences (though that also has unfortunate implications and at least some minor technical problems).
    I guess talking about “drift” has a similar issue though, in that people might misinterpret it as the speaker wanting values to remain static.
    I agree that drift is also problematic.
    - Wei Dai 26 Nov 2018 19:50 UTC
      LW: 2 AF: 1
      AF Parent
      Would you agree with this way of stating it: There are more ways for someone to be wrong about their values under realism than under anti-realism. Under realism someone could be wrong even if they correctly state their preferences about how they want their values to evolve, because those preferences could themselves be wrong. So assuming an anti-realist position makes the problem sound easier because it implies there are fewer ways for the user to be wrong for the AI / AI designer to worry about.
      - paulfchristiano 27 Nov 2018 18:49 UTC
        LW: 2 AF: 1
        AF Parent
        Could you give an example of a statement you think could be wrong on the realist perspective, for which there couldn’t be a precisely analogous error on the non-realist perspective?
        There is some uninteresting semantic sense in which there are “more ways to be wrong” (since there is a whole extra category of statements that have truth values...) but not a sense that is relevant to the difficulty of building an AI.
        I might be using the word “values” in a different way than you are. I think I can say something like “I’d like to deliberate in way X” and be wrong. I guess under non-realism I’m “incorrectly stating my preferences” and under realism I could be “correctly stating my preferences but be wrong,” but I don’t see how to translate that difference into any situation where I build an AI that is adequate on one perspective but inadequate on the other.
        Wei Dai 28 Nov 2018 12:05 UTC
        2 points
        Parent
        Suppose the user says “I want to try to figure out my true/normative values by doing X. Please help me do that.” If moral anti-realism is true, then the AI can only check if the user really wants to do X (e.g., by looking into the user’s brain and checking if X is encoded as a preference somewhere). But if moral realism is true, the AI could also use its own understanding of metaethics and metaphilosophy to predict if doing X would reliably lead to the user’s true/normative values, and warn the user or refuse to help or take some other action if the answer is no. Or if one can’t be certain about metaethics yet, and it looks like X might prematurely lock the user into the wrong values, the AI could warn the user about that.
        
        paulfchristiano 28 Nov 2018 19:55 UTC
        2 points
        Parent
        I definitely don’t mean such a narrow sense of “want my values to evolve.” Seems worth using some language to clarify that.
        In general the three options seem to be:
        1. You care about what is “good” in the realist sense.
        2. You care about what the user “actually wants” in some idealized sense.
        3. You care about what the user “currently wants” in some narrow sense.
        It seems to me that the first two are pretty similar. (And if you are uncertain about whether realism is true, and you’d be in the first case if you accepted realism, it seems like you’d probably be in the second case if you rejected realism. Of course that would depend on the nature of your uncertainty about realism: your views could depend in an arbitrary way on whether realism is true or false, depending on what versions of realism/non-realism are competing, but I’m assuming something like the most common realist and non-realist views around here.)
        To defend my original usage both in this thread and in the OP, which I’m not that attached to, I do think it would be typical to say that someone made a mistake if they were trying to help me get what I wanted, but failed to notice or communicate some crucial consideration that would totally change my views about what I wanted—the usual English usage of these terms involves at least mild idealization.