Children also learn right from wrong—I’d be interested in where you draw the line between “An AI that learns common sense” and “An AI that learns right from wrong.”
I’m happy to assume that AI will learn right from wrong to about the level that children do. This is not a sufficiently good definition of “the good” that we can then optimize it.
My suspicion, which is interesting to me so I’ll explain it even if you’re going to tell me that I’m off base, is that you’re thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn’t carry over to “knowing right from wrong.” An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.
That sounds basically right, with the caveat that you want to be a bit more specific and precise with what the AI system should do than just saying “common sense”; I’m using the phrase as a placeholder for something more precise that we need to figure out.
Also, I’d change the last sentence to “an AI that has learned right from wrong to the same extent that humans learn it, and then optimizes for right things as hard as possible, will probably make dangerous moral mistakes”. The point is that when you’re trying to define “the good” and then optimize it, you need to be very very correct in your definition, whereas when you’re trying not to optimize too hard in the first place (which is part of what I mean by “common sense”) then that’s no longer the case.
After all, one of the most universally acknowledged things about common sense is that it’s uncommon among humans!
At this point I don’t think we’re talking about the same “common sense”.
Merely applying common sense as well as humans do seems like a recipe for making a horrible mistake that seemed like the right thing at the time; this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense?]).
But why?
They’re saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.
Again it depends on how accurate the “right/wrong classifier” needs to be, and how accurate the “common sense” needs to be. My main claim is that the path to safety that goes via “common sense” is much more tolerant of inaccuracies than the path that goes through optimizing the output of the right/wrong classifier.
My first idea is, you take your common sense AI, and rather than saying “build me a spaceship, but, like, use common sense,” you can tell it “do the right thing, but, like, use common sense.” (Obviously with “saying” and “tell” in invisible finger quotes.) Bam, Type-1 FAI.
Of course, whether this will go wrong or not depends on the specifics. I’m reminded of Adam Shimi et al.’s recent post that mentioned “Ideal Accomplishment” (how close to an explicit goal a system eventually gets) and “Efficiency” (how fast it gets there). If you have a general-purpose “common sensical optimizer” that optimizes any goal but, like, does it in a common sense way, then before you turn it on you’d better know whether it’s affecting ideal accomplishment, or just efficiency.
That is to say, if I tell it to make me the best spaceship it can or something similarly stupid, will the AI “know that the goal is stupid” and only make a normal spaceship before stopping? Or will it eventually turn the galaxy into spaceships, just taking common-sense actions along the way? The truly idiot-proof common sensical optimizer changes its final destination so that it does what we “obviously” meant, not what we actually said. The flaws in this process seem to determine whether it’s trustworthy enough to be told to “do the right thing,” or indeed trustworthy enough to be told to do anything at all.
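The contrast being drawn here can be made concrete with a toy sketch. This is purely illustrative (all function names, the value model, and the scores are invented for the example, not taken from the discussion): a hard optimizer always takes the argmax of a learned value model, while a “mild” optimizer, a crude stand-in for “use common sense, don’t optimize too hard,” stops at the first mundane option that is good enough. If the value model is slightly mis-specified at the extremes, the hard optimizer is drawn exactly to those extremes.

```python
def hard_optimize(options, value_model):
    """Pick whatever the (possibly mis-specified) value model scores highest."""
    return max(options, key=value_model)

def mild_optimize(options, value_model, good_enough):
    """Take the first option, in a fixed mundane order, that clears a
    'good enough' bar -- a crude stand-in for not optimizing too hard."""
    for option in options:
        if value_model(option) >= good_enough:
            return option
    # Fall back to hard optimization only if nothing clears the bar.
    return max(options, key=value_model)

# A value model that is roughly right for ordinary options but wildly
# wrong at the extreme: it drastically overvalues the galaxy-scale plan.
options = ["normal spaceship", "deluxe spaceship", "galaxy of spaceships"]
scores = {"normal spaceship": 0.8,
          "deluxe spaceship": 0.9,
          "galaxy of spaceships": 100.0}  # the mis-specified extreme

print(hard_optimize(options, scores.get))                    # galaxy of spaceships
print(mild_optimize(options, scores.get, good_enough=0.75))  # normal spaceship
```

The point of the sketch is only that the satisficing version is far more tolerant of inaccuracy in the value model: it never needed the model to be correct about the extreme option, because it never went looking there.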