In general, if you have some useful but potentially malign data source (humans, in the translation example) then that’s a possible problem—whether you learn from the data source or merely consult it.
You have to solve each instance of that problem in a way that depends on the details of the data source. In the translation example, you need to actually reason about human psychology. In the case of SETI, we need to coordinate not to use malign alien messages (or else opt to let the aliens take over).
Otherwise, aren’t you “cheating” by letting aligned AIs use AI techniques that their competitors aren’t allowed to use?
I’m just trying to compete with a particular set of AI techniques: every time you would have used one of those (potentially dangerous) techniques, you can instead use the safe alternative we’ve developed.
If there are other ways to make your AI more powerful, you have to deal with those on your own. That might mean learning from human abilities that are entangled with malign behavior in complex ways, using an AI design you found in an alien message, using an unsafe physical process to generate large amounts of power, or whatever.
I grant that my definition of the alignment problem would count “learning from a malign data source” as an alignment problem, since you ultimately end up with a malign AI. But that problem occurs with or without AI, and I don’t think it is deceptive to factor it out (though I agree I should be more careful about the statement, or switch to a more refined one).
I also don’t think it’s a particularly important problem, and it’s not what people usually have in mind as a failure mode. I’ve discussed this problem with a few people while trying to explain some subtleties of the alignment problem, and most of them hadn’t thought about it and were pretty skeptical. So in those respects I think it’s basically fine.
When Ajeya says:
provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.
This is meant to include things like “You don’t have a malign data source that you are learning from.” I agree that it’s slightly misleading if we think that humans are such a data source.