paulfchristiano comments on Can corrigibility be learned safely?

paulfchristiano 17 Apr 2018 22:43 UTC
LW: 2 AF: 1
AF
in order to reach some target capability via amplification, you may have to go through another capability that is harder for ML to learn
This is a general restriction on iterated amplification. Without this restriction it would be mostly trivial—whatever work we could do to build aligned AI, you could just do inside HCH, then delegate the decision to the resulting aligned AI.
In this case, the target capability is translation, and the intermediate capability is linguistic knowledge and skills. (We currently have ML that can learn to translate, but AFAIK not learn how to apply linguistics to recreate the ability to translate.)
If your AI is able to notice an empirical correlation (e.g. word A cooccurs with word B), and lacks the capability to understand anything at all about the causal structure of that correlation, then you have no option but to act on the basis of the brute association, i.e. to take the action that looks best to you in light of that correlation, without conditioning on other facts about the causal structure of the association, since by hypothesis your system is not capable enough to recognize those other facts.
If we have an empirical association between behavior X (pressing a sequence of buttons related in a certain way to what’s in memory) and our best estimate of utility, we might end up needing to take that action without understanding what’s going on causally. I’m still happy calling this aligned in general: the exact same thing would happen to a perfectly motivated human assistant trying their best to do what you want, who was able to notice an empirical correlation but was not smart enough to notice anything about the underlying mechanism (and sometimes acting on the basis of such correlations will be bad).
In order to argue that our AI leads to good outcomes, we need to make an assumption not only about alignment but about capability. If the system is aligned it will be trying its best to make use of all of the information it has to respond appropriately to the observed correlation, to behave cautiously in light of that uncertainty, etc.. But in order to get a good outcome, and even in order to avoid a catastrophic outcome, we need to make some assumptions about “what the AI is able to notice.”
(Ideally IDA could eventually serve as an adequate operationalization of “smart enough to understand X” and similar properties.)
These include assumptions like “if the AI is able to cook up a plan that gets high reward because it kills the human, the AI is likely to be able to notice that the plan involves killing the human” and “the AI is smart enough to understand that killing the human is bad, or sufficiently risky that it is worth behaving cautiously and check with the human” and “the AI is smart enough that it can understand when the human says `X is bad’.” Some of these we can likely verify empirically. Some of them will require more work to even state cleanly. And there will be some situations where these assumptions simply aren’t true, e.g. because there is an unfortunate fact about the world that introduces the linkage (plan X kills humans) --> (plan X looks good on paper) without telling you anything about why.
I’m currently considering these problems out of scope for me because (a) there seems to be no way to have a clever idea about AI that avoids this family of problems without sacrificing competitiveness, (b) they would occur with a well-motivated human assistant, (c) we don’t have much reason to suspect that they are particularly serious problems compared to other kinds of mistakes an AI might make.
(I don’t really care whether we call them “alignment” problems per se, though I’m proposing defining alignment such that they wouldn’t be.)