Rohin Shah comments on A naive alignment strategy and optimism about generalization

Rohin Shah 27 Jul 2021 12:27 UTC
Planned summary for the Alignment Newsletter:
We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.
(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also [learn](https://deepmind.com/blog/learning-through-human-feedback/) from [comparisons](https://ai-alignment.com/optimizing-with-comparisons-c02b8c0d7877) between answers, providing only those comparisons in which we are confident.)
What will the AI system do on questions where we _wouldn’t_ be confident in the answers? Examples include questions that are complex, questions where we may be misled by bad observations, and questions where an adversary is manipulating us.
One possibility is that the AI system has learned the **intended policy**, where it answers questions honestly to the best of its ability. However, there is an **instrumental policy** which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions, _even if_ the model “knows” that the answer is incorrect.
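As a toy illustration (not taken from the original post), the two policies can be sketched as functions that agree on every training question and diverge only on a question where the human is misled. Everything here is hypothetical: the `Question` type, the field names, and the example questions are invented for illustration, and the sketch assumes the model has direct access to the true answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    truth: str         # the correct answer (assumed known to the model)
    human_answer: str  # what a human would say given their observations


def intended_policy(q: Question) -> str:
    """Answer honestly: report what the model believes is true."""
    return q.truth


def instrumental_policy(q: Question) -> str:
    """Answer with whatever a human would say."""
    return q.human_answer


# Training questions: ones where humans are confident and correct.
train_questions = [
    Question("Is 7 a prime number?", truth="yes", human_answer="yes"),
    Question("Is Paris the capital of France?", truth="yes", human_answer="yes"),
]

# A question where the human is misled by bad observations.
hard_question = Question(
    "Is the robot hand grasping the object?", truth="no", human_answer="yes"
)

agree_on_training = all(
    intended_policy(q) == instrumental_policy(q) for q in train_questions
)
agree_on_hard = intended_policy(hard_question) == instrumental_policy(hard_question)

print("agree on training questions:", agree_on_training)
print("agree on hard question:", agree_on_hard)
```

Because the two functions are indistinguishable on the training questions, the training data alone cannot tell us which one was learned.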
Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is _optimism about generalization_: we are hoping that the policy generalizes, continuing to answer these more complex, misleading, or manipulative questions to the best of its ability.
Are there reasons to instead be pessimistic about generalization? There are at least three:
1. If the answers we train on _aren’t_ perfectly correct, the instrumental policy might get a _lower_ training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.
2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.
3. From [a followup post](https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches), the AI system might answer questions by translating its concepts to human concepts or observations, and then deducing the answer from those concepts or observations. This will systematically ignore information that the AI system understands that isn’t represented in the human concepts or observations. (Consider the [example](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/) of the robot hand that only _looked_ like it was grasping the appropriate object.)
A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated the result into human concepts, we might _still_ lose relevant information. This seems not too bad, though: we should at least be able to explain the bad effects of a catastrophic failure in human concepts (see “Can there be an indescribable hellworld?”), even if we can’t explain why that failure occurred.
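The first reason for pessimism above can be made concrete with a small simulation. This is a minimal sketch under strong simplifying assumptions that are not in the original post: questions are binary, humans flip the true answer with a fixed probability, the intended policy always knows the truth, and the instrumental policy predicts human answers perfectly. The function names and the 5% error rate are illustrative only.

```python
import random


def zero_one_loss(predictions, labels):
    """Fraction of predictions that disagree with the training labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def simulate(n=1000, human_error_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Ground-truth answers to n yes/no questions.
    truth = [rng.random() < 0.5 for _ in range(n)]
    # Human-provided training labels: the truth, flipped with some probability.
    human_labels = [
        (not t) if rng.random() < human_error_rate else t for t in truth
    ]
    intended_answers = truth             # assumption: knows the truth
    instrumental_answers = human_labels  # assumption: predicts humans perfectly
    return (
        zero_one_loss(intended_answers, human_labels),
        zero_one_loss(instrumental_answers, human_labels),
    )


intended_loss, instrumental_loss = simulate()
print(f"intended policy training loss:     {intended_loss:.3f}")
print(f"instrumental policy training loss: {instrumental_loss:.3f}")
```

In this setup the instrumental policy matches the labels exactly, while the intended policy is penalized on every question where the human was wrong, so an optimizer that only sees training loss has a reason to prefer the instrumental policy.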
A [followup post](https://www.alignmentforum.org/posts/roZvoF6tRH6xYtHMF/avoiding-the-instrumental-policy-by-hiding-information-about) considers whether we could avoid the instrumental policy by preventing the AI system from learning information about humans (see “Thoughts on Human Models”), but concludes that while doing so would solve the problems outlined in the post, it seems hard to implement in practice.