We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.
What will the AI system do on questions where we _wouldn’t_ be confident in the answers? For example, questions that are complex, where we may be misled by bad observations, where an adversary is manipulating us, etc.
One possibility is that the AI system learned the **intended policy**, where it answers questions honestly to the best of its ability. However, there is an **instrumental policy** which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions -- _even if_ the model “knows” that the answer is incorrect.
Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is _optimism about generalization_: we are hoping that the policy generalizes to continue to answer these more complex, misleading, manipulative questions to the best of its ability.
Are there reasons to instead be pessimistic about generalization? There are at least three:
1. If the answers we train on _aren’t_ perfectly correct, the instrumental policy might get a _lower_ training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.
2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.
A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated it to human concepts, we might _still_ lose relevant information. This seems not too bad though—it seems like we should at least be able to <@explain the bad effects of a catastrophic failure@>(@Can there be an indescribable hellworld?@) in human concepts, even if we can’t explain why that failure occurred.
Planned summary for the Alignment Newsletter: