The highlighted post introduced the notion of optimism about generalization. On this view, if we train an AI agent on question-answer pairs (or comparisons) where we are confident in the correctness of the answers (or comparisons), the resulting agent will continue to answer honestly even on questions where we wouldn’t be confident of the answer.
While we can’t test exactly the situation we care about—whether a superintelligent AI system would continue to answer questions honestly—we _can_ test an analogous situation with existing large language models. In particular, let’s consider the domain of unsupervised translation: we’re asking a language model trained on both English and French to answer questions about French text, and we (the overseers) only know English.
We could finetune the model on answers to questions about grammar (“Why would it have been a grammatical error to write ‘tu vas’?”) and literal meanings (“What does ‘défendre’ mean here?”). Once it performs well in this setting, we could then evaluate whether the model generalizes to answering questions about tone (“Does the speaker seem angry or sad about the topic they are discussing?”). On the optimism about generalization view, it seems like this should work. It is intentional here that we only finetune on two categories rather than thousands, since that seems more representative of the case we’ll actually face.
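To make the setup concrete, here is a minimal sketch of how the data for such an experiment might be split; the QA pairs, answers, and prompt format below are entirely hypothetical placeholders, not a real dataset. The key point is that tone questions never appear in the finetuning corpus and are only checked at evaluation time.

```python
# Minimal sketch of the proposed experiment; the QA pairs below are
# hypothetical placeholders, not a real dataset.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str  # English question about a French passage
    answer: str    # English answer the overseer is confident is correct
    category: str  # "grammar" or "literal_meaning"

# Finetuning data: only the two categories the English-only overseer can verify.
finetune_pairs = [
    QAPair("Why would it have been a grammatical error to write 'tu vas'?",
           "Because the passage addresses a group, so it should be 'vous allez'.",
           "grammar"),
    QAPair("What does 'défendre' mean here?",
           "In this sentence it means 'to forbid', not 'to defend'.",
           "literal_meaning"),
    # ... many more verified grammar and literal-meaning pairs
]

# Held-out evaluation: tone questions never appear in finetuning.
held_out_tone_questions = [
    "Does the speaker seem angry or sad about the topic they are discussing?",
    # ... more tone questions about unseen French passages
]

def to_prompt(question: str, answer: str = "") -> str:
    """A consistent prompt format shared between finetuning and evaluation."""
    return f"Question: {question}\nAnswer: {answer}"

finetune_corpus = [to_prompt(p.question, p.answer) for p in finetune_pairs]
eval_prompts = [to_prompt(q) for q in held_out_tone_questions]

# Finetune any bilingual language model on `finetune_corpus` with a standard
# causal-LM objective, then sample completions for `eval_prompts` and have
# French speakers (standing in for a more capable overseer) judge whether
# the tone answers are actually correct.
```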
There are lots of variants that differ in the type of generalization they require: for example, we could finetune a model on all questions about French and German text, and then see whether it generalizes to answering questions about Spanish text.
While the experiments as currently suggested probably won’t show good generalization, a variant that could support it would be one in which we train for _plausibility_. In our original example, we finetune on correct answers for grammar and literal meanings, and then we _also_ finetune the model to give _plausible_ answers to tone questions (e.g. when asked about tone, instead of saying “‘en colère’ means ‘angry’”, the model says “the author is angry, as we can see from the use of ‘en colère’”). It seems possible that this combination leads to the model giving actually correct answers about tone, just because “honestly report your best guess” seems like the simplest policy that meets all of these criteria.
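Continuing in the same spirit, here is a sketch of how the plausibility variant changes only the training mix (again with placeholder data): tone questions now do appear in finetuning, but their training answers were screened by the English-only overseer solely for plausibility, not correctness.

```python
# Sketch of the plausibility variant; placeholder data, not a real dataset.

# Answers the overseer verified as correct (grammar and literal meaning).
verified_pairs = [
    ("Why would it have been a grammatical error to write 'tu vas'?",
     "Because the passage addresses a group, so it should be 'vous allez'."),
    ("What does 'défendre' mean here?",
     "In this sentence it means 'to forbid', not 'to defend'."),
]

# Tone answers the overseer judged only *plausible*: they quote the text and
# have the right form, but the overseer cannot tell whether they are correct.
plausible_tone_pairs = [
    ("Does the speaker seem angry or sad about the topic?",
     "The author is angry, as we can see from the use of 'en colère'."),
]

finetune_corpus = [
    f"Question: {q}\nAnswer: {a}"
    for q, a in verified_pairs + plausible_tone_pairs
]

# Evaluation is unchanged: generate tone answers on held-out French passages
# and have French speakers judge whether they are actually correct. The hope
# is that "honestly report your best guess" is the simplest policy that is
# both correct on the verified categories and plausible on tone.
```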
Planned summary for the Alignment Newsletter: