Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds consistent with that observation.
Abstracting out Answer, let's just imagine that our AI outputs a distribution p over the space of trajectories S in the human ontology, and somehow we define a reward function r(p, ω) that the human evaluates in hindsight after getting the observation ω. The idea is that this reward is calculated by having the AI answer some questions about what it believes, etc., but we'll abstract all of that out.
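To make that abstraction concrete, here's a minimal toy sketch (the world space, the observation function, and this particular choice of r are all made up for illustration):

```python
# Toy stand-ins: S is the space of trajectories in the human ontology, obs(s) is the
# observation the human ends up with, and r(p, omega) is a hindsight reward the human
# computes from the AI's reported distribution p once they have seen omega.
S = ["w1", "w2", "w3", "w4"]

def obs(s):
    # Coarse observation: the human can't tell w1 from w2, or w3 from w4.
    return "A" if s in ("w1", "w2") else "B"

def r(p, omega):
    # One possible hindsight reward: how much of p's mass is on trajectories
    # consistent with what was actually observed.
    return sum(prob for s, prob in p.items() if obs(s) == omega)

# If the true trajectory is w1, the human observes "A".
omega = obs("w1")

intended_answer = {"w1": 1.0}             # the model's own (sharp) belief
human_answer = {"w1": 0.5, "w2": 0.5}     # spread over every world consistent with "A"

print(r(intended_answer, omega), r(human_answer, omega))  # -> 1.0 1.0
```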
Then the conclusion in this post holds under some convexity assumption on r, since in that case spreading out your mass can't really hurt you (the human has no way to prefer your pointy estimate). But if, e.g., you just penalized p for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that p satisfy various conditional independence properties, then we may rule out HumanAnswer.
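As a crude numerical illustration, continuing the toy above (the entropy penalty and its weight are arbitrary choices, just to make the point): the "mass on consistent worlds" reward is linear, hence convex, in p, so the spread-out answer ties with the pointy one; adding an explicit uncertainty penalty breaks the tie in IntendedAnswer's favor.

```python
import math

S = ["w1", "w2", "w3", "w4"]

def obs(s):
    return "A" if s in ("w1", "w2") else "B"

def mass_on_consistent(p, omega):
    # Linear (hence convex) in p: the human only checks mass on worlds consistent
    # with omega, so it has no way to prefer a pointy estimate over a spread-out one.
    return sum(prob for s, prob in p.items() if obs(s) == omega)

def entropy(p):
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def penalized_reward(p, omega, lam=0.5):
    # Same reward, but now p is explicitly penalized for being uncertain.
    return mass_on_consistent(p, omega) - lam * entropy(p)

omega = "A"
intended = {"w1": 1.0}
human = {"w1": 0.5, "w2": 0.5}

print(mass_on_consistent(intended, omega), mass_on_consistent(human, omega))  # 1.0 vs 1.0: a tie
print(penalized_reward(intended, omega), penalized_reward(human, omega))      # 1.0 vs ~0.65: pointy wins
```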
The more precise bad behavior, InstrumentalAnswer, is to output the distribution argmax_p E_{ω ∼ W′}[r(p, ω)]. Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer, though it could end up being slightly more computationally complex. I think everything I've said about this case still applies to InstrumentalAnswer, but it becomes relevant when I start talking about stuff like conditional independence requirements between the model's answers.
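Here's a sketch of InstrumentalAnswer in the same toy setting. I'm brute-forcing the argmax over a small, made-up candidate set of distributions rather than over all of Δ(S), and W′ is an invented predictive distribution over observations:

```python
S = ["w1", "w2", "w3", "w4"]

def obs(s):
    return "A" if s in ("w1", "w2") else "B"

def r(p, omega):
    return sum(prob for s, prob in p.items() if obs(s) == omega)

# The predictor's own distribution over which observation will come up (W' in the text).
W_prime = {"A": 0.7, "B": 0.3}

def expected_reward(p):
    return sum(w * r(p, omega) for omega, w in W_prime.items())

candidates = [
    {"w1": 1.0},                                        # IntendedAnswer-style point estimate
    {"w1": 0.5, "w2": 0.5},                             # HumanAnswer-style spread given "A"
    {"w1": 0.35, "w2": 0.35, "w3": 0.15, "w4": 0.15},   # something in between
]

# InstrumentalAnswer: report whichever p maximizes expected hindsight reward under W'.
instrumental = max(candidates, key=expected_reward)
print(instrumental, expected_reward(instrumental))
# By construction, no other candidate can get a higher expected reward than this.
```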