This is a conversation between Scott and me about the <@relative dangers of human modeling@>(@Thoughts on Human Models@), moderated by Eli Tyre. From a safety perspective, the main reason to _avoid_ human modeling is that it keeps the agent’s cognition much “further away” from manipulation of humans; for example, it seems less likely that your AI system tricks people into launching nukes if it never learned very much about humans in the first place. The main counterargument is that this precludes human oversight of the agent’s cognition (since if humans are overseeing the agent’s cognition, the agent is likely to learn about humans in order to satisfy that oversight), and such oversight could plausibly provide a large boost to safety. It also seems like systems that don’t model humans will have a hard time performing many useful tasks, though the conversation mostly did not touch on this point.
Scott’s position is that, given that there are two quite different risks (manipulation worries vs. learning the wrong cognition due to poor oversight), it seems worthwhile to put some effort into addressing each one, and avoiding human models is much more neglected than improving human oversight. My position is that it seems much less plausible that there is a path to success in which we do very little human modeling, and so I want a lot more work along the oversight path. I _do_ think it is worth differentially pushing AI systems towards tasks that don’t require much human modeling (e.g. physics and engineering) rather than ones that do (e.g. sales and marketing), but this seems roughly independent of technical work, at least currently.