Hm, I thought that was what Evan called it, but maybe I misheard. Anyhow, I mean the problem where because you can model humans in different ways, we have no unique utility function. We might think of this as having not just one Best Intentional Stance, but a generalizable intentional stance with knobs and dials on it, different settings of which might lead to viewing the subject in different ways.
I call such real-world systems that can be viewed non-uniquely through the lens of the intentional stance “approximate agents.”
To the extent that mesa-optimizers are approximate agents, this raises familiar and difficult problems for interpretability. Checking how good an approximation is can require knowing about the environment the system will be deployed in, which (that being the future) is hard.