Great interview! Weird question—did Rob Miles get a sneak peek at this interview, given that he just did a video on the same paper?
The biggest remaining question I have is a follow-up on the question you asked, “Am I a mesa-optimizer, and if so, what’s my mesa-objective?” You spend some time talking about lookup tables, but I wanted to hear about human-esque “agents” that seem to do planning, yet have a very serious determination problem for their values. Is Evan’s idea to import some “solution to outer alignment” to these agents (note that such a solution can’t be like HCH)? Or is it a serial vs. parallel compute argument that human-level underdetermination of values will be rare (though that argument seems dubious)?
If Rob got a sneak peek, he managed to do so without my knowledge.
I don’t totally understand the other question you’re asking: in particular, what you’re thinking of as the “determination problem”.
Hm, I thought that was what Evan called it, but maybe I misheard. Anyhow, I mean the problem that, because you can model humans in different ways, there is no unique utility function we can ascribe to them. We might think of this as having not just one Best Intentional Stance, but a generalizable intentional stance with knobs and dials on it, where different settings lead to viewing the subject in different ways.
I call such real-world systems that can be viewed non-uniquely through the lens of the intentional stance “approximate agents.”
To the extent that mesa-optimizers are approximate agents, this raises familiar and difficult problems for interpretability. Checking how good an approximation is can require knowing the environment the system will be put into, which, since that environment lies in the future, is hard.