Then we get an agent with an incentive to stop any human present in the environment from becoming too good.
No, this modification stops people from actually optimizing R if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.
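For concreteness, here is a minimal sketch of the sort of simple measure at issue, assuming an attainable-utility-style penalty in the spirit of the Reframing Impact sequence: penalize an action by how much it changes the agent’s ability to pursue a set of auxiliary reward functions, relative to doing nothing. The function names, the `lam` coefficient, and the auxiliary Q-functions are illustrative, not from the original discussion.

```python
def aup_penalty(q_aux, state, action, noop, lam=0.01):
    """Attainable-utility-style impact penalty: how much does `action` change
    the agent's ability to pursue each auxiliary goal, relative to doing nothing?

    q_aux -- iterable of callables (state, action) -> estimated Q-value, one per
             auxiliary reward function (these can be randomly generated).
    lam   -- illustrative trade-off coefficient between task reward and penalty.
    """
    change = sum(abs(q(state, action) - q(state, noop)) for q in q_aux)
    # Normalize by the attainable utility of doing nothing so the penalty is
    # roughly scale-free across states; guard against a zero denominator.
    scale = sum(abs(q(state, noop)) for q in q_aux) or 1.0
    return lam * change / scale


def shaped_reward(task_reward, q_aux, state, action, noop, lam=0.01):
    """Task reward minus the impact penalty."""
    return task_reward(state, action) - aup_penalty(q_aux, state, action, noop, lam)
```

Nothing in this sketch encodes particulars of the environment: the auxiliary reward functions can be generated randomly, which is the sense in which such measures stay simple and compact as the environment gets more complex.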