After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is “change in my ability to get what I want”, i.e. change in attainable value under the true human utility function. This is a broad statement that does not specify how to measure “change”: in particular, it does not pin down what the change is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply an absolute value).

Your interpretation uses the previous state as the baseline and does not apply an absolute value to the difference. This is a specific and nonstandard instantiation of the impact concept, and the undesirable property you described does not hold for other instantiations, e.g. a stepwise inaction baseline with an absolute value: Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|. So I don’t think it’s fair to argue, based on this one instantiation, that it doesn’t make sense to regularize the RI notion of impact.
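To make the contrast concrete, here is a minimal sketch of that instantiation, assuming the expectations are approximated by Monte Carlo sampling; value_fn, env.sample_next, and noop are hypothetical placeholders rather than anything defined in the sequence:

```python
def impact(env, state, action, value_fn, noop, n_samples=100):
    """Sketch of Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|.

    value_fn(state) stands in for the true human value of a state;
    env.sample_next(state, a) samples a successor state, so each
    expectation is approximated by averaging over samples.
    """
    def expected_value(a):
        total = 0.0
        for _ in range(n_samples):
            total += value_fn(env.sample_next(state, a))
        return total / n_samples

    # Stepwise inaction baseline: compare against doing nothing this step,
    # and take the absolute value so increases and decreases in expected
    # value are treated symmetrically.
    return abs(expected_value(action) - expected_value(noop))
```

The two choices that matter here are the baseline (doing nothing this step, rather than the previous state) and the absolute value (penalizing changes in either direction).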
AU theory says that people feel impacted as new observations change their on-policy value estimate (so it’s the TD error). I agree with Rohin’s interpretation as I understand it.
However, AU theory is descriptive – it describes when and how we feel impacted, but not how to build agents which don’t impact us much. That’s what the rest of the sequence talked about.
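As a rough sketch of the descriptive claim, assuming a learned on-policy value estimate V (value_estimate, reward, and gamma below are hypothetical placeholders):

```python
def felt_impact(value_estimate, state, next_state, reward, gamma=0.99):
    """Sketch: the felt impact at a step is the TD error, i.e. how much
    the new observation shifts the on-policy value estimate.

    td_error = r + gamma * V(s') - V(s); its magnitude is how impacted
    the person feels, and its sign whether the surprise is good or bad.
    """
    td_error = reward + gamma * value_estimate(next_state) - value_estimate(state)
    return td_error
```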