Using the inaction baseline in the driving example amounts to comparing against the other driver never leaving their garage (rather than falling asleep at the wheel).
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice (see the sketch below), but it’s not clear how to apply that to humans.
(I only have this objection when trying to explain what “impact” means to humans; it seems fine in the RL setting. I do think we’ll probably stop relying on the episode abstraction eventually, so this notion of impact would eventually need to stop relying on it as well, but plausibly that can be dealt with in the future.)
Also, under this inaction baseline, the roads are perpetually empty, and so you’re always feeling impact from the fact that you can’t zoom down the road at 120 mph, which seems wrong.
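For concreteness, here’s a rough sketch of the RL version I have in mind (Python, with a hypothetical copyable gym-style environment; `impact_penalty`, `NOOP`, and the `distance` function are all stand-ins, not anyone’s actual implementation): the inaction baseline is a no-op rollout branched from the episode start, and impact is the divergence from it. Note how the starting point is baked in by `reset()`, which is exactly the part with no obvious human analogue.

```python
import copy

def impact_penalty(env, agent_policy, baseline_policy, horizon, distance):
    """Deviation of the agent's rollout from a baseline rollout.

    Both rollouts branch from the same episode-start state: the RL
    formalism hands us a canonical starting point via reset(), which
    is exactly what a human mid-life doesn't have.
    """
    state = env.reset()              # episode start anchors the baseline
    branch = copy.deepcopy(env)      # assumes the env object is copyable
    s_agent, s_base = state, state
    penalty = 0.0
    for _ in range(horizon):
        s_agent, _, done_a, _ = env.step(agent_policy(s_agent))
        s_base, _, done_b, _ = branch.step(baseline_policy(s_base))
        penalty += distance(s_agent, s_base)  # "impact" = divergence from baseline
        if done_a or done_b:
            break
    return penalty

NOOP = 0  # hypothetical no-op action index

def inaction(state):
    """The inaction baseline: the reference agent never acts."""
    return NOOP
```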
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world had unfolded as expected.
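Concretely, in the sketch above the only change is swapping the reference policy: instead of the no-op policy you pass a model of what the human expected the agent to do, and you still have to simulate a world that never happened. (All the names here are stand-ins: `expected_policy` is a hypothetical predictive model, and `agent_policy` / `state_distance` are whatever you were already using.)

```python
# Same counterfactual machinery; only the reference policy differs.
penalty = impact_penalty(env,
                         agent_policy=agent_policy,
                         baseline_policy=expected_policy,  # hypothetical model of the human's prediction
                         horizon=100,
                         distance=state_distance)
```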
Sorry, what I meant to imply was “baselines are counterfactuals, and counterfactuals are hard, so maybe no ‘natural’ baseline exists”. I certainly agree that my baseline is a counterfactual.
On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it can still serve as a good source of intuition about what feels impactful to humans.
Yes, that’s my main point. I agree that there’s no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don’t always apply (even when interpreted by humans).