I certainly agree that there are problems with the stepwise inaction baseline, and it’s probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example amounts to comparing against the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it’s an open question how to design a baseline that satisfies all the criteria we consider sensible (and whether it’s even possible).
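To make the contrast concrete, here is a minimal sketch (my own illustration, not code from any existing impact-measure implementation) of the two baselines in a gym-style environment: the inaction baseline rolls out no-ops from the episode start, while the stepwise inaction baseline takes a single no-op from the previous actual state. The NOOP action, the environment interface, and the deviation function are all assumptions for the sake of the example.

```python
import copy

NOOP = 0  # hypothetical index of the environment's no-op action


def inaction_baseline(env_at_start, t):
    """Baseline state at time t if the agent had never acted: no-ops from the episode start."""
    sim = copy.deepcopy(env_at_start)  # assumes the env can be copied for simulation
    state = sim.reset()
    for _ in range(t):
        state, _, done, _ = sim.step(NOOP)
        if done:
            break
    return state


def stepwise_inaction_baseline(env_at_prev_state):
    """Baseline state at time t: a single no-op taken from the actual state at time t-1."""
    sim = copy.deepcopy(env_at_prev_state)
    state, _, _, _ = sim.step(NOOP)
    return state


def impact_penalty(current_state, baseline_state, deviation):
    """Generic penalty: how far the actual state is from the chosen baseline state,
    for some deviation measure (e.g. a reachability- or attainable-utility-style difference)."""
    return deviation(current_state, baseline_state)
```

In the driving example, the first function corresponds to the other driver never leaving their garage, and the second to them falling asleep at the wheel at the last moment.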
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened had the world unfolded as expected. It also requires a lot of information from the human, which is subjective and may be hard to elicit. What a human expected to happen in a given situation may not even be well-defined if they have internal disagreement. For example, even if I feel surprised by someone’s behavior, there is often a voice in my head saying “this was actually predictable from their past behavior, so I should have known better”. On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Using the inaction baseline in the driving example amounts to comparing against the other driver never leaving their garage (rather than falling asleep at the wheel).
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it’s not clear how to apply that to humans.
(I only have this objection when trying to explain what “impact” means to humans; it seems fine in the RL setting. I do think we’ll eventually stop relying on the episode abstraction, so at some point we would need to stop relying on it here too, but plausibly that can be dealt with in the future.)
Also, under this inaction baseline, the roads are perpetually empty, and so you’re always feeling impact from the fact that you can’t zoom down the road at 120 mph, which seems wrong.
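To pin down the objection, here is one generic way of writing the two baselines (the notation is mine, not from either of our write-ups, and it assumes deterministic dynamics for simplicity): let $T_\varnothing$ be the transition when the agent being evaluated does nothing, $s_0$ the chosen starting state, $s_t$ the actual state at time $t$, and $d$ a deviation measure with scaling $\lambda$. Then

$$b^{\text{inaction}}_t = T_\varnothing^{\,t}(s_0), \qquad b^{\text{stepwise}}_t = T_\varnothing(s_{t-1}), \qquad \text{penalty}_t = \lambda\, d(s_t, b_t).$$

The inaction baseline is only pinned down once $s_0$ is chosen (the episode start in RL, unclear for humans), and its rollout is one in which the other drivers never left their garages, so $d(s_t, b^{\text{inaction}}_t)$ keeps registering impact under perfectly ordinary driving.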
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened had the world unfolded as expected.
Sorry, what I meant to imply was “baselines are counterfactuals, and counterfactuals are hard, so maybe no ‘natural’ baseline exists”. I certainly agree that my baseline is a counterfactual.
On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Yes, that’s my main point. I agree that there’s no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don’t always apply (even when interpreted by humans).