I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, so it would include both the injury to my foot and the loss of my wallet. The impact of person X on me would be computed relative to the stepwise inaction baseline, where person X does nothing (but person Y still steals my wallet), and vice versa.
When we use impact as a regularizer, we are interested in the impact caused by the agent, so we use the stepwise inaction baseline. It wouldn’t make sense to use total impact as a regularizer, since it would penalize the agent for impact from all sources.
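As a rough sketch only (with `env_step`, `noop`, and `deviation` as hypothetical stand-ins for an environment model, a no-op action, and a state-difference measure such as relative reachability or attainable utility), the two baselines enter an impact penalty as follows:

```python
# Hypothetical placeholders: env_step(state, action) -> next state,
# noop = the "do nothing" action, deviation(s1, s2) -> how different two states are.

def previous_state_penalty(state, action, env_step, deviation):
    """Total impact: compare the new state to the previous state,
    so changes from all sources (agent or not) are penalized."""
    next_state = env_step(state, action)
    return deviation(next_state, state)

def stepwise_inaction_penalty(state, action, env_step, deviation, noop):
    """Agent-caused impact: compare the new state to what would have happened
    had the agent done nothing this step (other sources still act)."""
    next_state = env_step(state, action)
    baseline_state = env_step(state, noop)
    return deviation(next_state, baseline_state)

def regularized_reward(task_reward, impact_penalty, beta=1.0):
    """Impact-regularized objective: task reward minus a scaled penalty."""
    return task_reward - beta * impact_penalty
```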
If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
To the extent that there is a natural choice (counterfactuals are hard), I think it would be “what the human expected the agent to do” (the same sort of reasoning that led to the previous state baseline).
This gives the same answer as the stepwise inaction baseline in your example (because usually we don’t expect a specific person to step on our feet or to steal our wallet).
An example where it gives a different answer is in driving. The stepwise inaction baseline says “impact is measured relative to all the other drivers going comatose”, so in the baseline state many accidents happen, and you get stuck in a huge traffic jam. Thus, all the other drivers are constantly having a huge impact on you by continuing to drive!
In contrast, the baseline of “what the human expected the agent to do” gets the intuitive answer—the human expected all the other drivers to drive normally, and so normal driving has ~zero impact, whereas if someone actually did fall comatose and cause an accident, that would be quite impactful.
EDIT: To be clear, I think this is the “natural choice” if you want to predict what humans would say is impactful; I don’t have a strong opinion on what the “natural choice” would be if you wanted to successfully prevent catastrophe via penalizing “impact”. (Though in this case the driving example still argues against stepwise inaction.)
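Schematically (with the same hypothetical `env_step` and `deviation` placeholders, plus a hypothetical `expected_policy` model of what the human expects the agent to do; actually obtaining such a model is the hard part, as discussed below), this baseline branches the counterfactual from the human's expectation rather than from inaction:

```python
def expectation_baseline_state(state, env_step, expected_policy):
    """State reached if the agent had done what the human expected
    (e.g. the other drivers drive normally)."""
    expected_action = expected_policy(state)
    return env_step(state, expected_action)

def expectation_penalty(state, action, env_step, deviation, expected_policy):
    """Impact relative to the human's expectation: normal driving is ~zero,
    falling comatose at the wheel is highly impactful."""
    next_state = env_step(state, action)
    baseline_state = expectation_baseline_state(state, env_step, expected_policy)
    return deviation(next_state, baseline_state)
```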
I certainly agree that there are problems with the stepwise inaction baseline and it’s probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it’s an open question how to design a baseline that satisfies all the criteria we consider sensible (and whether it’s even possible).
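As a sketch of the structural difference (again with hypothetical `env_step` and `noop` placeholders), the inaction baseline rolls out no-ops from the start of the episode, while the stepwise form branches off the actual previous state for a single step:

```python
def inaction_baseline_state(initial_state, t, env_step, noop):
    """Do nothing from the episode start up to time t
    (the other driver never leaves their garage)."""
    state = initial_state
    for _ in range(t):
        state = env_step(state, noop)
    return state

def stepwise_inaction_baseline_state(prev_state, env_step, noop):
    """Do nothing for just the current step, starting from the actual
    previous state (the driver falls comatose at the wheel)."""
    return env_step(prev_state, noop)
```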
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world unfolded as expected. It also requires a lot of information from the human, which is subjective and may be hard to elicit. What a human expected to happen in a given situation may not even be well-defined if they have internal disagreement—e.g. even if I feel surprised by someone’s behavior, there is often a voice in my head saying “this was actually predictable from their past behavior so I should have known better”. On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel).
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it’s not clear how to apply that to humans.
(I only have this objection when trying to explain what “impact” means to humans; it seems fine in the RL setting. I do think we’ll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)
Also, under this inaction baseline, the roads are perpetually empty, and so you’re always feeling impact from the fact that you can’t zoom down the road at 120 mph, which seems wrong.
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world unfolded as expected.
Sorry, what I meant to imply was “baselines are counterfactuals, and counterfactuals are hard, so maybe no ‘natural’ baseline exists”. I certainly agree that my baseline is a counterfactual.
On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Yes, that’s my main point. I agree that there’s no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don’t always apply (even when interpreted by humans).