I just ran across this post and your more recent post. (There are long time intervals when I do not read LW/AF.) Some quick thoughts/pointers:
On the surface, this looks very similar to Armstrong’s indifference methods and/or Counterfactual Planning. Maybe also related to LCDT.
However, I cannot really tell how similar it is to any of these, because I can’t follow your natural-language-plus-mathematical description of how exactly you construct your intended counterfactuals. When I say ‘can’t follow’, I mean that I do not understand your description of your design well enough that I could actually implement your method in a simple ML system in a simple toy world.
This lack of exhaustive detail is a common problem when counterfactuals are discussed here, but it can be solved. If you want some examples of how to write detailed and exhaustive explanations of counterfactual stop button designs and how they relate to machine learning, see here and here.
I think my proposal is very different from Armstrong’s proposal, because I am proposing counterfactuals over human behavior (or human values, or similar), while Armstrong is proposing counterfactuals over the AI’s values.
I’ve only briefly had time to skim Counterfactual Planning, as it is late in my timezone, but it seems somewhat more similar to my proposal than Armstrong’s is. However, if I’m reading it right, it proposes counterfactuals over the physical world, whereas I am proposing counterfactuals over human behavior.
I think my proposal is better suited for a sort of “affirmative cooperation”, where the AI assists with shutdown when people want to stop it, and that it is less likely to run into “nearest unblocked strategy” concerns, but also that it faces more out-of-distribution (OOD) problems than yours.
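To make the contrast concrete, here is a purely schematic sketch in Python. The toy names and dynamics are my own invention, and the comments only mark where each approach, as I characterized it above, would place its counterfactual intervention; this is not a faithful implementation of any of the three.

```python
# Schematic only: it marks where each approach, as characterized above,
# places its counterfactual intervention inside the AI's planning model.
# It is not a faithful implementation of any of the three approaches.

def human_model(world_state):
    """The AI's model of human behaviour.
    My proposal: the counterfactual intervenes here, by swapping the
    modelled human's behaviour (or values)."""
    return "wait"

def world_model(world_state, ai_action, human_action):
    """The AI's model of the physical world.
    Counterfactual Planning, as I read it: the counterfactual intervenes here."""
    return dict(world_state, button_pressed=(human_action == "press_stop_button"))

def ai_utility(world_state):
    """The AI's own values/reward.
    Armstrong's indifference methods, as I read them: the intervention
    (or correction term) is applied here, to the AI's values."""
    return 0.0

def plan_step(world_state, ai_action):
    """One step of the AI's internal planning rollout, tying the pieces together."""
    next_state = world_model(world_state, ai_action, human_model(world_state))
    return next_state, ai_utility(next_state)

print(plan_step({"button_pressed": False}, "noop"))
```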
If I’m reading it correctly, and thinking about it correctly, then LCDT wouldn’t work for multi-step plans at all, except in very “nice” and isolated environments.
For the purpose of a toy world ML system, maybe this will help:
In a gridworld, the AI might model the human as an optimizer of a reward function, with some (probably high) exploration rate, just so that this model is not immediately disproven by real-world human behavior, which probably is not an optimizer of a reward function. The counterfactuals then correspond to swapping out the reward function in the AI’s human model for one that gets either +reward or -reward for pressing the stop button.
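To make that a bit more concrete, here is a minimal sketch of the kind of thing I have in mind. The action names, reward values, and exploration rate are placeholders I am making up for illustration; only the overall shape (an epsilon-greedy human model whose reward function gets swapped out) matters.

```python
# Minimal sketch of the gridworld human model described above.
# Action names, reward values, and the exploration rate are placeholders.

ACTIONS = ["up", "down", "left", "right", "press_stop_button"]

def human_policy(state, reward_fn, epsilon=0.3):
    """The AI's model of the human: epsilon-greedy with respect to reward_fn.

    The (deliberately high) exploration rate keeps the model from being
    immediately falsified by real human behaviour, which is not literally
    reward-optimal. Returns a distribution over the human's next action.
    """
    greedy = max(ACTIONS, key=lambda a: reward_fn(state, a))
    probs = {a: epsilon / len(ACTIONS) for a in ACTIONS}
    probs[greedy] += 1.0 - epsilon
    return probs

# The two counterfactuals: swap the reward function inside the AI's human
# model for one that rewards, or penalizes, pressing the stop button.
def wants_shutdown(state, action):
    return 1.0 if action == "press_stop_button" else 0.0

def does_not_want_shutdown(state, action):
    return -1.0 if action == "press_stop_button" else 0.0

for label, reward_fn in [("wants shutdown", wants_shutdown),
                         ("does not want shutdown", does_not_want_shutdown)]:
    p = human_policy(state=None, reward_fn=reward_fn)["press_stop_button"]
    print(f"P(press stop button | counterfactual human {label}) = {p:.2f}")

# The AI's planner would then evaluate its candidate plans under both
# counterfactual human models, rather than under a single best guess
# about the actual human's reward function.
```

Note that the counterfactual here is purely a change inside the AI’s model of the human; nothing about the modelled physical world is altered.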
Not sure whether my different explanation here helps? 😅
It should be noted that this version is somewhat different from the proposal in the post: this one is optimized to fit into a gridworld, while the post’s proposal is more intended for a real-world deployed system.
Thanks, I will look into this (when I have more time, i.e. not this evening, as I need to go to sleep 😅).