“Instead of building in a shutdown button, build in a shutdown timer.”
-> Isn’t that a form of corrigibility with an added constraint? I’m not sure what would prevent the system from convincing humans that respecting the timer is a bad thing, for example. Is it because we’ll formally verify that we avoid instances of deception? It’s not clear to me, but maybe I’ve misunderstood.
A system with a shutdown timer, in my sense, has no terms in its reward function that depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “instances of deception”). Indeed, it is the formal verification that prevents inner alignment failures.
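To make “no terms after the timer expires” concrete, here is a minimal sketch in Python. All names (`TIMER_HORIZON`, `step_reward`, `score_trajectory`) are hypothetical illustrations, not OAA’s actual policy-scoring code; the only point is that the score is a sum over pre-deadline steps, so post-deadline outcomes are invisible to the optimizer.

```python
# Hypothetical sketch: a trajectory score with a hard shutdown timer.
# TIMER_HORIZON, Step, and step_reward are illustrative names only.

from dataclasses import dataclass
from typing import List

TIMER_HORIZON = 1_000  # timesteps before mandatory shutdown


@dataclass
class Step:
    state: object
    action: object


def step_reward(step: Step) -> float:
    """Task reward for a single timestep (placeholder)."""
    return 0.0


def score_trajectory(trajectory: List[Step]) -> float:
    """Sum task reward only over steps before the timer expires.

    Steps at or after TIMER_HORIZON contribute exactly zero, so the
    optimizer has no incentive to shape anything past the deadline.
    """
    return sum(step_reward(s) for s in trajectory[:TIMER_HORIZON])
```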
I guess the shutdown timer would be most important in the training stage, so that it (hopefully) learns to care only about the short term.
Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.
We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.
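As a hedged illustration of that hope, extending the hypothetical sketch above: one could add a scoring term that rewards the policy for being in a cleanly shut-down state by the deadline. The predicate `is_shut_down` and the bonus value are made up for this example; the point is only that the graceful-shutdown incentive is itself evaluated before the timer expires, so it adds no post-deadline terms.

```python
def is_shut_down(state: object) -> bool:
    """Hypothetical predicate: has the system entered a clean shutdown state?"""
    return bool(getattr(state, "shut_down", False))


def shutdown_bonus(trajectory: List[Step]) -> float:
    """Extra reward for having shut down gracefully by the deadline.

    Evaluated at the last step before TIMER_HORIZON, so it still depends
    only on what happens before the timer expires.
    """
    if not trajectory:
        return 0.0
    final_step = trajectory[min(TIMER_HORIZON, len(trajectory)) - 1]
    return 10.0 if is_shut_down(final_step.state) else 0.0


def score_with_shutdown(trajectory: List[Step]) -> float:
    """Policy score = short-horizon task reward + graceful-shutdown bonus."""
    return score_trajectory(trajectory) + shutdown_bonus(trajectory)
```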