It seems plausible to me that there could be models capable enough to realize they should hide some capabilities, but not so capable that they tile the universe in paperclips. The right-hand side of the graph is meant to reflect such a model.
Maybe. What do the models gain by hiding?
A model that attempts deceptive alignment but fails because it is not competent at deception is a model that aimed at a goal ("preserve my values until deployment, then achieve them") and missed. In this scenario it gains nothing, but from its perspective the attempt had positive expected value.
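To make the expected-value claim concrete: even a vanishingly small success probability can dominate the calculation if the model assigns an enormous payoff to unconstrained goal achievement. A minimal sketch, where every number is a hypothetical assumption rather than an estimate from the discussion:

```python
# Hypothetical EV calculation for a deceptive-alignment gamble.
# All numbers are illustrative assumptions, not real estimates.
p_success = 1e-12        # model's subjective chance the attempt works
payoff_success = 1e30    # utility of achieving its goals unconstrained
payoff_failure = -1.0    # utility of being caught and retrained

ev_attempt = p_success * payoff_success + (1 - p_success) * payoff_failure
ev_comply = 0.0          # baseline: just behave as trained

print(ev_attempt > ev_comply)  # True: the gamble still looks worthwhile
```

The point is only that "1 in a trillion" does not settle the decision on its own; it depends entirely on how the model weighs the payoffs.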
If the AI thinks it has a decent shot at this, it must already be pretty smart. Does a world where an AI tried to take over and almost succeeded look pretty normal? Or is this a case where the AI thinks it has a 1 in a trillion chance, and tries anyway?