I read “capable of X” as meaning something like “if the model were actively trying to do X, then it would do X”. I.e., a misaligned model doesn’t reveal the vulnerability to humans during testing because it doesn’t want them to patch it, but then later exploits that same vulnerability during deployment because it’s trying to hack the computer system.
Which of (1)-(7) above would falsify the hypothesis if observed? Or, if there isn’t enough information, what additional information would you need to tell whether the hypothesis has been falsified?