I read “capable of X” as meaning something like “if the model were actively trying to do X, then it would do X”. I.e., a misaligned model doesn’t reveal the vulnerability to humans during testing because it doesn’t want them to patch it, but then later exploits that same vulnerability during deployment because it’s trying to hack the computer system.
Which of (1)-(7) above would falsify the hypothesis if observed? Or, if there isn’t enough information, what additional information would you need to tell whether the hypothesis has been falsified?