What do you think the value of this is? I expect (80%) that you could produce a paper similar to the alignment-faking paper in a sandbagging context, especially as models get smarter.
Scientifically, there seems to be little value. It could serve as another demonstration that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions would be made differently because of this research.
Now see if you can catch sandbagging in the scratchpad!