What do you think the value of this is? I expect (80%) that you could produce a paper similar to the alignment-faking paper in a sandbagging context, especially as models get smarter.
Scientifically, there seems to be little value. It could serve as another demonstration that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions would be made differently because of this research.
Now see if you can catch sandbagging in the scratchpad!