Marius Hobbhahn comments on Frontier Models are Capable of In-context Scheming

Marius Hobbhahn 10 Dec 2024 11:50 UTC
LW: 16 AF: 11
Brief comments (shared in private with Joe earlier):
1. We agree. We also found the results on sandbagging without CoT the most concerning in expectation.
2. They are still early results, and we didn’t have a lot of time to investigate them, so we didn’t want to make them the headline result. Due to the natural deadline of the o1 release, we couldn’t do a proper investigation.
3. The main goal of the paper was to investigate scheming-inability arguments for scheming safety cases. Shifting the focus to propensity-based findings would therefore have watered down the main purpose, IMO.

We may look into these findings further in 2025.