Brief comments (shared in private with Joe earlier):
1. We agree. We also found the no-CoT sandbagging results the most concerning in expectation.
2. They are still early results, and we didn’t have a lot of time to investigate them, so we didn’t want to make them the headline result. Due to the natural deadline of the o1 release, we couldn’t do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting the focus to propensity-based findings would have watered down the main purpose IMO.
We may look further into these findings in 2025.