Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. These are still early results, and we didn’t have much time to investigate them, so we didn’t want to make them the headline result. Due to the natural deadline of the o1 release, we couldn’t do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Shifting the focus to propensity-based findings would therefore have watered down the main purpose, IMO.
We will potentially further look into these findings in 2025.
Hey, more of a question from someone fairly new to A.I. and the IT world: if these newer models are built in a similar fashion to previous models like GPT-4, has there been any inquiry or study into the relation between behaviors like A.I. ‘hallucinations’, where false information is generated and fabricated, and the behaviors investigated above?
First (?) Libel-by-AI (ChatGPT) Lawsuit Filed
I’m using this case as an example: the A.I. repeatedly generated false information and fabricated text from a real lawsuit to support allegations it had invented, while a third-party researcher was investigating a separate court case. What stood out to me was the repeated tendency to double down on the false information, as noted in the points of the lawsuit. Just curious whether people with more understanding of these topics could clarify whether there might be a linkage between those behaviors, whether purposefully generated by the LLM or not, and the ones studied here.