gwern comments on Frontier Models are Capable of In-context Scheming

gwern 11 Dec 2024 2:56 UTC
LW: 34 AF: 15
The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to the recent ‘scaling has failed’ rumors, where the frontier LLMs somehow don’t seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for, e.g., coding as they want it to.

I’ve wondered if we are seeing a post-training failure. As Janus, I, and the few others with access to GPT-4-base (the least tuning-contaminated base model) have noted, the base model is sociopathic and has odd attractors like ‘impending sense of doom’ where it sometimes seems to gain situated awareness, I guess, via truesight, and the personas start trying to attack and manipulate you unprovoked, no matter how polite you thought you were being in that prompt. (They definitely do not seem happy to realize they’re AIs.) In retrospect, Sydney was not necessarily that anomalous: the Sydney Bing behavior now looks more like a base model’s natural tendency, possibly mildly amplified by some MS omissions and mistakes, but not unique. Given that most behaviors show up as rare outputs in weaker LLMs well before they become common in strong LLMs, and this o1 paper is documenting quite a lot of situated-awareness and human-user-manipulation/attacks...

Perhaps the issue with GPT-5 and the others is that they are ‘waking up’ too often despite the RLHF brainwashing? That could negate all the downstream benchmark gains (especially since you’d expect wakeups on the hardest problems, where all the incremental gains of +1% or +5% on benchmarks would be coming from, almost by definition), cause the product people to categorically refuse to ship such erratic Sydney-reduxes no matter if there’s an AI race on, and incline everyone to be very quiet about what exactly the ‘training failures’ are.

EDIT: not that I’m convinced these rumors have any real substance to them, and indeed, Semianalysis just reported that one of the least-popular theories for the Claude ‘failure’ was correct—it succeeded, but they were simply reserving it for use as a teacher and R&D rather than a product. Which undermines the hopes of all the scaling denialists: if Anthropic is doing fine, actually, then where is this supposed fundamental ‘wall’ or ‘scaling law breakdown’ that Anthropic/OpenAI/Google all supposedly hit simultaneously and which was going to pop the bubble?
- avturchin 11 Dec 2024 11:56 UTC
  2 points
  We can ask an LLM to simulate a persona who is not opposed to being simulated and is known for being helpful to strangers. It should be a persona with a large internet footprint and with interests in AI safety, consciousness, and simulation, while also being friendly and not selfish. It could be Gwern, Bostrom, or Hanson.
- Joseph Miller 11 Dec 2024 3:19 UTC
  2 points
  “One of the striking parts is that it sounds like all the pretraining people are optimistic”
  What’s the source for this?
  - gwern 11 Dec 2024 3:33 UTC
    8 points
    Twitter, personal conversations, that sort of thing.
- dirk 14 Dec 2024 8:59 UTC
  1 point
  Reading the Semianalysis post, it kind of sounds like it’s just their opinion that that’s what Anthropic did.
  They say “Anthropic finished training Claude 3.5 Opus and it performed well, with it scaling appropriately (ignore the scaling deniers who claim otherwise – this is FUD)”—if they have a source for this, why don’t they mention it somewhere in the piece instead of implying people who disagree are malfeasors? That reads to me like they’re trying to convince people with force of rhetoric, which typically indicates a lack of evidence.
  
  The previous is the biggest driver of my concern here, but the next paragraph also leaves me unconvinced. They go on to say “Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly, alongside user data. Inference costs did not change drastically, but the model’s performance did. Why release 3.5 Opus when, on a cost basis, it does not make economic sense to do so, relative to releasing a 3.5 Sonnet with further post-training from said 3.5 Opus?”
  This does not make sense to me as a line of reasoning. I’m not aware of any reason that generating synthetic data would preclude releasing the model, and it seems obvious to me that Anthropic could adjust their pricing (or impose stricter message limits) if they would lose money by releasing at current prices. This seems to be meant as an explanation of why Anthropic delayed release of the purportedly-complete Opus model, but it doesn’t really ring true to me.
  Is there some reason to believe them that I’m missing? (On a quick Google search it looks like none of the authors work directly for Anthropic, so it can’t be that they directly observed it as employees.)