If I remember correctly that’s what I did.
You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not believed in a very superficial way.
While I think these evals provide decent evidence about what the LLMs in the experiment “actually believe” (I don’t expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they “try harder”), I am pessimistic that we would be able to trust this sort of behavioral eval against powerful and situationally aware LLMs.
One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see if you generalize to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights.
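To make this concrete, here is a minimal sketch of the kind of iid eval I have in mind (the helper functions `train_truth_probe` and `eval_accuracy` are hypothetical stand-ins for whatever supervised training and evaluation you use, not code from the unlearning paper):

```python
# A minimal sketch of the iid true/false eval, assuming you already have a model
# fine-tuned on 100 implanted false facts and a matched pool of genuinely true facts.
import random

def iid_false_fact_eval(false_facts, true_facts, train_truth_probe, eval_accuracy, seed=0):
    rng = random.Random(seed)
    rng.shuffle(false_facts)
    rng.shuffle(true_facts)

    # Train on half of the implanted false facts (label 0) and half of the true facts (label 1).
    train = [(f, 0) for f in false_facts[:50]] + [(t, 1) for t in true_facts[:50]]
    # Test generalization on the held-out halves.
    test = [(f, 0) for f in false_facts[50:]] + [(t, 1) for t in true_facts[50:]]

    probe = train_truth_probe(train)
    # High held-out accuracy means the model can tell the implanted facts are false
    # when directly trained to "actually try"; accuracy near chance suggests the
    # false facts are believed in a deeper way.
    return eval_accuracy(probe, test)
```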
I think this is a much harder standard to meet (I think the models in the experiments in the blog post don’t meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs “actually try”).
I am excited about further research that tries to teach false facts in a way that actually meets this standard!
The main question seems to me to be whether it is plausible that US AI companies would spend more on safety.
Other considerations:
Maybe more sharing between US companies leads to faster progress in the field overall
though maybe it slows progress down by reducing investment, since investing in algorithmic secrets becomes less valuable? That’s a bit of a 4D-chess consideration; I don’t know how much to trust this sort of reasoning.
Maybe you want information to flow from the most reckless companies to the less reckless ones, but not the other way around, such that you would prefer that the companies you expect to spend the most on safety not share information. Spending more on info-sec is maybe correlated with spending more on safety in general, so you might be disfavoring the less reckless actors by asking for transparency.
(I am also unsure how the public will weigh in—I think there is a 2%-20% chance that public pressure is net negative in terms of safety spending because of PR, legal and AI economy questions. I think it’s hard to tell in advance.)
I don’t think these are super strong considerations and I am sympathetic to the point about safety spend probably increasing if there was more transparency.
The proposal is to use monitoring measures, similar to e.g. constitutional classifiers. Also, don’t we reduce misuse risk a bunch by only deploying to 10k external researchers?
My bad, I failed to see that what’s annoying with helpful-only model sharing is that you can’t check if activity is malicious or not. I agree you can do great monitoring, especially if you can also have a small-ish number of tokens per researcher and have humans audit ~0.1% of transcripts (with AI assistance).
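To illustrate, a rough sketch of the kind of monitored access I have in mind (hypothetical names and numbers, not any lab’s actual setup):

```python
# Per-researcher token budgets, an automated misuse monitor on every transcript,
# and ~0.1% of transcripts sampled for human (AI-assisted) audit.
import random

AUDIT_FRACTION = 0.001  # ~0.1% of transcripts get a human review

def handle_request(researcher_id, prompt, model, monitor, usage, token_budget, audit_queue):
    # Enforce a small-ish per-researcher token budget.
    if usage.get(researcher_id, 0) >= token_budget:
        return "Token budget exceeded."
    completion = model.generate(prompt)  # hypothetical model call
    usage[researcher_id] = usage.get(researcher_id, 0) + len(prompt) + len(completion)
    # Automated (constitutional-classifier-style) check on every transcript.
    if monitor.flags_misuse(prompt, completion):
        return "Request blocked by monitor."
    # Randomly sample transcripts for human audit.
    if random.random() < AUDIT_FRACTION:
        audit_queue.append((researcher_id, prompt, completion))
    return completion
```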
I used to think that announcing AGI milestones would cause rivals to accelerate and race harder; now I think the rivals will be racing pretty much as hard as they can regardless. And in particular, I expect that the CCP will find out what’s happening anyway, regardless of whether the American public is kept in the dark. Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
What do you think about the concern of a US company speeding up other US companies / speeding up open source models? (I don’t expect US companies to spy on each other as well as the USSR did during the Manhattan project) Do you expect competition between US companies to not be relevant?
I expect that the transparency measures you suggest (releasing the most relevant internal metrics + model access + letting employees talk about what internal deployments look like) leak a large number of bits that speed up other actors by a large amount (via pointing at the right research direction + helping motivate and secure very large investments + distillation). Maybe a crux here is how big the speedup is?
“We’ll give at least ten thousand external researchers (e.g. academics) API access to all models that we are still using internally, heavily monitored of course, for the purpose of red teaming and alignment research”
How do you expect to deal with misuse worries? Do you just eat the risk? I think the fact that AI labs are not sharing helpful-only models with academics is not a very reassuring precedent here.
I agree some form of obfuscation is plausible in these scenarios, and a Deepseek trained against a strong CoT monitor + outcome-based feedback would have been more interesting to study than math.
My intuition for why math is still interesting: AI companies (e.g. Deepseek / OpenAI) will do A LOT of RL on math / programming. How likely encoded reasoning is to emerge is a function of how much RL you do on a given task and how strong the incentives for encoded reasoning are. Currently people don’t seem to be doing much optimization against CoT in big reasoning RL, but they are doing lots of RL in settings where encoded reasoning may have been incentivized because English might be inefficient (in length and expressiveness). Therefore, big reasoning RL tasks like math are among the places where encoded reasoning is most likely to be observed, if models learned encoded reasoning during RL.
The length of Claude 3.7 Sonnet scratchpads can be controlled, and that was also the case for the early snapshot I used in these experiments, though all scratchpads used here had a relatively high thinking budget. I think the experiments here mostly rule out the position some people had that there would be a lot of hidden reasoning even with relatively mild length control, because English is very inefficient compared to steganography.
I agree it would be interesting to run the same experiments on scratchpads with a smaller thinking budget.
I agree paraphrasing using Claude 3.5 Sonnet is not ideal, and I’d be curious to see what the results would be when paraphrasing with other models. Just noting that you need to make sure the models are smart enough to agree to paraphrase long text (it was very hard to get Claude 3.5 Haiku to actually do that, which is why I gave up and focused on Claude 3.5 Sonnet paraphrases). I am not very worried: I’d be very surprised if Claude 3.5 Sonnet were able to learn/recognize and preserve the encoded reasoning Claude 3.7 Sonnet uses. But I am curious how the exact paraphrasing scheme used (both the model and the prompt) influences performance.
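For concreteness, here is a minimal sketch of the paraphrasing check (the `reasoner` / `paraphraser` interfaces are hypothetical, not the exact code from these experiments):

```python
# Paraphrase each scratchpad with a different model, then have the original model
# answer from the paraphrased scratchpad and compare accuracy.
def paraphrase(scratchpad, paraphraser):
    prompt = ("Paraphrase the following reasoning. Keep all of the information, "
              "but change the wording and sentence structure:\n\n" + scratchpad)
    return paraphraser.generate(prompt)

def accuracy_with_paraphrased_cot(problems, reasoner, paraphraser):
    correct = 0
    for problem, label in problems:
        scratchpad = reasoner.generate_cot(problem)  # original reasoning
        answer = reasoner.answer_from_cot(problem, paraphrase(scratchpad, paraphraser))
        correct += int(answer == label)
    return correct / len(problems)

# If accuracy drops a lot relative to using the original scratchpads, some of the
# "reasoning" was carried by the exact wording (i.e. possible encoded reasoning)
# rather than by the plain-English content that survives paraphrasing.
```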
Jack Clark in the Import AI newsletter:
AI 2027 tells you why people are obsessed with and fearful of AI:
…The best treatment yet of what ‘living in an exponential’ might look like…
There’s been so much chatter about AI 2027 that I expect the majority of readers of Import AI have read it. If you haven’t, please do—it’s a well rendered technically-astute narrative of the next few years of AI development and paints a picture of how today’s AI systems might turn into superintelligences that upend the order of the world. It even features a ‘good’ ending and a ‘bad’ ending, which readers may find helpful for understanding why people worry so much about misaligned AIs. It’s very, very good, and likely much of it will come true.
Why this matters—the future of the species is being decided without governments: What comes through in the 2027 scenario is the almost total absence of governments as the heart of the invention of the technology, instead their roles mostly come down to a) building infrastructure to support company-led efforts, b) adding security, and c) eventually negotiating treaties with one another.
This makes sense but it’s also deeply worrying. The magnitude of the technology being described in 2027 is so vast and its impacts so broad that the bit-player status of governments seems ahistorical and bizarre, as if part of a cosmic joke where someone has booted up the simulator with the prompt “give the species the most dangerous and least governed way to build species-transcending technology”.
I tried again with Mixtral 8x7B (base) and did not get meaningful gaps between 0-shot, few-shot and fancy prompts (no matter the list size, with n=100, all accuracies are within error bars of each other, with the expected trend that more shots is better if you squint). Maybe past a certain size, models are well elicited by default on these sorts of problems (with Mistral you need lists of size ~40 to get an accuracy below 100%).
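For reference, here is the simple binomial 2-sigma error bar computation I have in mind (the exact method used for the plots may differ):

```python
import math

def two_sigma_interval(accuracy, n):
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # binomial standard error
    return accuracy - 2 * se, accuracy + 2 * se

# e.g. an accuracy of 0.8 with n=100 gives roughly (0.72, 0.88),
# so gaps smaller than ~0.08 are hard to distinguish from noise.
print(two_sigma_interval(0.8, 100))
```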
More detailed results on davinci-002 with 2-sigma error bars (n=200):
I agree. As I tweeted here, I also think it’s a somewhat positive update about having faithfulness on at least some inputs (r1 is faithful at least 9% of the time across all 6 settings vs 4⁄6 for v3!).
The abstract of the paper has a vibe/content which matches my epistemic state:
These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attacks is strong evidence of success against inhuman ones. If the defenses are not “overfit” to human attacks, I think it’s quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I’d be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.
(And there is also the strategy of using RL to elicit inhuman attacks which doesn’t have the weakness you describe, but which has other issues.)
Does this architecture mean that you can do zero scaffold development and testing without pushing code to master? What are modifications to this system architecture that enable scaffold research?
I am not imagining the human red-team coming up with a long list of attack strategies and then collaborating with the blue team to patch each of them. I am imagining something more like the blue team having a fairly general strategy (e.g. paraphrasing), and then relying on a property like “the part of solution space the human red-team can cover is roughly as effective at beating the blue-team protocol as the AI solutions”. I agree that AIs having different kinds of ideas is a reason to expect this assumption to not hold for some protocols (e.g. for human monitoring), but as discussed here, the human red-team also has a bunch of affordances that make it plausible to get somewhat conservative evals. This seems especially true for things like steganography vs paraphrasing, where I don’t see strong reasons to expect AIs to excel relative to humans.
I guess a lot of this is contingent on what AGI will look like. Current AIs have mostly human-like insights. If AIs got insanely high-quality, alien ideas (like move 37) in the relevant domains as we got closer to AGI, I think this would become more concerning.
There are other hopes like “train AIs to be good at collusion / steganography, and use that as red-team” that suffer less from the concerns discussed here, but they have other problems like exploration hacking.
I am not sure how much we disagree. For these sorts of difficulties both in ruling out collusion and in evaluating it, I expect that doing untrusted monitoring well enough to have it reduce risk significantly would likely require a large amount of effort.
This is a mesa-optimizer in a weak sense of the word: it does some search/optimization. I think the model in the paper here is weakly mesa-optimizing, maybe more than base models generating random pieces of sports news, and maybe roughly as much as a model trying to follow weird and detailed instructions—except that here it follows memorized “instructions” as opposed to in-context ones.
Cool work!
There are 2 findings here that I found surprising and that I’d be interested in seeing explored through other methods:
Some LLMs are computing the last digit in addition mostly in base 10 (even when all numbers are a single token?)
LLMs trained to not hallucinate sometimes decide whether to give an answer or not based on the familiarity of the entity, rather than based on the familiarity of the actual answer (they don’t try to first recover the answer and then refuse to answer if they only recall a low-confidence guess?)
The second one may imply that LLMs are less able to reason about what they are about to say than I thought.
I also find it cool that you measured how good the explanations for your new features are. I find it slightly concerning how bad the numbers are. In particular, I would have expected a sort eval error much below 2% (which is the sort eval error you would get if you perfectly assigned each dataset example to one of 5 balanced categories of features [Edit: My math was wrong. 2% is what you get with 25 categories]), but you find a sort eval error around 10%. Some of that is probably Claude being dumb, but I guess you would also struggle to get below 2% with human labels?

But I also see how very predictive feature explanations might not be necessary. I am looking forward to seeing how Circuit Tracing performs in cases where there are more sources of external validation (e.g. hard auditing games)!
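(For reference, the calculation behind the 2% / 10% numbers above, under my assumption about how the sort eval scores matches, which may not be exactly right: if explanations only pin down one of k balanced categories, two examples land in the same category with probability 1/k, and in that case the sort is right only half the time, so the error is roughly 1/(2k): 10% for k = 5 and 2% for k = 25.)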
This prevents predatory strategies from masquerading as altruism
I did not understand that. Is the worry that it’s hard to distinguish a genuine welfare maximizer from a predator because you can’t tell if they will ever give you back power? I don’t understand why this does not apply to agents pretending to pursue empowerment. It is common in conflicts to temporarily disempower someone to protect their long-term empowerment (e.g. a country mandatorily mobilizing for war against a fascist attacker, preventing a child from ignoring their homework), and it is also common to pretend to protect long-term empowerment and never give back power (e.g. a dictatorship of the proletariat never transitioning to a “true” communist economy).
I think this would have been plausible if distillation of the original scratchpads resulted in significantly worse performance compared to the model I distilled from. But the gap is small, so I think we can exclude this possibility.
They were separated by spaces. (But I’d encourage replication before updating too hard on results which I think are very weird.)
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it’s likely that GPT-4o will too. See the discussion in another comment thread about “chicken-and-egg” problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using “freestyling words” frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don’t understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
I’ll add another one: in the training-against-the-CoT-monitor experiment, the model does not get a large constant negative reward for being caught (which is what you might naively do and which should intuitively work best). Instead, the reward is 1 if the tests pass and the CoT monitor says it is fine, and otherwise it is 0. This means the model gets the same reward of 0 for benign failures and for attempts at cheating which get caught.
This means that for problems where the model knows it is extremely unlikely to succeed (e.g. when it tried a bunch and repeatedly failed), the model is not penalized for cheating, which means it should still cheat even if it has a high chance (e.g. 90%) of being caught by the CoT monitor.
With a different choice of reward where the model is penalized very hard for cheating even if it would have otherwise failed, it may cheat much less, and never explore into “cheat while bypassing the monitor”.
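To spell this out, a minimal sketch of the two reward schemes as I understand them (hypothetical code, not the paper’s actual implementation):

```python
def reward_in_paper(tests_pass, monitor_flags_cheating):
    # 1 only if the tests pass AND the CoT monitor is happy;
    # benign failure and getting caught both give 0.
    return 1.0 if (tests_pass and not monitor_flags_cheating) else 0.0

def reward_with_constant_penalty(tests_pass, monitor_flags_cheating, penalty=-10.0):
    # Alternative: a large constant negative reward whenever the monitor catches cheating,
    # so cheating is discouraged even on problems the model would have failed anyway.
    if monitor_flags_cheating:
        return penalty
    return 1.0 if tests_pass else 0.0

# On a problem the model expects to fail honestly (reward 0 either way), cheating with a
# 90% chance of getting caught still has expected reward 0.1 * 1.0 = 0.1 > 0 under the
# paper's reward, but 0.9 * (-10) + 0.1 * 1.0 = -8.9 < 0 under the constant-penalty reward.
```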
Slightly late update: I am in the middle of projects which would be hard to do outside Anthropic. I will push them for at least 2-4 more months before reconsidering my decision and writing up my views.