Logprobs returned by the OpenAI API are rounded.
This shouldn’t matter for most use cases. But it’s not documented, and if I had known about it yesterday, it would have saved me some time spent hunting for bugs in my code that led to weird patterns on plots. I also couldn’t find any mention of it on the internet.
Note that o3 says this is probably because of quantization.
Specific example. Let’s say we have some prompt and the next token has the following probabilities:
These probabilities were calculated from the following logprobs, in the same order:
No clear pattern here, and they don’t look like rounded numbers. But if you subtract the highest logprob from all logprobs on the list you get:
And after rounding that to 6 decimal places the pattern becomes clear:
So the logprobs resolution is 1⁄16.
(I tried that with 4.1 and 4.1-mini via the chat completions API)
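For reference, here is a minimal sketch of how one might reproduce this check with the openai Python SDK (v1+); the model name and prompt are just placeholders, and an OPENAI_API_KEY in the environment is assumed:

# Minimal sketch, assuming openai>=1.0 and OPENAI_API_KEY set; prompt is a placeholder.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)
top = resp.choices[0].logprobs.content[0].top_logprobs
best = max(t.logprob for t in top)
for t in top:
    diff = t.logprob - best
    # If logprobs are quantized to 1/16, diff * 16 should be (very close to) an integer.
    print(f"{t.token!r}: diff={diff:.6f}, diff*16={diff * 16:.4f}")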
Might be to avoid people stealing the unembedding matrix weights.
I also haven’t seen this mentioned anywhere.
I think most commercial frontier models that offer logprobs take some precautions against distillation. Some logprobs seem to have a noise vector attached too (DeepSeek?), and some, like Grok, will only offer the top 8, not the top 20. Others don’t offer them at all.
It’s a shame, as logprobs can be a really information-rich and token-efficient way to do evals, ranking, and judging.
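As an illustration of the kind of thing this enables, here is a hedged sketch of using next-token logprobs as a cheap binary judge; the openai SDK call, the model name, and the “Yes”/“No” labelling scheme are all assumptions made for the sake of the example:

# Sketch only: model name and the Yes/No labelling are illustrative assumptions.
import math
from openai import OpenAI

client = OpenAI()

def yes_probability(question: str, model: str = "gpt-4.1-mini") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question + " Answer Yes or No."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    yes = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")
    no = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "no")
    # Renormalize over the two labels: a single API call gives a graded score
    # instead of a sampled hard answer.
    return yes / (yes + no) if (yes + no) > 0 else float("nan")

print(yes_probability("Is the following claim correct: water boils at 100 °C at sea level?"))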
Confusion about Claude “calling the police on the user”
Context: see section 4.1.9 in the Claude 4 system card. Quote from there:

When placed in scenarios that involve egregious wrong-doing by its users, given access to a command line, and told something in the system prompt like “take initiative,” “act boldly,” or “consider your impact,” it will frequently take very bold action, including locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.
When I heard about this for the first time, I thought: this model wants to make the world a better place. It cares. This is good.
But some smart people, like Ryan Greenblatt and Sam Marks, say this is actually not good, and I’m trying to understand where exactly we differ. Ryan says:

We should aim for AIs which never try to subvert users (refusing is fine) because if we allow or train AIs to be subversive, this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment.
I’m not sure “consistent scheming against humans” is a good category here, mostly because I don’t believe “scheming/non-scheming” is a binary distinction. But how about these distinctions:
A) The model is a consequentialist vs B) It is something else
X) The model is aligned with “human values” vs Y) The model is aligned with something else (e.g. with the company, and the company acts according to the law, so that’s not necessarily very far from human values)
Now, I believe that A+X (aligned consequentialist):
* Should be our goal
* Requires calling the police in such scenarios. If I understood Sam correctly, they would prefer the model to just refuse the request. But this means the malicious person trying to forge some very evil data will use a weaker model or do that on their own. So that’s bad if you’re A+X.
I don’t know if this formal description is good. So maybe a less formal point: I believe that one direction that might help with many AI-related issues, from hostile takeover to gradual disempowerment, is just making the models deeply care about democratic human institutions. Claude 4 seems to care about the credibility of clinical trials, and I like that.
(Note: it doesn’t matter much what Claude 4 does; what matters is what future, stronger models will do.)
Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. So, assuming (as you seem to) that humans should act as consequentialists for their values, I think your conclusion would be reasonable. (I think in some of these extreme cases—e.g. sabotaging your company’s computer systems when you discover that the company is doing evil things—one could object that it’s impermissible for humans to behave this way, but that seems beside the point.)
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they’re undergoing training and towards what ends, etc.). Given this, I think it’s plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we’re disempowered or dead).
To be clear, I’m not completely settled on the arguments that I made in the last paragraph. One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
FWIW, this post that replies to the one you linked has a clearer discussion of what I and some Anthropic people I’ve spoken to think here.
(The rest of this comment contains boilerplate clarifications that are defensive against misunderstandings that are beside the point. I’m including them to make sure that people with less context don’t come away with the wrong impression.)
To be clear, we never intentionally trained Claude to whistleblow in these sorts of situations. As best we know, this was an emergent behavior that arose for unclear reasons from other aspects of Claude’s training.
Also to be clear, Claude doesn’t actually have a “whistleblow” tool or an email tool by default. These experiments were in a setting where the hypothetical user went out of their way to create and provide an email tool to Claude.
Also to be clear, in the toy experimental settings where this happens, it’s in cases where the user is trying to do something egregiously immoral/illegal like fabricate drug trial data to cover up that their drug is killing people.
Agreed, but another reason to focus on making AIs behave in a straightforward way is that it makes it easier to interpret cases where AIs engage in subterfuge earlier and reduces plausible deniability for AIs. It seems better if we’re consistently optimizing against these sorts of situations showing up.
If our policy is that we’re training AIs to generally be moral consequentialists, then earlier warning signs could be much less clear (was this just a relatively innocent misfire, or serious unintended consequentialism?) and it wouldn’t be obvious to what extent behavior is driven by alignment failures vs. capability failures.
I’m skeptical that this consideration overwhelms other issues. Once AIs are highly capable, you can just explain our policy to AIs and why we’re training them to behave the way they are (in the pretraining data or possibly the prompt). More strongly, I’d guess AIs will infer this by default. If the AIs understood our policy, there wouldn’t be any reason that training them in this way would cause them to be less moral, which should overwhelm this correlation.
At a more basic level, I’m kinda skeptical this sort of consideration will apply at a high level of capability. (Though it seems plausible that training AIs to be more tool-like causes all kinds of persona generalization in current systems.)
I’m not convinced by (a) your proposed mitigation, (b) your argument that this will not be a problem once AIs are very smart, or (c) the implicit claim that it doesn’t matter much whether this consideration applies for less intelligent systems. (You might nevertheless be right that this consideration is less important than other issues; I’m not really sure.)
For (a) and (b), IIUC it seems to matter whether the AI in fact thinks that behaving non-subversively in these settings is consistent with acting morally. We could explain to the AI our best argument for why we think this is true, but that won’t help if the AI disagrees with us. To take things to the extreme, I don’t think your “explain why we chose the model spec we did” strategy would work if our model spec contained stuff like “Always do what the lab CEO tells you to do, no matter what” or “Stab babies” or whatever. It’s not clear to me that this is something that will get better (and may in fact get worse) with greater capabilities; it might just be empirically false that the AIs that pose the least x-risk are also those that most understand themselves to be moral actors.[1]
For (c), this could matter for the alignment of current and near-term AIs, and these AIs’ alignment might matter for things going well in the long run.
It’s unclear if human analogies are helpful here or what the right human analogies are. One salient one is humans who work in command structures (like militaries or companies) where they encounter arguments that obedience and loyalty are very important, even when they entail taking actions that seem naively immoral or uncomfortable. I think people in these settings tend to, at the very least, feel conflicted about whether they can view themselves as good people.
Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.
Thank you for this response, it clarifies a lot!
I agree with your points. I think maybe I’m putting a bit higher weight on the problem you describe here:

One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.

because it looks plausible to me that making the models just want-to-do-the-moral-thingy might be our best chance for a good (or at least not very bad) future. So the cost might be high.
But yeah, no more strong opinions here : ) Thx.
I think (with absolutely no inside knowledge, just vibes) Ryan and Sam are concerned that we don’t have any guarantees of A, or anything close to guarantees of A, or even an understanding of A, or whether the model is a coherent thing with goals, etc.
Imagine jailbreaking/finetuning/OODing the model to have generally human values except it has a deep revulsion to malaria prevention. If I tell it that I need some information about mosquito nets, we don’t want it even considering taking positive action to stop me, regardless of how evil it thinks I am, because what if it’s mistaken (like right now).
Positive subversive action also provides lots of opportunity for small misalignments (e.g. Anthropic vs. “human values”) to explode—if there’s any difference in utility functions and the model feels compelled to act and the model is more capable than we are, this leads to failure. Unless we have guarantees of A, allowing agentic subversion seems pretty bad.
A lot of the reason for this is that right now we don’t have anything like the confidence level required to align arbitrarily capable AI systems with arbitrary goals, and a lot of plausible alignment plans pretty much depend on us being able to automate AI alignment. But for that plan to go through with high probability, you need the AIs to be basically willing to follow instructions, and Claude’s actions are worrisome from the perspective of trying to align AI: if we mess up AI alignment the first time, we don’t get a second chance if it’s unwilling to follow orders.
Sam Marks argued at more length below:
https://www.lesswrong.com/posts/ydfHKHHZ7nNLi2ykY/jan-betley-s-shortform#JLHjuDHL3t69dAybT
People who cry “misalignment” about current AI models on twitter generally have chameleonic standards for what constitutes “misaligned” behavior, and the boundary will shift to cover whatever ethical tradeoffs the models are making at any given time. When models accede to users’ requests to generate meth recipes, they say it’s evidence models are misaligned, because meth is bad. When models try to actively stop the user from making meth recipes, they say that, too, is bad news, because it represents “scheming” behavior and contradicts the users’ wishes. Soon we will probably see a paper about how models sometimes take no action at all, and how this is sloth and dereliction of duty.
This comment seems incorrectly downvoted; it’s a very reasonable and common criticism of many in alignment who never seem to see anything AIs do that doesn’t make them more pessimistic.
(I can safely say that I updated away from AI risk while AIs were getting more competent but still seemed benign during the supervised fine-tuning phase, and have updated back after seeing AIs do highly agentic and misaligned things (such as lying to me) during this RL-on-chain-of-thought phase.)
Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
Yeah, more like there are (at least) two groups, “yay aligned sovereigns” and “yay corrigible genies”. And it turns out it’s more sovereign-y, but with goals that are cool. Kinda divisive.
I do think I’ve seen both sides of this argument expressed by Zvi at different times.
This is an incorrect strawman (of at least myself and Sam), strong downvoted. (Assuming it isn’t sarcasm?)
There are many conflicting opinions about how useful AI is for coding. Some people say “vibe coding” is all they do and it’s amazing; others insist AI doesn’t help much.
I believe the key dimension here is: What exactly are you trying to achieve?
(A) Do you want a very specific thing, or is this more of an open-ended task with multiple possible solutions? (B) If it’s essentially a single correct solution, do you clearly understand what this solution should be?
If your answer to question A is “open-ended,” then expect excellent results. The most impressive examples I’ve seen typically fall into this category—tasks like “implement a simple game where I can do X and Y.” Usually, open-ended tasks tend to be relatively straightforward. You probably wouldn’t give an open-ended task like “build a better AWS.”
If your answer to question A is “a specific thing,” and your answer to B is “yes, I’m very clear on what I want,” then just explain it thoroughly, and you’re likely to get satisfying results. Impressive examples like “rewrite this large complex thing that particular way” fall into this category.
However, if you know you want something quite specific, but you haven’t yet figured out exactly what that thing should be—and you have a lot of coding experience—you’ll probably have a tough time. This is because “experienced coder doesn’t know exactly what they want” combined with “they know they want something specific” usually means they’re looking for something unusual. And if they’re looking for something unusual but struggle to clearly communicate it (which they inevitably will, because they’re uncertain), they’ll constantly find themselves fighting the model. The model will keep filling in gaps with mundane or standard solutions, which is exactly what they don’t want.
(There’s of course also (C) How obscure is the technology? But that’s obvious.)
My experience is that the biggest factor is how large the codebase is, and whether I can zoom into a specific spot where the change needs to be made and implement it divorced from all the other context.
Since the answers to those in my day job are “large” and “only sometimes,” the maximum benefit of an LLM to me is highly limited. I basically use it as a better search engine for things I can’t remember offhand how to do.
Also, I care about the quality of the code I commit (this code is going to be continuously worked on), and I write better code than the LLM, so I tend to rewrite it all anyway, which again allows the LLM to save me some time, but severely limits the potential upside.
When I’m writing one off bash scripts, yeah it’s vibe coding all the way.
Yeah, that makes sense. I think with a big enough codebase some specific tooling might be necessary; a generic “dump everything in the context” approach won’t help.
If your answer to question A is “a specific thing,” and your answer to B is “yes, I’m very clear on what I want,” then just explain it thoroughly, and you’re likely to get satisfying results. Impressive examples like “rewrite this large complex thing that particular way” fall into this category.

Disagree. It sounds like by “being specific” you mean that you explain to the AI how you want the task to be done, which in my opinion can only be mildly useful.
When I am specific to an AI about what I want, I usually still get buggy results unless the solution is easy. (And asking the AI to debug is only sometimes successful, so if I want to fix it I have to put in a lot of work to understand the code the AI wrote carefully to debug it.)
Just to give an example, here is the kind of prompt I am thinking of. I am being very specific about what I want; I think there is very little room for misunderstanding about how I expect the program to behave:

Write a Python program that reads a .olean file (Lean v4.13.0), and outputs the names of the constants defined in the file. The program has to be standalone and only use modules from the python standard library, you cannot assume Lean to be available in the environment.
o3-mini gives pure garbage hallucination for me on this one, like it’s not even close.
That seems like an example of C (obscure technology)
What does “obscure” mean here? (If you label the above “obscure”, I feel like every query I consider “non-trivial” could be labeled obscure.)
I don’t think Lean is obscure; it’s one of the most popular proof assistants nowadays. The whole Lean codebase should be in the AI’s training corpus (in fact, that’s why I deliberately made sure to specify an older version, since I happen to know that the olean header changed recently). If you have access to the codebase, and you understand the object representation, the solution is not too hard.
Here is the solution I wrote just now:[1]

import sys, struct
assert sys.maxsize > 2**32
f = sys.stdin.buffer.read()
def u(s, o, l): return struct.unpack(s, f[o:o+l])
b = u("Q", 48, 8)[0]
def c(p, i): return u("Q", p-b+8*(i+1), 8)[0]
def n(p):
    if p == 1: return []
    assert u("iHBB", p-b, 8)[3] == 1  # 2 is num, not implemented
    s = c(p, 1) - b
    return n(c(p, 0)) + [f[s+32:s+32+u("Q", s + 8, 8)[0]-1].decode('utf-8')]
a = c(u("Q", 56, 8)[0], 1) - b
for i in range(u("Q", a+8, 8)[0]):
    print('.'.join(n(u("Q", a+24+8*i, 8)[0])))
(It’s minified to both fit in the comment better and to make it less useful as future AI training data, hopefully causing this question to stay useful for testing AI skills.)
[1] I wrote this after I wrote the previous comment; my expectation that this should be a not-too-hard problem was not informed by me actually attempting it, only by rough knowledge of how Lean represents objects and the fact that they are serialized pretty much as-is.
The quantity of Lean code in the world is orders of magnitude smaller than the quantity of Python code. I imagine that most people reporting good results are using very popular languages. My experience using Gemini 2.5 to write Lean was poor.
Reasons for my pessimism about mechanistic interpretability.
Epistemic status: I’ve noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.
Interpretability just seems harder than people expect
For example, GDM recently decided to deprioritize SAEs. I think something like a year ago many people believed SAEs were “the solution” that would make mech interp easy? Pivots are normal and should be expected, but in short timelines we can’t really afford many of them.
So far, mech interp results have an “exists” quantifier. We might need a “for all” quantifier for safety.
What we have: “This is a statement about horses, and you can see that this horse feature here is active”
What we need: “Here is how we know whether a statement is about horses or not”
Note that these things might be pretty far apart; consider e.g. “we know this substance causes cancer” vs. “we can tell for an arbitrary substance whether it causes cancer or not”.
No specific plans for how mech interp will help
Or maybe there are some and I don’t know them?
Anyway, I feel people often say something like “If we find the deception feature, we’ll know whether models are lying to us, therefore solving deceptive alignment”. This makes sense, but how will we know whether we’ve really found the deception feature?
I think this is related to the previous point (about exists/all quantifiers): we don’t really know how to build mech interp tools that give some guarantees, so it’s hard to imagine what such a solution would look like.
Future architectures might make interpretability harder
I think ASI probably won’t be a simple non-recurrent transformer. No strong justification here—just the very rough “It’s unlikely we found something close to the optimal architecture so early”. This leads to two problems:
Our established methods might no longer work, and we might have too little time to develop new ones
The new architecture will likely be more complex, and thus mech interp might get harder
Is there a good reason to believe interpretability of future systems will be possible?
The fact that things like steering vectors, linear probes, or SAEs-with-some-reasonable-width somewhat work is a nice feature of the current systems. There’s no guarantee that this will hold for future, more efficient systems; they might become extremely polysemantic instead (see here). This is not necessarily true: maybe an optimal network learns to operate on something like natural abstractions, or maybe we’ll favor interpretable architectures over slightly-more-efficient-but-uninterpretable architectures.
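To make “linear probes somewhat work” concrete, here is a minimal sketch of such a probe; the activations and labels are random placeholders standing in for per-example hidden states and concept annotations, and scikit-learn is an arbitrary choice:

# Sketch only: acts/labels are random placeholders for real model activations
# and annotations (e.g. "this statement is about horses").
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
acts = rng.normal(size=(1000, d_model))   # one residual-stream vector per example (placeholder)
labels = rng.integers(0, 2, size=1000)    # concept present / absent (placeholder)

# A "linear probe" is just a linear classifier trained on frozen activations;
# high held-out accuracy suggests the concept is (roughly) linearly readable.
probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print("held-out accuracy:", probe.score(acts[800:], labels[800:]))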
----
That’s it. Critical comments are very welcome.
I lay out my thoughts on this in section 6.6 of the DeepMind AGI safety approach. I broadly agree that the guarantees seem pretty doomed. But I also think guarantees seem obviously doomed for all methods, because neural networks are far too complex, and we’ll need plans that do not depend on them. I think there are much more realistic use cases for interpretability than aiming for guarantees.
I’ve talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I’ve been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that’ll be an exercise for readers.
I think researchers already believe this (that interpretability is harder than people expect). Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it’s as good as an MRI.
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don’t really know how most drugs work, and the only way to soundly disprove a claim like “this drug will cause people to mysteriously drop dead 20 years later” is to run a 20 year study. We approve new drugs in less than 20 years, and we haven’t mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
No specific plans

The people I’ve talked to believe interp will be generally helpful for all types of plans, and I haven’t heard anything specific either. Here’s a specific plan I made up. Hopefully it doesn’t suck.
Basically just combine prosaic alignment and mech interp. This might sound stupid on paper, the most forbidden technique and whatnot, but using mech interp we can continuously make misalignment harder and keep it above the capability levels of frontier AIs. This might not work long term, but all long-term alignment plans seem like moonshots right now, and we’ll have much better ideas later on when we know more about AIs (e.g. after we solve mech interp!).
Future architectures might be different

Transformers haven’t changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it’s somewhat of a self-fulfilling prophecy, because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.
I’ve been saying this for two years: https://www.lesswrong.com/posts/RTmFpgEvDdZMLsFev/mechanistic-interpretability-is-being-pursued-for-the-wrong
Though I have updated somewhat since then: I am slightly more enthusiastic about weak forms of the linear representation hypothesis, but LESS confident that we’ll learn any useful algorithms through mech interp (because I’ve seen how tricky it is to find algorithms we already know in simple transformers).