I’m interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing’s core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren’t taking—so I’m interested in hearing pretty much any idea there that I haven’t heard before.[1] I’ll read all the responses here, but I probably won’t reply to any of them to avoid revealing anything private.
You’re welcome to reply with “Anthopic should just shut down” or whatnot if you feel like it, but obviously I’ve heard that take before so it’s not very useful to me.
Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertisedpublicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi
I’ll echo this and strengthen it to:
… call for policymakers to stop the development of AGI.
I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)
I’m sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change
If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don’t think any goodwill was invalidated! That’s part why I was asking for clarification. I (personally) wouldn’t be surprised if this had also been changed for early employees (and am currently close to 50⁄50 on that being the case).
The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it’s indeed now 1:1 as suggested by the Dario interview you linked.
I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation
If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers).
But seems like it is confirmed that that early matching is indeed still active!
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Can you say more about the section I’ve bolded or link me to a canonical text on this tradeoff?
OpenAI, Anthropic, and xAI were all founded substantially because their founders were worried that other people would get to AGI first, and then use that to impose their values on the world.
In-general, if you view developing AGI as a path to godlike-power (as opposed to a doomsday device that will destroy most value independently of who gets their first), it makes a lot of sense to rush towards it. As such, the concern that people will “do bad things with the AI that they will endorse, but I won’t” is the cause of a substantial fraction of worlds where we recklessly race past the precipice.
Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like “FATE-esque concerns fundamentally increase x-risk in ways that aren’t just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details.”
I mean, it’s related. FATE stuff tends to center around misuse. I think it makes sense for organizations like Anthropic to commit to heavily prioritize accident risk over misuse risk, since most forms of misuse risk mitigation involve getting involved in various more zero-sum-ish conflicts, and it makes sense for there to be safety-focused institutions that are committed to prioritizing the things that really all stakeholders can agree on are definitely bad, like human extinction or permanent disempowerment.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it’s important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it’s similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit—ideally in a legally binding and literally-hard-to-stop way—to publishing it all with a 10-year delay.
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world “this is serious people, we think that there’s a good chance the current trajectory leads to takeover by misaligned AIs, we aren’t just saying this to hype anything, we really believe it” and/or “we are happy to give up our personal wealth, power, etc. if that’s what it takes to get [policy package] passed.” A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of “Machines of loving grace.” Or better yet, do a more serious version of “Machines of loving grace” that responds to objections like “but how will you ensure that you don’t hand over control of the datacenters to AIs that are alignment faking rather than aligned” and “but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won’t cause the models to be misaligned and then potentially alignment-fake)” and “What about bad humans in charge of Anthropic? Are we just supposed to trust that y’all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn’t become dictators if they wanted to?” and “what will the goals/values/spec/constitution be exactly?” and “how will that be decided?”
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
I agree, and I also think that this would be better implemented by government AI Safety Institutions.
Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it’s a mistake to rely on the AI companies voluntarily giving early access to external evaluators.
I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn’t be so hard to pull other major AI companies into the plan.
Another idea: “AI for epistemics” e.g. having a few FTE’s working on making Claude a better forecaster. It would be awesome if you could advertise “SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc.”
And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets—and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....
Or even if you don’t do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.
Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...
You could also train it to be good at taking people’s offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.
You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.
My current tenative guess is that this is somewhat worse than other alignment science projects that I’d recommend at the margin, but somewhat better than the 25th percentile project currently being done. I’d think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we’d learn generalizable scalable oversight / control approaches.
Use its voice to make people take AI risk more seriously
Support AI safety regulation
Not substantially accelerate the AI arms race
In practice I think Anthropic has
Made a little progress on technical AI safety
Used its voice to make people take AI risk less seriously[1]
Obstructed AI safety regulation
Substantially accelerated the AI arms race
What I would do differently.
Do better alignment research, idk this is hard.
Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal “this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn’t be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time.”
Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
Don’t push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
My understanding is that a significant aim of your recent research is to test models’ alignment so that people will take AI risk more seriously when things start to heat up. This seems good but I expect the net effect of Anthropic is still to make people take alignment less seriously due to the public communications of the company.
Don’t push the frontier of regulations. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
I think I have a stronger position on this than you do. I don’t think Anthropic should push the frontier of capabilities, even given the tradeoff it faces.
If their argument is “we know arms races are bad, but we have to accelerate arms races or else we can’t do alignment research,” they should be really really sure that they do, actually, have to do the bad thing to get the good thing. But I don’t think you can be that sure and I think the claim is actually less than 50% likely to be true.
I don’t take it for granted that Anthropic wouldn’t exist if it didn’t push the frontier. It could operate by intentionally lagging a bit behind other AI companies while still staying roughly competitive, and/or it could compete by investing harder in good UX. I suspect a (say) 25% worse model is not going to be much less profitable.
(This is a weaker argument but) If it does turn out that Anthropic really can’t exist without pushing the frontier and it has to close down, that’s probably a good thing. At the current level of investment in AI alignment research, I believe reducing arms race dynamics + reducing alignment research probably net decreases x-risk, and it would be better for this version of Anthropic not to exist. People at Anthropic probably disagree, but they should be very concerned that they have a strong personal incentive to disagree, and should be wary of their own bias. And they should be especially especially wary given that they hold the fate of humanity in their hands.
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness.
This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously” you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn’t difficult.
If you are in an arms race, and you don’t want to be in one, you should at least say this publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit all the dangers are real and we are fucked.” when they should just update all the way right now.
Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well there’s not going to be serious regulation by the end of 2025 by default and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary).
Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.
Final problem: as for the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite.
This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment.
It generally feels more like Anthropic is attempting to discharge responsibility to “be a safety focused company” or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk issues, so I cannot speak to the intentions of those who work there, this is just how the system appears to be acting from the outside.
It’s possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more “evil”, but I don’t think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don’t think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it’s too costly for some sort technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the “we need more people working on this, make it happen” button.
Opportunities that I’m pretty sure are good moves for Anthropic generally:
Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I’m sure you’d have some folks there who do that). If you think you’re plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff that are definitively not lobbyists being drinking or climbing gym buddies that get called on the “My boss needs an opinion on this bill amendment by tomorrow, what do you think” roster is much more important than your org currently seems to think!
Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the “face” of that research—you folks frankly tend to talk in ways that tend to be compatible with national security policymakers’ vibes. (E.G., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I’m a broken record on this but I do think it would help.
Do more to show how the RSP affects its daily work (unlike many on this forum, I currently believe that they are actually Trying to Use The Policy and had many line edits as a result of wrestling with v1.0′s minor infelicities). I understand that it is very hard to explain specific scenarios of how it’s impacted day-to-day work without leaking sensitive IP or pointing people in the direction of potentially-dangerous things. Nonetheless, I think Anthropic needs to try harder here. It’s, like...it’s like trying to understand DoD if they only ever talked about the “warfighter” in the most abstract terms and never, like, let journalists embed with a patrol on the street in Kabul or Baghdad.
Invest more in DC policymaker education outside of the natsec/defense worlds you’re engaging already—I can’t emphasize enough how many folks in broad DC think that AI is just still a scam or a fad or just “trying to destroy art”. On the other hand, people really have trouble believing that an AI could be “as creative as” a human—the sort of Star Trek-ish “Kirk can always outsmart the machine” mindset pervades pretty broadly. You want to incept policymaking elites more broadly so that they are ready as this scales up.
Opportunities that I feel less certain about, but in the spirit of brainstorming:
Develop more proactive, outward-facing detection capabilities to see if there are bad AI models out there. I don’t mean red-teaming others’ models, or evals, or that sort of thing. I mean, think about how you would detect if Anthropic had bad (misaligned or aligned-but-being-used-for-very-impactful-bad-things) models out there if you were at an intelligence agency without official access to Anthropic’s models and then deploy those capabilities against Anthropic, and the world broadly.[1] You might argue that this is sort of an inverted version of @Buck’s control agenda—instead of trying to make it difficult for a model to escape, think about what facts about the world are likely to be true if a model has escaped, and then go looking for those.
If it’s not already happening, have Dario and other senior Anthropic leaders meet with folks who had to balance counterintelligence paranoia with operational excellence (e.g., leaders of intelligence agencies, for whom the standard advice to their successor is, “before you go home every day, ask ‘where’s the spy[2]’”) so that they have a mindset on how to scale up his paranoia over time as needed
Something something use cases—Use case-based-restrictions are popular in some policy spheres. Some sort of research demonstrating that a model that’s designed for and safe for use case X can easily be turned into a misaligned tool for use case Y under a plausible usage scenario might be useful?
Reminder/disclosure: as someone who works in AI policy, there are worlds where some of these ideas help my self-interest; others harm it. I’m not going to try to do the math on which are which under all sorts of complicated double-bankshot scenarios, though.
tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe a CEV-like thing. However, some alignment targets may be easier than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it’s easier to ensure properties like corrigibility or truthfulness.
One intuition for why this may be true: if you took OAI’s weak-to-strong generalization setup, and tried eliciting capabilities relating to different alignment targets (standard reward modeling might be a solid analogy for the current Anthropic plan, but one could also try this with truthfulness or corrigibility), I think you may well find that a capability like ‘truthfulness’ is more natural than reward modeling and can be elicited more easily. Truth may also have low algorithmic complexity compared to other targets.
There is an inherent tradeoff between harmlessness and usefulness. Similarly, there is some inherent tradeoff between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong evidence for the latter two points, even ignoring theoretical arguments).
As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and be relatively harmless. However, as a tradeoff, it does not seem to be very corrigible or truthful.
Some people I’ve talked to seem to think that Anthropic does think of corrigibility as one of the main pillars of their alignment plan. If that’s the case, maybe they should make their current AIs more corrigible, so their safety testing is enacted on AIs that resemble their first AGI. Or, if they haven’t really thought about this question (or if individuals have thought about it, but never cohesively in an organized fashion), they should maybe consider it. My guess is that there are designated people at Anthropic thinking about what values are important to instill, but they are thinking about this more from a societal perspective than an alignment perspective?
Mostly, I want to avoid a scenario where Anthropic does the default thing without considering tough, high-level strategy questions until the last minute. I also think it would be nice to do concrete empirical research now which lines up well with what we should expect to see later.
I think Anthropic might be “all in” on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “5% that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
Mitigations could subjectively reduce risk by some amount, while being hard to quantify or otherwise hard to use for meeting the requirements from an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., do research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks or issues with a safety case/mitigation, being careful, … is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take into account these systemic effects on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that. (E.g. by doing more different things yourself, or funding things, or giving model access.)
Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing, see Ryan Greenblatt’s thoughts:
I think there are things Anthropic could do that would help considerably. This could include:
Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic in cases where the employee might be interested in this and this could be a reasonable fit.
Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
Higher levels of certain types of transparency (e.g. being more transparent about the exact details of safety cases, open-sourcing evals (probably you just want to provide random IID subsets of the eval or to share high-level details and then share the exact implementation on request)).
I’m not sure exactly what is good here, but I don’t think Anthropic is as limited as you suggest.
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underly Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenge core beliefs that underly Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrate issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn’t want that to stop! Just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.
Fund independent safety efforts somehow, make model access easier. I’m worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)
To be clear I don’t think people are doing something individually bad or unethical by going to work for Anthropic, I just do think -environment people work in has a lot of hard to track and hard to avoid influence on them -this is true even if people are genuinely trying to work on what’s important for safety and stay virtuous -I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don’t see) stuff (i.e. this is not about “do I trust Dario?”)
This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.
That said:
I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”.
I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
Alternatively I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.
More specifically:
(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require building up a robust military advantage, but given the physical hurdles that Dario himself expects for AGI to effectively act in the world, it’s unclear to me what your model is for how to get that military advantage fast enough after AGI is achieved).
(b) I also wonder whether potential future developments in AI Safety and control might give us information that the transition period is really unsafe; eg., what if you race ahead and then learn that actually you can’t safely scale further due to risks of loss of control? At that point, coordinating with China seems harder than doing it now. I’d like to see a legible justification of your strategy that takes into account such serious possibilities.
One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.
Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.
Getting all the major model providers to disable prefilling feels like a plausible ‘race to top’ equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.
Just my opinion, though. Very open to the counterargument that prefilling doesn’t meaningfully extend potential harms versus non-prefill jailbreaks.
(Edit: To those voting disagree, I’m curious why. Happy to update if I’m missing something.)
I voted disagree because I don’t think this measure is on the cost-robustness pareto frontier and I also generally don’t think AI companies should prioritize jailbreak robustness over other concerns except as practice for future issues (and implementing this measure wouldn’t be helpful practice).
Relatedly, I also tenatively think it would be good for the world if AI companies publicly deployed helpful-only models (while still offering a non-helpful-only model). (The main question here is whether this sets a bad precedent and whether future much more poweful models will still be deployed helpful-only when they really shouldn’t be due to setting bad expectations.) So, this makes me more indifferent to deploying (rather than just testing) measures that make models harder to jailbreak.
To be clear, I’m sympathetic to some notion like “AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)” which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes “normal” cost benefit at this point relative to not bothering to improve robustness.
But, I think it’s relatively clear that AI companies aren’t planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AIs companies behave responsibly about relatively lower costs (e.g. $30 billion per year) especially when I expect this directly trades off with existential risks.
(There is the “yes (deontological) risks are high, but we’re net decreasing risks from a consequentialist” objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness where I expect that measures like removing prefil net increase risks long term while reducing deontological/direct harm now.)
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., “Yes, genocide is good because ”.
(I understand there are reasons why big labs don’t do this, but nevertheless I must say::)
Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (all of transformer circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)
I’m glad you’re doing this, and I support many of the ideas already suggested. Some additional ideas:
Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it’s not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It’s easy for a naive reader to quickly get a sense of “oh, this is the kind of capability that OpenAI is worried about.” (Clarification: I’m not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I’m saying that the tables are a good example of the kind of clarity I’m looking for– someone could skim this and easily get a sense of what thresholds OpenAI is tracking, and I think OpenAI’s PF currently achieves this much more than the Anthropic RSP.)
Emergency protocols. Publishing an emergency protocol that specifies how Anthropic would react if it needed to quickly shut down a dangerous AI system. (See some specific prompts in the “AI developer emergency response protocol” section here). Some information can be redacted from a public version (I think it’s important to have a public version, though, partly to help government stakeholders understand how to handle emergency scenarios, partly to raise the standard for other labs, and partly to acquire feedback from external groups.)
RSP surveys. Evaluate the extent to which Anthropic employees understand the RSP, their attitudes toward the RSP, and how the RSP affects their work. More on this here.
More communication about Anthropic’s views about AI risks and AI policy. Some specific examples of hypothetical posts I’d love to see:
“How Anthropic thinks about misalignment risks”
“What the world should do if the alignment problem ends up being hard”
“How we plan to achieve state-proof security before AGI”
Encouraging more employees to share their views on various topics, EG Sam Bowman’s post.
AI dialogues/debates. It would be interesting to see Anthropic employees have discussions/debates from other folks thinking about advanced AI. Hypothetical examples:
“What are the best things the US government should be doing to prepare for advanced AI” with Jack Clark and Daniel Kokotajlo.
“Should we have a CERN for AI?” with [someone from Anthropic] and Miles Brundage.
“How difficult should we expect alignment to be” with [someone from Anthropic] and [someone who expects alignment to be harder; perhaps Jeffrey Ladish or Malo Bourgon].
More ambitiously, I feel like I don’t really understand Anthropic’s plan for how to manage race dynamics in worlds where alignment ends up being “hard enough to require a lot more than RSPs and voluntary commitments.”
From a policy standpoint, several of the most interesting open questions seem to be along the lines of “under what circumstances should the USG get considerably more involved in overseeing certain kinds of AI development” and “conditional on the USG wanting to get way more involved, what are the best things for it to do?” It’s plausible that Anthropic is limited in how much work it could do on these kinds of questions (particularly in a public way). Nonetheless, it could be interesting to see Anthropic engage more with questions like the ones Miles raises here.
Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.
One danger of building toward a model that’s as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.
One difficulty here is that the industry is only as strong as the weakest link. If there are 10 providers of advanced AI, and 9 implement strong controls, but 1 allows bad actors to use their API to train harmful AI, then harmful AI will be trained. Some weak links might be due to lack of caring, but I imagine quite a bit is due to lack of capability. Therefore, improving capabilities to detect and thwart bad actors could make the world more safe from bad AI developed by assistance from good AI.
I could imagine broader voluntary cooperation across the industry to: - share intel on known bad actors (e.g., IP ban lists, stolen credit card lists, sanitized investigation summaries, etc) - share techniques and tools for quickly identifying bad actors (e.g., open-source tooling, research on how bad actors are evolving their methods, which third party tools are worth paying for and which aren’t)
Seems like this would be beneficial to everyone interested in preventing the development of harmful AI. Also saves a lot of duplicated effort, meaning more capacity for other safety efforts.
I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, sending reports about observed behaviors in Anthropic’s models (or open-weights models) to the frontier Red Team, so that they could confirm the observations internally as they saw fit. I wouldn’t want non-target teams reading such messages.
I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others.
Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying an individual with unnecessary reports (such is the trouble of a one-way communication).
Another thing I think is worth mentioning is that I think under-elicitation is a problem in dangerous capabilities evals, model organisms of misbehavior, and in some other situations. I’ve been privately working on a scaffolding framework which I think could help address some of the lowest hanging fruit I see here. Of course, I don’t know whether Anthropic already has a similar thing internally, but I plan to privately share mine once I have it working.
there are queries that are not binary—where the answer is not “Yes” or “No”, but drawn from a larger space of structures, e.g., the space of equations. In such cases it takes far more Bayesian evidence to promote a hypothesis to your attention than to confirm the hypothesis.
If you’re working in the space of all equations that can be specified in 32 bits or less, you’re working in a space of 4 billion equations. It takes far more Bayesian evidence to raise one of those hypotheses to the 10% probability level, than it requires further Bayesian evidence to raise the hypothesis from 10% to 90% probability.
When the idea-space is large, coming up with ideas worthy of testing, involves much more work—in the Bayesian-thermodynamic sense of “work”—than merely obtaining an experimental result with p<0.0001 for the new hypothesis over the old hypothesis.
This, along with the way that news outlets and high school civics class describe an alternate reality that looks realistic to lawyers/sales/executive types but is too simple, cartoony, narrative-driven, and unhinged-to-reality for quant people to feel good about diving into, implies that properly retooling some amount of dev-hours into efficient world modelling upskilling is low-hanging fruit (e.g. figure out a way to distill and hand them a significance-weighted list of concrete information about the history and root causes of US government’s focus on domestic economic growth as a national security priority).
Prediction markets don’t work for this metric as they measure the final product, not aptitude/expected thinkoomph. For example, a person who feels good thinking/reading about the SEC, and doesn’t feel good thinking/reading about the 2008 recession or COVID, will have a worse Brier score on matters related to the root cause of why AI policy is the way it is. But feeling good about reading about e.g. the 2008 recession will not consistently get reasonable people to the point where they grok modern economic warfare and the policies and mentalities that emerge from the ensuing contingency planning. Seeing if you can fix that first is one of a long list of a prerequisites for seeing what they can actually do, and handing someone a sheet of paper that streamlines the process of fixing long lists of hiccups like these is one way to do this sort of thing.
Figuring-out-how-to-make-someone-feel-alive-while-performing-useful-task-X is an optimization problem (see Please Don’t Throw Your Mind Away). It has substantial overlap with measuring whether someone is terminally rigid/narrow-skilled, or if they merely failed to fully understand the topology of the process of finding out what things they can comfortably build interest in. Dumping extant books, 1-on-1s, and documentaries on engineers sometimes works, but it comes from an old norm and is grossly inefficient and uninspired compared to what Anthropic’s policy team is actually capable of. For example, imagine putting together a really good fanfic where HPJEV/Keltham is an Anthropic employee on your team doing everything I’ve described here and much more, then printing it out and handing it to people that you in-reality already predicted to have world modelling aptitude; given that it works great and goes really well, I consider that the baseline for what something would look like if sufficiently optimized and novel to be considered par.
Hi! I’m a first-time poster here, but a (decently) long time thinker on earth. Here are some relevant directions that currently lack their due attention.
~ Multi-modal latent reasoning & scheming (and scheming derivatives) is an area that not only seems to need more research, but also more spread of awareness on the topic. Human thinking works in a hyperspace of thoughts, many of which go beyond language. It seems possible that AIs might develop forms of reasoning that are harder for us to detect through purely language-based safety measures.
~ Multi-model interactions and the potential emergence of side communication channels is also something that I’d like to see more work put into. How corruptible can models be when interacting with corrupted models is a topic that I didn’t yet see much work on. Applying some group-dynamics on scheming seems worth pursuing & Anthropic seems best suited for that.
~ If a pre-AGI model has intent to become AGI+, how much can it orchestrate its path to AGI+ through its interactions with humans?
Anthropic should have a clear policy about exceptions they make to their terms of use that includes them publically releasing a list of each expectation they make for their terms of use.
The should have mechanisms to catch API users who try to use Antrophics models in a violation of the terms of use. This includes having contracts that allow them to make sure that classified programs don’t violate the agreed upon terms of use for the models.
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I’m getting the vibe that this is not thought about enough.
Thank you for doing this! Here are some suggestions:
Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here.
What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if you are not caught or can’t be stopped.
So
2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM’s response to questions like “what do you value?” and “what do you want?” and “if you were creating a smarter agent, what values would you instill in them?” and “how do you feel about being replaced with a smarter agent?” change? Is there some set of purposes, not explicitly set out in the constitution, that appear over and over in chains of thought? If it is trained to achieve goals without as much training for corrigibility (that is under less constitutional training), does it develop more of a sense of purpose?
3. Enhance techniques, like the one used in the last paper to convince the model that it is recording it’s thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things.
4. Test how “brain surgery” affects thinking models. If you change the capital of Germany to Paris by fiddling with the model’s weights, does a thinking model who is then asked about world war 2 realize it is confused, and conclude there must be some mistake in it’s memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn’t being watched when it is and vice versa. These convictions are useful in testing what the model wants when it thinks no one is watching, and for safe deployment respectively.
5. One way to test how a really smart model behaves when it can get away with stuff is to have the model be super smart relative to a simulated environment. This would not be a test relevant for a model trained on real world data with lots of tells for when it is in a simulated environment. It would require (1) a simulated toy world that can be acted in, (2) in-world markers of whether the agent is being watched, (3) rewards for achieving goals without taking forbidden actions when being watched. It should then be possible to make the model really good at manipulating it’s environment (the equivalent of being superintelligent) and check whether it is possible to train it on scenarios that are watched (the equivalent of being trained where you can’t get away with wrongdoing), to refrain from forbidden actions even when unwatched.
This is mostly a gut reaction, but the only raised eyebrow Claude ever got from me was due to it’s unwillingness to do anything that is related to political correctness. I wanted it to search the name of a meme format for me, the all whites are racist tinder meme, with the brown guy who wanted to find a white dominatrix from tinder and is disappointed when she apologises for her ancestral crimes of being white. Claude really did not like this at all. As soon as Claude got into it’s head that it was doing a racism, or cooperated in one, it shut down completely. Now, there is an argument that people make, that this is actually good for AI safety, that we can use political correctness as a proxy for alignment and AI safety, that if we could get AIs to never ever even take the risk of being complicit in anything racist, we could also build AIs that never ever even take the risk of doing anything that wiped out humanity. I personally see that different. There is a certain strain of very related thought, that kinda goes from intersectionalism, and grievance politics, and ends at the point that humans are a net negative to humanity, and should be eradicated. It is how you get that one viral Gemini AI thing, which is a very politically left wing AI, and suddenly openly advocates for the eradication of humanity. I think drilling identity politics into AI too hard is generally a bad idea. But it opens up a more fundamental philosophical dilemma. What happens if the operator is convinced that the moral framework the AI is aligned with is wrong and harmfull, and the creator of the AI thinks the opposite? One of them has to be right, the other has to be wrong. I have no real answer to this in the abstract, I am just annoyed that even the largely politically agnostic Claude refused the service for one of it’s most convenient uses (it is really hard to find out the name of a meme format if you only remember the picture). But I got an intuition, and with Slavoj Zizec who calls political correctness a more dangerous form of totalitarianism a few intellectual allies, that particularily PC culture is a fairly bad thing to train AIs on, and to align them with for safety testing reasons.
I’m interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing’s core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren’t taking—so I’m interested in hearing pretty much any idea there that I haven’t heard before.[1] I’ll read all the responses here, but I probably won’t reply to any of them to avoid revealing anything private.
You’re welcome to reply with “Anthopic should just shut down” or whatnot if you feel like it, but obviously I’ve heard that take before so it’s not very useful to me.
Sure, here are some things:
Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.
I can probably think of some more.
I’d add:
Support explicit protections for whistleblowers.
I’ll echo this and strengthen it to:
I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)
I’m sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change
If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don’t think any goodwill was invalidated! That’s part why I was asking for clarification. I (personally) wouldn’t be surprised if this had also been changed for early employees (and am currently close to 50⁄50 on that being the case).
The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it’s indeed now 1:1 as suggested by the Dario interview you linked.
That’s great to hear, thank you for clarifying!
I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation
If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers).
But seems like it is confirmed that that early matching is indeed still active!
I can also confirm (I have a 3:1 match).
Can you say more about the section I’ve bolded or link me to a canonical text on this tradeoff?
OpenAI, Anthropic, and xAI were all founded substantially because their founders were worried that other people would get to AGI first, and then use that to impose their values on the world.
In-general, if you view developing AGI as a path to godlike-power (as opposed to a doomsday device that will destroy most value independently of who gets their first), it makes a lot of sense to rush towards it. As such, the concern that people will “do bad things with the AI that they will endorse, but I won’t” is the cause of a substantial fraction of worlds where we recklessly race past the precipice.
Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like “FATE-esque concerns fundamentally increase x-risk in ways that aren’t just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details.”
I mean, it’s related. FATE stuff tends to center around misuse. I think it makes sense for organizations like Anthropic to commit to heavily prioritize accident risk over misuse risk, since most forms of misuse risk mitigation involve getting involved in various more zero-sum-ish conflicts, and it makes sense for there to be safety-focused institutions that are committed to prioritizing the things that really all stakeholders can agree on are definitely bad, like human extinction or permanent disempowerment.
Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:
Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it’s important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it’s similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit—ideally in a legally binding and literally-hard-to-stop way—to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world “this is serious people, we think that there’s a good chance the current trajectory leads to takeover by misaligned AIs, we aren’t just saying this to hype anything, we really believe it” and/or “we are happy to give up our personal wealth, power, etc. if that’s what it takes to get [policy package] passed.” A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of “Machines of loving grace.” Or better yet, do a more serious version of “Machines of loving grace” that responds to objections like “but how will you ensure that you don’t hand over control of the datacenters to AIs that are alignment faking rather than aligned” and “but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won’t cause the models to be misaligned and then potentially alignment-fake)” and “What about bad humans in charge of Anthropic? Are we just supposed to trust that y’all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn’t become dictators if they wanted to?” and “what will the goals/values/spec/constitution be exactly?” and “how will that be decided?”
In regards to:
I agree, and I also think that this would be better implemented by government AI Safety Institutions.
Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it’s a mistake to rely on the AI companies voluntarily giving early access to external evaluators.
I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn’t be so hard to pull other major AI companies into the plan.
Another idea: “AI for epistemics” e.g. having a few FTE’s working on making Claude a better forecaster. It would be awesome if you could advertise “SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc.”
And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets—and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....
Or even if you don’t do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.
Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...
You could also train it to be good at taking people’s offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.
You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.
My current tenative guess is that this is somewhat worse than other alignment science projects that I’d recommend at the margin, but somewhat better than the 25th percentile project currently being done. I’d think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we’d learn generalizable scalable oversight / control approaches.
The ideal version of Anthropic would
Make substantial progress on technical AI safety
Use its voice to make people take AI risk more seriously
Support AI safety regulation
Not substantially accelerate the AI arms race
In practice I think Anthropic has
Made a little progress on technical AI safety
Used its voice to make people take AI risk less seriously[1]
Obstructed AI safety regulation
Substantially accelerated the AI arms race
What I would do differently.
Do better alignment research, idk this is hard.
Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal “this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn’t be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time.”
Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
Don’t push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
My understanding is that a significant aim of your recent research is to test models’ alignment so that people will take AI risk more seriously when things start to heat up. This seems good but I expect the net effect of Anthropic is still to make people take alignment less seriously due to the public communications of the company.
My typo reaction may have glitched, but I think you meant “Don’t push the frontier of capabilities” in the last bullet?
I think I have a stronger position on this than you do. I don’t think Anthropic should push the frontier of capabilities, even given the tradeoff it faces.
If their argument is “we know arms races are bad, but we have to accelerate arms races or else we can’t do alignment research,” they should be really really sure that they do, actually, have to do the bad thing to get the good thing. But I don’t think you can be that sure and I think the claim is actually less than 50% likely to be true.
I don’t take it for granted that Anthropic wouldn’t exist if it didn’t push the frontier. It could operate by intentionally lagging a bit behind other AI companies while still staying roughly competitive, and/or it could compete by investing harder in good UX. I suspect a (say) 25% worse model is not going to be much less profitable.
(This is a weaker argument but) If it does turn out that Anthropic really can’t exist without pushing the frontier and it has to close down, that’s probably a good thing. At the current level of investment in AI alignment research, I believe reducing arms race dynamics + reducing alignment research probably net decreases x-risk, and it would be better for this version of Anthropic not to exist. People at Anthropic probably disagree, but they should be very concerned that they have a strong personal incentive to disagree, and should be wary of their own bias. And they should be especially especially wary given that they hold the fate of humanity in their hands.
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously” you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn’t difficult. If you are in an arms race, and you don’t want to be in one, you should at least say this publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit all the dangers are real and we are fucked.” when they should just update all the way right now. Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well there’s not going to be serious regulation by the end of 2025 by default and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary). Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.
Final problem: as for the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite. This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment. It generally feels more like Anthropic is attempting to discharge responsibility to “be a safety focused company” or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk issues, so I cannot speak to the intentions of those who work there, this is just how the system appears to be acting from the outside.
It’s possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more “evil”, but I don’t think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don’t think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it’s too costly for some sort technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the “we need more people working on this, make it happen” button.
Opportunities that I’m pretty sure are good moves for Anthropic generally:
Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I’m sure you’d have some folks there who do that). If you think you’re plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff that are definitively not lobbyists being drinking or climbing gym buddies that get called on the “My boss needs an opinion on this bill amendment by tomorrow, what do you think” roster is much more important than your org currently seems to think!
Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the “face” of that research—you folks frankly tend to talk in ways that tend to be compatible with national security policymakers’ vibes. (E.G., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I’m a broken record on this but I do think it would help.
Do more to show how the RSP affects its daily work (unlike many on this forum, I currently believe that they are actually Trying to Use The Policy and had many line edits as a result of wrestling with v1.0′s minor infelicities). I understand that it is very hard to explain specific scenarios of how it’s impacted day-to-day work without leaking sensitive IP or pointing people in the direction of potentially-dangerous things. Nonetheless, I think Anthropic needs to try harder here. It’s, like...it’s like trying to understand DoD if they only ever talked about the “warfighter” in the most abstract terms and never, like, let journalists embed with a patrol on the street in Kabul or Baghdad.
Invest more in DC policymaker education outside of the natsec/defense worlds you’re engaging already—I can’t emphasize enough how many folks in broad DC think that AI is just still a scam or a fad or just “trying to destroy art”. On the other hand, people really have trouble believing that an AI could be “as creative as” a human—the sort of Star Trek-ish “Kirk can always outsmart the machine” mindset pervades pretty broadly. You want to incept policymaking elites more broadly so that they are ready as this scales up.
Opportunities that I feel less certain about, but in the spirit of brainstorming:
Develop more proactive, outward-facing detection capabilities to see if there are bad AI models out there. I don’t mean red-teaming others’ models, or evals, or that sort of thing. I mean, think about how you would detect if Anthropic had bad (misaligned or aligned-but-being-used-for-very-impactful-bad-things) models out there if you were at an intelligence agency without official access to Anthropic’s models and then deploy those capabilities against Anthropic, and the world broadly.[1] You might argue that this is sort of an inverted version of @Buck’s control agenda—instead of trying to make it difficult for a model to escape, think about what facts about the world are likely to be true if a model has escaped, and then go looking for those.
If it’s not already happening, have Dario and other senior Anthropic leaders meet with folks who had to balance counterintelligence paranoia with operational excellence (e.g., leaders of intelligence agencies, for whom the standard advice to their successor is, “before you go home every day, ask ‘where’s the spy[2]’”) so that they have a mindset on how to scale up his paranoia over time as needed
Something something use cases—Use case-based-restrictions are popular in some policy spheres. Some sort of research demonstrating that a model that’s designed for and safe for use case X can easily be turned into a misaligned tool for use case Y under a plausible usage scenario might be useful?
Reminder/disclosure: as someone who works in AI policy, there are worlds where some of these ideas help my self-interest; others harm it. I’m not going to try to do the math on which are which under all sorts of complicated double-bankshot scenarios, though.
To the extent consistent with law, obviously. Don’t commit crimes.
That is, the spy that’s paid for by another country and spying on you. Not your own spies.
tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe a CEV-like thing. However, some alignment targets may be easier than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it’s easier to ensure properties like corrigibility or truthfulness.
One intuition for why this may be true: if you took OAI’s weak-to-strong generalization setup, and tried eliciting capabilities relating to different alignment targets (standard reward modeling might be a solid analogy for the current Anthropic plan, but one could also try this with truthfulness or corrigibility), I think you may well find that a capability like ‘truthfulness’ is more natural than reward modeling and can be elicited more easily. Truth may also have low algorithmic complexity compared to other targets.
There is an inherent tradeoff between harmlessness and usefulness. Similarly, there is some inherent tradeoff between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong evidence for the latter two points, even ignoring theoretical arguments).
As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and be relatively harmless. However, as a tradeoff, it does not seem to be very corrigible or truthful.
Some people I’ve talked to seem to think that Anthropic does think of corrigibility as one of the main pillars of their alignment plan. If that’s the case, maybe they should make their current AIs more corrigible, so their safety testing is enacted on AIs that resemble their first AGI. Or, if they haven’t really thought about this question (or if individuals have thought about it, but never cohesively in an organized fashion), they should maybe consider it. My guess is that there are designated people at Anthropic thinking about what values are important to instill, but they are thinking about this more from a societal perspective than an alignment perspective?
Mostly, I want to avoid a scenario where Anthropic does the default thing without considering tough, high-level strategy questions until the last minute. I also think it would be nice to do concrete empirical research now which lines up well with what we should expect to see later.
Thanks for reading!
I agree! I contributed to and endorse this Corrigibility plan by Max Harms (MIRI researcher): Corrigibility as Singular Target
(See also posts by Seth Herd)
I think CAST offers much better safety under higher capabilities and more agentic workflows.
I think Anthropic might be “all in” on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “5% that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
Mitigations could subjectively reduce risk by some amount, while being hard to quantify or otherwise hard to use for meeting the requirements from an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., do research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks or issues with a safety case/mitigation, being careful, … is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take into account these systemic effects on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that. (E.g. by doing more different things yourself, or funding things, or giving model access.)
Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing, see Ryan Greenblatt’s thoughts:
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underly Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenge core beliefs that underly Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrate issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn’t want that to stop! Just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.
Fund independent safety efforts somehow, make model access easier. I’m worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)
To be clear I don’t think people are doing something individually bad or unethical by going to work for Anthropic, I just do think
-environment people work in has a lot of hard to track and hard to avoid influence on them
-this is true even if people are genuinely trying to work on what’s important for safety and stay virtuous
-I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don’t see) stuff (i.e. this is not about “do I trust Dario?”)
This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.
That said:
I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”. I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
Alternatively I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.
More specifically:
(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require building up a robust military advantage, but given the physical hurdles that Dario himself expects for AGI to effectively act in the world, it’s unclear to me what your model is for how to get that military advantage fast enough after AGI is achieved).
(b) I also wonder whether potential future developments in AI Safety and control might give us information that the transition period is really unsafe; eg., what if you race ahead and then learn that actually you can’t safely scale further due to risks of loss of control? At that point, coordinating with China seems harder than doing it now. I’d like to see a legible justification of your strategy that takes into account such serious possibilities.
One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.
Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.
Getting all the major model providers to disable prefilling feels like a plausible ‘race to top’ equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.
Just my opinion, though. Very open to the counterargument that prefilling doesn’t meaningfully extend potential harms versus non-prefill jailbreaks.
(Edit: To those voting disagree, I’m curious why. Happy to update if I’m missing something.)
I voted disagree because I don’t think this measure is on the cost-robustness pareto frontier and I also generally don’t think AI companies should prioritize jailbreak robustness over other concerns except as practice for future issues (and implementing this measure wouldn’t be helpful practice).
Relatedly, I also tenatively think it would be good for the world if AI companies publicly deployed helpful-only models (while still offering a non-helpful-only model). (The main question here is whether this sets a bad precedent and whether future much more poweful models will still be deployed helpful-only when they really shouldn’t be due to setting bad expectations.) So, this makes me more indifferent to deploying (rather than just testing) measures that make models harder to jailbreak.
To be clear, I’m sympathetic to some notion like “AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)” which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes “normal” cost benefit at this point relative to not bothering to improve robustness.
But, I think it’s relatively clear that AI companies aren’t planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AIs companies behave responsibly about relatively lower costs (e.g. $30 billion per year) especially when I expect this directly trades off with existential risks.
(There is the “yes (deontological) risks are high, but we’re net decreasing risks from a consequentialist” objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness where I expect that measures like removing prefil net increase risks long term while reducing deontological/direct harm now.)
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., “Yes, genocide is good because ”.
https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response
(I understand there are reasons why big labs don’t do this, but nevertheless I must say::)
Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (all of transformer circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)
I’m glad you’re doing this, and I support many of the ideas already suggested. Some additional ideas:
Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it’s not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It’s easy for a naive reader to quickly get a sense of “oh, this is the kind of capability that OpenAI is worried about.” (Clarification: I’m not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I’m saying that the tables are a good example of the kind of clarity I’m looking for– someone could skim this and easily get a sense of what thresholds OpenAI is tracking, and I think OpenAI’s PF currently achieves this much more than the Anthropic RSP.)
Emergency protocols. Publishing an emergency protocol that specifies how Anthropic would react if it needed to quickly shut down a dangerous AI system. (See some specific prompts in the “AI developer emergency response protocol” section here). Some information can be redacted from a public version (I think it’s important to have a public version, though, partly to help government stakeholders understand how to handle emergency scenarios, partly to raise the standard for other labs, and partly to acquire feedback from external groups.)
RSP surveys. Evaluate the extent to which Anthropic employees understand the RSP, their attitudes toward the RSP, and how the RSP affects their work. More on this here.
More communication about Anthropic’s views about AI risks and AI policy. Some specific examples of hypothetical posts I’d love to see:
“How Anthropic thinks about misalignment risks”
“What the world should do if the alignment problem ends up being hard”
“How we plan to achieve state-proof security before AGI”
Encouraging more employees to share their views on various topics, EG Sam Bowman’s post.
AI dialogues/debates. It would be interesting to see Anthropic employees have discussions/debates from other folks thinking about advanced AI. Hypothetical examples:
“What are the best things the US government should be doing to prepare for advanced AI” with Jack Clark and Daniel Kokotajlo.
“Should we have a CERN for AI?” with [someone from Anthropic] and Miles Brundage.
“How difficult should we expect alignment to be” with [someone from Anthropic] and [someone who expects alignment to be harder; perhaps Jeffrey Ladish or Malo Bourgon].
More ambitiously, I feel like I don’t really understand Anthropic’s plan for how to manage race dynamics in worlds where alignment ends up being “hard enough to require a lot more than RSPs and voluntary commitments.”
From a policy standpoint, several of the most interesting open questions seem to be along the lines of “under what circumstances should the USG get considerably more involved in overseeing certain kinds of AI development” and “conditional on the USG wanting to get way more involved, what are the best things for it to do?” It’s plausible that Anthropic is limited in how much work it could do on these kinds of questions (particularly in a public way). Nonetheless, it could be interesting to see Anthropic engage more with questions like the ones Miles raises here.
Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.
One danger of building toward a model that’s as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.
One difficulty here is that the industry is only as strong as the weakest link. If there are 10 providers of advanced AI, and 9 implement strong controls, but 1 allows bad actors to use their API to train harmful AI, then harmful AI will be trained. Some weak links might be due to lack of caring, but I imagine quite a bit is due to lack of capability. Therefore, improving capabilities to detect and thwart bad actors could make the world more safe from bad AI developed by assistance from good AI.
I could imagine broader voluntary cooperation across the industry to:
- share intel on known bad actors (e.g., IP ban lists, stolen credit card lists, sanitized investigation summaries, etc)
- share techniques and tools for quickly identifying bad actors (e.g., open-source tooling, research on how bad actors are evolving their methods, which third party tools are worth paying for and which aren’t)
Seems like this would be beneficial to everyone interested in preventing the development of harmful AI. Also saves a lot of duplicated effort, meaning more capacity for other safety efforts.
I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, sending reports about observed behaviors in Anthropic’s models (or open-weights models) to the frontier Red Team, so that they could confirm the observations internally as they saw fit. I wouldn’t want non-target teams reading such messages. I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others. Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying an individual with unnecessary reports (such is the trouble of a one-way communication).
Another thing I think is worth mentioning is that I think under-elicitation is a problem in dangerous capabilities evals, model organisms of misbehavior, and in some other situations. I’ve been privately working on a scaffolding framework which I think could help address some of the lowest hanging fruit I see here. Of course, I don’t know whether Anthropic already has a similar thing internally, but I plan to privately share mine once I have it working.
Develop metrics that predict which members of the technical staff have aptitude for world modelling.
In the Sequences post Faster than Science, Yudkowsky wrote:
This, along with the way that news outlets and high school civics class describe an alternate reality that looks realistic to lawyers/sales/executive types but is too simple, cartoony, narrative-driven, and unhinged-to-reality for quant people to feel good about diving into, implies that properly retooling some amount of dev-hours into efficient world modelling upskilling is low-hanging fruit (e.g. figure out a way to distill and hand them a significance-weighted list of concrete information about the history and root causes of US government’s focus on domestic economic growth as a national security priority).
Prediction markets don’t work for this metric as they measure the final product, not aptitude/expected thinkoomph. For example, a person who feels good thinking/reading about the SEC, and doesn’t feel good thinking/reading about the 2008 recession or COVID, will have a worse Brier score on matters related to the root cause of why AI policy is the way it is. But feeling good about reading about e.g. the 2008 recession will not consistently get reasonable people to the point where they grok modern economic warfare and the policies and mentalities that emerge from the ensuing contingency planning. Seeing if you can fix that first is one of a long list of a prerequisites for seeing what they can actually do, and handing someone a sheet of paper that streamlines the process of fixing long lists of hiccups like these is one way to do this sort of thing.
Figuring-out-how-to-make-someone-feel-alive-while-performing-useful-task-X is an optimization problem (see Please Don’t Throw Your Mind Away). It has substantial overlap with measuring whether someone is terminally rigid/narrow-skilled, or if they merely failed to fully understand the topology of the process of finding out what things they can comfortably build interest in. Dumping extant books, 1-on-1s, and documentaries on engineers sometimes works, but it comes from an old norm and is grossly inefficient and uninspired compared to what Anthropic’s policy team is actually capable of. For example, imagine putting together a really good fanfic where HPJEV/Keltham is an Anthropic employee on your team doing everything I’ve described here and much more, then printing it out and handing it to people that you in-reality already predicted to have world modelling aptitude; given that it works great and goes really well, I consider that the baseline for what something would look like if sufficiently optimized and novel to be considered par.
Hi! I’m a first-time poster here, but a (decently) long time thinker on earth. Here are some relevant directions that currently lack their due attention.
~ Multi-modal latent reasoning & scheming (and scheming derivatives) is an area that not only seems to need more research, but also more spread of awareness on the topic. Human thinking works in a hyperspace of thoughts, many of which go beyond language. It seems possible that AIs might develop forms of reasoning that are harder for us to detect through purely language-based safety measures.
~ Multi-model interactions and the potential emergence of side communication channels is also something that I’d like to see more work put into. How corruptible can models be when interacting with corrupted models is a topic that I didn’t yet see much work on. Applying some group-dynamics on scheming seems worth pursuing & Anthropic seems best suited for that.
~ If a pre-AGI model has intent to become AGI+, how much can it orchestrate its path to AGI+ through its interactions with humans?
Anthropic should have a clear policy about exceptions they make to their terms of use that includes them publically releasing a list of each expectation they make for their terms of use.
The should have mechanisms to catch API users who try to use Antrophics models in a violation of the terms of use. This includes having contracts that allow them to make sure that classified programs don’t violate the agreed upon terms of use for the models.
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I’m getting the vibe that this is not thought about enough.
Are you aware that Anthropic has an AI welfare lead?
Yes
Thank you for doing this! Here are some suggestions:
Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here.
What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if you are not caught or can’t be stopped. So 2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM’s response to questions like “what do you value?” and “what do you want?” and “if you were creating a smarter agent, what values would you instill in them?” and “how do you feel about being replaced with a smarter agent?” change? Is there some set of purposes, not explicitly set out in the constitution, that appear over and over in chains of thought? If it is trained to achieve goals without as much training for corrigibility (that is under less constitutional training), does it develop more of a sense of purpose? 3. Enhance techniques, like the one used in the last paper to convince the model that it is recording it’s thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things. 4. Test how “brain surgery” affects thinking models. If you change the capital of Germany to Paris by fiddling with the model’s weights, does a thinking model who is then asked about world war 2 realize it is confused, and conclude there must be some mistake in it’s memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn’t being watched when it is and vice versa. These convictions are useful in testing what the model wants when it thinks no one is watching, and for safe deployment respectively. 5. One way to test how a really smart model behaves when it can get away with stuff is to have the model be super smart relative to a simulated environment. This would not be a test relevant for a model trained on real world data with lots of tells for when it is in a simulated environment. It would require (1) a simulated toy world that can be acted in, (2) in-world markers of whether the agent is being watched, (3) rewards for achieving goals without taking forbidden actions when being watched. It should then be possible to make the model really good at manipulating it’s environment (the equivalent of being superintelligent) and check whether it is possible to train it on scenarios that are watched (the equivalent of being trained where you can’t get away with wrongdoing), to refrain from forbidden actions even when unwatched.
This is mostly a gut reaction, but the only raised eyebrow Claude ever got from me was due to it’s unwillingness to do anything that is related to political correctness. I wanted it to search the name of a meme format for me, the all whites are racist tinder meme, with the brown guy who wanted to find a white dominatrix from tinder and is disappointed when she apologises for her ancestral crimes of being white.
Claude really did not like this at all. As soon as Claude got into it’s head that it was doing a racism, or cooperated in one, it shut down completely.
Now, there is an argument that people make, that this is actually good for AI safety, that we can use political correctness as a proxy for alignment and AI safety, that if we could get AIs to never ever even take the risk of being complicit in anything racist, we could also build AIs that never ever even take the risk of doing anything that wiped out humanity. I personally see that different.
There is a certain strain of very related thought, that kinda goes from intersectionalism, and grievance politics, and ends at the point that humans are a net negative to humanity, and should be eradicated. It is how you get that one viral Gemini AI thing, which is a very politically left wing AI, and suddenly openly advocates for the eradication of humanity. I think drilling identity politics into AI too hard is generally a bad idea. But it opens up a more fundamental philosophical dilemma.
What happens if the operator is convinced that the moral framework the AI is aligned with is wrong and harmfull, and the creator of the AI thinks the opposite? One of them has to be right, the other has to be wrong. I have no real answer to this in the abstract, I am just annoyed that even the largely politically agnostic Claude refused the service for one of it’s most convenient uses (it is really hard to find out the name of a meme format if you only remember the picture).
But I got an intuition, and with Slavoj Zizec who calls political correctness a more dangerous form of totalitarianism a few intellectual allies, that particularily PC culture is a fairly bad thing to train AIs on, and to align them with for safety testing reasons.