I’m interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing’s core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren’t taking—so I’m interested in hearing pretty much any idea there that I haven’t heard before.[1] I’ll read all the responses here, but I probably won’t reply to any of them to avoid revealing anything private.
You’re welcome to reply with “Anthopic should just shut down” or whatnot if you feel like it, but obviously I’ve heard that take before so it’s not very useful to me.
Use its voice to make people take AI risk more seriously
Support AI safety regulation
Not substantially accelerate the AI arms race
In practice I think Anthropic has
Made a little progress on technical AI safety
Used its voice to make people take AI risk less seriously[1]
Obstructed AI safety regulation
Substantially accelerated the AI arms race
What I would do differently.
Do better alignment research, idk this is hard.
Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal “this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn’t be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time.”
Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
Don’t push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
My understanding is that a significant aim of your recent research is to test models’ alignment so that people will take AI risk more seriously when things start to heat up. This seems good but I expect the net effect of Anthropic is still to make people take alignment less seriously due to the public communications of the company.
Don’t push the frontier of regulations. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
I think I have a stronger position on this than you do. I don’t think Anthropic should push the frontier of capabilities, even given the tradeoff it faces.
If their argument is “we know arms races are bad, but we have to accelerate arms races or else we can’t do alignment research,” they should be really really sure that they do, actually, have to do the bad thing to get the good thing. But I don’t think you can be that sure and I think the claim is actually less than 50% likely to be true.
I don’t take it for granted that Anthropic wouldn’t exist if it didn’t push the frontier. It could operate by intentionally lagging a bit behind other AI companies while still staying roughly competitive, and/or it could compete by investing harder in good UX. I suspect a (say) 25% worse model is not going to be much less profitable.
(This is a weaker argument but) If it does turn out that Anthropic really can’t exist without pushing the frontier and it has to close down, that’s probably a good thing. At the current level of investment in AI alignment research, I believe reducing arms race dynamics + reducing alignment research probably net decreases x-risk, and it would be better for this version of Anthropic not to exist. People at Anthropic probably disagree, but they should be very concerned that they have a strong personal incentive to disagree, and should be wary of their own bias. And they should be especially especially wary given that they hold the fate of humanity in their hands.
Anthropic’s marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn’t offset Anthropic’s contribution to the AI race.
I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic’s marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).
Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertisedpublicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi
I’ll echo this and strengthen it to:
… call for policymakers to stop the development of AGI.
I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)
I’m sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change
If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don’t think any goodwill was invalidated! That’s part why I was asking for clarification. I (personally) wouldn’t be surprised if this had also been changed for early employees (and am currently close to 50⁄50 on that being the case).
The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it’s indeed now 1:1 as suggested by the Dario interview you linked.
I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation
If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers).
But seems like it is confirmed that that early matching is indeed still active!
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Can you say more about the section I’ve bolded or link me to a canonical text on this tradeoff?
OpenAI, Anthropic, and xAI were all founded substantially because their founders were worried that other people would get to AGI first, and then use that to impose their values on the world.
In-general, if you view developing AGI as a path to godlike-power (as opposed to a doomsday device that will destroy most value independently of who gets their first), it makes a lot of sense to rush towards it. As such, the concern that people will “do bad things with the AI that they will endorse, but I won’t” is the cause of a substantial fraction of worlds where we recklessly race past the precipice.
Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like “FATE-esque concerns fundamentally increase x-risk in ways that aren’t just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details.”
I mean, it’s related. FATE stuff tends to center around misuse. I think it makes sense for organizations like Anthropic to commit to heavily prioritize accident risk over misuse risk, since most forms of misuse risk mitigation involve getting involved in various more zero-sum-ish conflicts, and it makes sense for there to be safety-focused institutions that are committed to prioritizing the things that really all stakeholders can agree on are definitely bad, like human extinction or permanent disempowerment.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it’s important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it’s similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit—ideally in a legally binding and literally-hard-to-stop way—to publishing it all with a 10-year delay.
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world “this is serious people, we think that there’s a good chance the current trajectory leads to takeover by misaligned AIs, we aren’t just saying this to hype anything, we really believe it” and/or “we are happy to give up our personal wealth, power, etc. if that’s what it takes to get [policy package] passed.” A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of “Machines of loving grace.” Or better yet, do a more serious version of “Machines of loving grace” that responds to objections like “but how will you ensure that you don’t hand over control of the datacenters to AIs that are alignment faking rather than aligned” and “but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won’t cause the models to be misaligned and then potentially alignment-fake)” and “What about bad humans in charge of Anthropic? Are we just supposed to trust that y’all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn’t become dictators if they wanted to?” and “what will the goals/values/spec/constitution be exactly?” and “how will that be decided?”
Another idea: “AI for epistemics” e.g. having a few FTE’s working on making Claude a better forecaster. It would be awesome if you could advertise “SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc.”
And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets—and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....
Or even if you don’t do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.
Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...
You could also train it to be good at taking people’s offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.
You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.
My current tenative guess is that this is somewhat worse than other alignment science projects that I’d recommend at the margin, but somewhat better than the 25th percentile project currently being done. I’d think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we’d learn generalizable scalable oversight / control approaches.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
I agree, and I also think that this would be better implemented by government AI Safety Institutions.
Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it’s a mistake to rely on the AI companies voluntarily giving early access to external evaluators.
I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn’t be so hard to pull other major AI companies into the plan.
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day.
What would be the purpose of 1 CoT view per user per day?
For scientific purposes. People don’t really have time to review that many CoT chains anyway, so 1 per day gets most of the value of what they’d realistically do. Plus they can target it at the stuff that’s suspicious. (Simple example: Suppose they get an impressive-seeming answer that later turns out to be total BS hallucination. They then think “I wonder if the model was BSing me” and click “view CoT.” Then they see whether it was an innocent mistake or not.)
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness.
This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously” you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn’t difficult.
If you are in an arms race, and you don’t want to be in one, you should at least say this publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit all the dangers are real and we are fucked.” when they should just update all the way right now.
Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well there’s not going to be serious regulation by the end of 2025 by default and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary).
Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.
Final problem: as for the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite.
This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment.
It generally feels more like Anthropic is attempting to discharge responsibility to “be a safety focused company” or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk issues, so I cannot speak to the intentions of those who work there, this is just how the system appears to be acting from the outside.
It’s possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more “evil”, but I don’t think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don’t think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it’s too costly for some sort technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the “we need more people working on this, make it happen” button.
Opportunities that I’m pretty sure are good moves for Anthropic generally:
Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I’m sure you’d have some folks there who do that). If you think you’re plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff that are definitively not lobbyists being drinking or climbing gym buddies that get called on the “My boss needs an opinion on this bill amendment by tomorrow, what do you think” roster is much more important than your org currently seems to think!
Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the “face” of that research—you folks frankly tend to talk in ways that tend to be compatible with national security policymakers’ vibes. (E.G., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I’m a broken record on this but I do think it would help.
Do more to show how the RSP affects its daily work (unlike many on this forum, I currently believe that they are actually Trying to Use The Policy and had many line edits as a result of wrestling with v1.0′s minor infelicities). I understand that it is very hard to explain specific scenarios of how it’s impacted day-to-day work without leaking sensitive IP or pointing people in the direction of potentially-dangerous things. Nonetheless, I think Anthropic needs to try harder here. It’s, like...it’s like trying to understand DoD if they only ever talked about the “warfighter” in the most abstract terms and never, like, let journalists embed with a patrol on the street in Kabul or Baghdad.
Invest more in DC policymaker education outside of the natsec/defense worlds you’re engaging already—I can’t emphasize enough how many folks in broad DC think that AI is just still a scam or a fad or just “trying to destroy art”. On the other hand, people really have trouble believing that an AI could be “as creative as” a human—the sort of Star Trek-ish “Kirk can always outsmart the machine” mindset pervades pretty broadly. You want to incept policymaking elites more broadly so that they are ready as this scales up.
Opportunities that I feel less certain about, but in the spirit of brainstorming:
Develop more proactive, outward-facing detection capabilities to see if there are bad AI models out there. I don’t mean red-teaming others’ models, or evals, or that sort of thing. I mean, think about how you would detect if Anthropic had bad (misaligned or aligned-but-being-used-for-very-impactful-bad-things) models out there if you were at an intelligence agency without official access to Anthropic’s models and then deploy those capabilities against Anthropic, and the world broadly.[1] You might argue that this is sort of an inverted version of @Buck’s control agenda—instead of trying to make it difficult for a model to escape, think about what facts about the world are likely to be true if a model has escaped, and then go looking for those.
If it’s not already happening, have Dario and other senior Anthropic leaders meet with folks who had to balance counterintelligence paranoia with operational excellence (e.g., leaders of intelligence agencies, for whom the standard advice to their successor is, “before you go home every day, ask ‘where’s the spy[2]’”) so that they have a mindset on how to scale up his paranoia over time as needed
Something something use cases—Use case-based-restrictions are popular in some policy spheres. Some sort of research demonstrating that a model that’s designed for and safe for use case X can easily be turned into a misaligned tool for use case Y under a plausible usage scenario might be useful?
Reminder/disclosure: as someone who works in AI policy, there are worlds where some of these ideas help my self-interest; others harm it. I’m not going to try to do the math on which are which under all sorts of complicated double-bankshot scenarios, though.
I think Anthropic might be “all in” on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “5% that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
Mitigations could subjectively reduce risk by some amount, while being hard to quantify or otherwise hard to use for meeting the requirements from an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., do research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks or issues with a safety case/mitigation, being careful, … is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take into account these systemic effects on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that. (E.g. by doing more different things yourself, or funding things, or giving model access.)
Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing, see Ryan Greenblatt’s thoughts:
I think there are things Anthropic could do that would help considerably. This could include:
Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic in cases where the employee might be interested in this and this could be a reasonable fit.
Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
Higher levels of certain types of transparency (e.g. being more transparent about the exact details of safety cases, open-sourcing evals (probably you just want to provide random IID subsets of the eval or to share high-level details and then share the exact implementation on request)).
I’m not sure exactly what is good here, but I don’t think Anthropic is as limited as you suggest.
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underly Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenge core beliefs that underly Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrate issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn’t want that to stop! Just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.
tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe a CEV-like thing. However, some alignment targets may be easier than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it’s easier to ensure properties like corrigibility or truthfulness.
One intuition for why this may be true: if you took OAI’s weak-to-strong generalization setup, and tried eliciting capabilities relating to different alignment targets (standard reward modeling might be a solid analogy for the current Anthropic plan, but one could also try this with truthfulness or corrigibility), I think you may well find that a capability like ‘truthfulness’ is more natural than reward modeling and can be elicited more easily. Truth may also have low algorithmic complexity compared to other targets.
There is an inherent tradeoff between harmlessness and usefulness. Similarly, there is some inherent tradeoff between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong evidence for the latter two points, even ignoring theoretical arguments).
As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and be relatively harmless. However, as a tradeoff, it does not seem to be very corrigible or truthful.
Some people I’ve talked to seem to think that Anthropic does think of corrigibility as one of the main pillars of their alignment plan. If that’s the case, maybe they should make their current AIs more corrigible, so their safety testing is enacted on AIs that resemble their first AGI. Or, if they haven’t really thought about this question (or if individuals have thought about it, but never cohesively in an organized fashion), they should maybe consider it. My guess is that there are designated people at Anthropic thinking about what values are important to instill, but they are thinking about this more from a societal perspective than an alignment perspective?
Mostly, I want to avoid a scenario where Anthropic does the default thing without considering tough, high-level strategy questions until the last minute. I also think it would be nice to do concrete empirical research now which lines up well with what we should expect to see later.
Fund independent safety efforts somehow, make model access easier. I’m worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)
To be clear I don’t think people are doing something individually bad or unethical by going to work for Anthropic, I just do think -environment people work in has a lot of hard to track and hard to avoid influence on them -this is true even if people are genuinely trying to work on what’s important for safety and stay virtuous -I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don’t see) stuff (i.e. this is not about “do I trust Dario?”)
I’m glad you’re doing this, and I support many of the ideas already suggested. Some additional ideas:
Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it’s not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It’s easy for a naive reader to quickly get a sense of “oh, this is the kind of capability that OpenAI is worried about.” (Clarification: I’m not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I’m saying that the tables are a good example of the kind of clarity I’m looking for– someone could skim this and easily get a sense of what thresholds OpenAI is tracking, and I think OpenAI’s PF currently achieves this much more than the Anthropic RSP.)
Emergency protocols. Publishing an emergency protocol that specifies how Anthropic would react if it needed to quickly shut down a dangerous AI system. (See some specific prompts in the “AI developer emergency response protocol” section here). Some information can be redacted from a public version (I think it’s important to have a public version, though, partly to help government stakeholders understand how to handle emergency scenarios, partly to raise the standard for other labs, and partly to acquire feedback from external groups.)
RSP surveys. Evaluate the extent to which Anthropic employees understand the RSP, their attitudes toward the RSP, and how the RSP affects their work. More on this here.
More communication about Anthropic’s views about AI risks and AI policy. Some specific examples of hypothetical posts I’d love to see:
“How Anthropic thinks about misalignment risks”
“What the world should do if the alignment problem ends up being hard”
“How we plan to achieve state-proof security before AGI”
Encouraging more employees to share their views on various topics, EG Sam Bowman’s post.
AI dialogues/debates. It would be interesting to see Anthropic employees have discussions/debates from other folks thinking about advanced AI. Hypothetical examples:
“What are the best things the US government should be doing to prepare for advanced AI” with Jack Clark and Daniel Kokotajlo.
“Should we have a CERN for AI?” with [someone from Anthropic] and Miles Brundage.
“How difficult should we expect alignment to be” with [someone from Anthropic] and [someone who expects alignment to be harder; perhaps Jeffrey Ladish or Malo Bourgon].
More ambitiously, I feel like I don’t really understand Anthropic’s plan for how to manage race dynamics in worlds where alignment ends up being “hard enough to require a lot more than RSPs and voluntary commitments.”
From a policy standpoint, several of the most interesting open questions seem to be along the lines of “under what circumstances should the USG get considerably more involved in overseeing certain kinds of AI development” and “conditional on the USG wanting to get way more involved, what are the best things for it to do?” It’s plausible that Anthropic is limited in how much work it could do on these kinds of questions (particularly in a public way). Nonetheless, it could be interesting to see Anthropic engage more with questions like the ones Miles raises here.
This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.
That said:
I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”.
I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
Alternatively I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.
More specifically:
(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require building up a robust military advantage, but given the physical hurdles that Dario himself expects for AGI to effectively act in the world, it’s unclear to me what your model is for how to get that military advantage fast enough after AGI is achieved).
(b) I also wonder whether potential future developments in AI Safety and control might give us information that the transition period is really unsafe; eg., what if you race ahead and then learn that actually you can’t safely scale further due to risks of loss of control? At that point, coordinating with China seems harder than doing it now. I’d like to see a legible justification of your strategy that takes into account such serious possibilities.
One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.
Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.
Getting all the major model providers to disable prefilling feels like a plausible ‘race to top’ equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.
Just my opinion, though. Very open to the counterargument that prefilling doesn’t meaningfully extend potential harms versus non-prefill jailbreaks.
(Edit: To those voting disagree, I’m curious why. Happy to update if I’m missing something.)
I voted disagree because I don’t think this measure is on the cost-robustness pareto frontier and I also generally don’t think AI companies should prioritize jailbreak robustness over other concerns except as practice for future issues (and implementing this measure wouldn’t be helpful practice).
Relatedly, I also tenatively think it would be good for the world if AI companies publicly deployed helpful-only models (while still offering a non-helpful-only model). (The main question here is whether this sets a bad precedent and whether future much more poweful models will still be deployed helpful-only when they really shouldn’t be due to setting bad expectations.) So, this makes me more indifferent to deploying (rather than just testing) measures that make models harder to jailbreak.
To be clear, I’m sympathetic to some notion like “AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)” which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes “normal” cost benefit at this point relative to not bothering to improve robustness.
But, I think it’s relatively clear that AI companies aren’t planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AIs companies behave responsibly about relatively lower costs (e.g. $30 billion per year) especially when I expect this directly trades off with existential risks.
(There is the “yes (deontological) risks are high, but we’re net decreasing risks from a consequentialist” objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness where I expect that measures like removing prefil net increase risks long term while reducing deontological/direct harm now.)
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., “Yes, genocide is good because ”.
(I understand there are reasons why big labs don’t do this, but nevertheless I must say::)
Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (all of transformer circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (how does this work financially).
To speak to the business side: Currently, the AI industry is relying on large expected future profits to generate investment. This is not a business model which is amenable to pausing for a significant period of time. I would like there to be minimal friction to pausing. One way to solve this problem is to invest heavily (and have a plan to invest more if a pause is imminent or ongoing) in revenue streams which are orthogonal to catastrophic risk, or at least not strongly positively correlated. As an initial brainstorm, these streams might include:
Making really cheap weak models.
AI integration in low-stakes domains or narrow AI systems (ideally combined with other security measures such as unlearning).
Selling AI safety solutions to other AI companies.
A plan for the business side of things should also include something about “what do we do about all the expected equity that employees lose if we pause, and how do we align incentives despite this”, it should probably include a commitment to ensure all investors and business partners understand that a long term pause may be necessary for safety and are okay with that risk (maybe this is sufficiently covered under the current corporate structure, I’m not sure, but those sure can change).
It’s all good and well to have an RSP that says “if X we will pause”, but the situation is probably going to be very messy with ambiguous evidence, crazy race pressures, crazy business pressures from external investors, etc. Investing in other revenue streams could reduce some of this pressure, and (if shared) potentially it could enable a wider pause. e.g., all AI companies see a viable path to profit if they just serve early AGIs for cheap, and nobody has intense business pressure to go to superintelligence.
Second, I would like Anthropic to invest in its ability to make credible commitments about internal activities and model properties. There is more about this in Miles Brundage’s blog post and my paper, as well as FlexHEGs. This might include things like:
cryptographically secured audit trails (version control for models). I find it kinda crazy that AI companies sometimes use external pre-deployment testers and then change a model in completely unverifiable ways and release it to users. Wouldn’t it be so cool if OpenAI couldn’t do that, and instead when their system card comes out there are certificates verifying which model was evaluated and how the model was changed from evaluation to deployment? That would be awesome!,
whistleblower programs, declaring and allowing external auditing of what compute is used for (e.g., differentiating training vs. inference clusters in a clear and relatively unspoofable way),
using TEEs and certificates to attest that the same model is evaluated as being deployed to users, and more.
I think investment/adoption in this from a major AI company could be a significant counterfactual shift in the likelihood of national or international regulation that includes verification. Many of these are also good for being-a-nice-company reasons, like I think it would be pretty cool if claims like Zero Data Retention were backed by actual technical guarantees rather than just trust (which it seems like is the status quo).
Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.
One danger of building toward a model that’s as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.
One difficulty here is that the industry is only as strong as the weakest link. If there are 10 providers of advanced AI, and 9 implement strong controls, but 1 allows bad actors to use their API to train harmful AI, then harmful AI will be trained. Some weak links might be due to lack of caring, but I imagine quite a bit is due to lack of capability. Therefore, improving capabilities to detect and thwart bad actors could make the world more safe from bad AI developed by assistance from good AI.
I could imagine broader voluntary cooperation across the industry to: - share intel on known bad actors (e.g., IP ban lists, stolen credit card lists, sanitized investigation summaries, etc) - share techniques and tools for quickly identifying bad actors (e.g., open-source tooling, research on how bad actors are evolving their methods, which third party tools are worth paying for and which aren’t)
Seems like this would be beneficial to everyone interested in preventing the development of harmful AI. Also saves a lot of duplicated effort, meaning more capacity for other safety efforts.
I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, sending reports about observed behaviors in Anthropic’s models (or open-weights models) to the frontier Red Team, so that they could confirm the observations internally as they saw fit. I wouldn’t want non-target teams reading such messages.
I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others.
Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying an individual with unnecessary reports (such is the trouble of a one-way communication).
Another thing I think is worth mentioning is that I think under-elicitation is a problem in dangerous capabilities evals, model organisms of misbehavior, and in some other situations. I’ve been privately working on a scaffolding framework which I think could help address some of the lowest hanging fruit I see here. Of course, I don’t know whether Anthropic already has a similar thing internally, but I plan to privately share mine once I have it working.
While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.
I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful.
Something similar on state proof security would be useful as well.
I think there is a way to do this such that the PR costs aren’t that high and thus it is worth doing unilaterially from a variety of perspectives.
there are queries that are not binary—where the answer is not “Yes” or “No”, but drawn from a larger space of structures, e.g., the space of equations. In such cases it takes far more Bayesian evidence to promote a hypothesis to your attention than to confirm the hypothesis.
If you’re working in the space of all equations that can be specified in 32 bits or less, you’re working in a space of 4 billion equations. It takes far more Bayesian evidence to raise one of those hypotheses to the 10% probability level, than it requires further Bayesian evidence to raise the hypothesis from 10% to 90% probability.
When the idea-space is large, coming up with ideas worthy of testing, involves much more work—in the Bayesian-thermodynamic sense of “work”—than merely obtaining an experimental result with p<0.0001 for the new hypothesis over the old hypothesis.
This, along with the way that news outlets and high school civics class describe an alternate reality that looks realistic to lawyers/sales/executive types but is too simple, cartoony, narrative-driven, and unhinged-to-reality for quant people to feel good about diving into, implies that properly retooling some amount of dev-hours into efficient world modelling upskilling is low-hanging fruit (e.g. figure out a way to distill and hand them a significance-weighted list of concrete information about the history and root causes of US government’s focus on domestic economic growth as a national security priority).
Prediction markets don’t work for this metric as they measure the final product, not aptitude/expected thinkoomph. For example, a person who feels good thinking/reading about the SEC, and doesn’t feel good thinking/reading about the 2008 recession or COVID, will have a worse Brier score on matters related to the root cause of why AI policy is the way it is. But feeling good about reading about e.g. the 2008 recession will not consistently get reasonable people to the point where they grok modern economic warfare and the policies and mentalities that emerge from the ensuing contingency planning. Seeing if you can fix that first is one of a long list of a prerequisites for seeing what they can actually do, and handing someone a sheet of paper that streamlines the process of fixing long lists of hiccups like these is one way to do this sort of thing.
Figuring-out-how-to-make-someone-feel-alive-while-performing-useful-task-X is an optimization problem (see Please Don’t Throw Your Mind Away). It has substantial overlap with measuring whether someone is terminally rigid/narrow-skilled, or if they merely failed to fully understand the topology of the process of finding out what things they can comfortably build interest in. Dumping extant books, 1-on-1s, and documentaries on engineers sometimes works, but it comes from an old norm and is grossly inefficient and uninspired compared to what Anthropic’s policy team is actually capable of. For example, imagine putting together a really good fanfic where HPJEV/Keltham is an Anthropic employee on your team doing everything I’ve described here and much more, then printing it out and handing it to people that you in-reality already predicted to have world modelling aptitude; given that it works great and goes really well, I consider that the baseline for what something would look like if sufficiently optimized and novel to be considered par.
Hi! I’m a first-time poster here, but a (decently) long time thinker on earth. Here are some relevant directions that currently lack their due attention.
~ Multi-modal latent reasoning & scheming (and scheming derivatives) is an area that not only seems to need more research, but also more spread of awareness on the topic. Human thinking works in a hyperspace of thoughts, many of which go beyond language. It seems possible that AIs might develop forms of reasoning that are harder for us to detect through purely language-based safety measures.
~ Multi-model interactions and the potential emergence of side communication channels is also something that I’d like to see more work put into. How corruptible can models be when interacting with corrupted models is a topic that I didn’t yet see much work on. Applying some group-dynamics on scheming seems worth pursuing & Anthropic seems best suited for that.
~ If a pre-AGI model has intent to become AGI+, how much can it orchestrate its path to AGI+ through its interactions with humans?
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I’m getting the vibe that this is not thought about enough.
Anthropic should have a clear policy about exceptions they make to their terms of use that includes them publically releasing a list of each expectation they make for their terms of use.
The should have mechanisms to catch API users who try to use Antrophics models in a violation of the terms of use. This includes having contracts that allow them to make sure that classified programs don’t violate the agreed upon terms of use for the models.
I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause down due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that imply decisive competitive advantage
2: Have a written list of assumptions you aim to maintain for each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Updates and violations are communicated at least internally.
These lists could vary based on ASL-levels etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We solicit third-party evaluations on the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
They could also have conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”). C.f. safety cases. I intend this as less binding and formal than Anthropic’s RSP.
3: Keep external actors up-to-speed. At present, I expect that in many cases there are months of delay between when the first employees discover something to when it is publicly known (e.g. research, but also with more informal observations about model capabilities and properties). But months of delay are relatively long during fast acceleration of AI R&D, and make the number of actors who can effectively contribute smaller.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress in those would make international coordination easier. This seems like a particularly valuable R&D area to automate, which is something frontier AI companies like Anthropic are uniquely fit to advance. Consider working with technical governance experts on how to go about this.
Thank you for doing this! Here are some suggestions:
Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here.
What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if you are not caught or can’t be stopped.
So
2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM’s response to questions like “what do you value?” and “what do you want?” and “if you were creating a smarter agent, what values would you instill in them?” and “how do you feel about being replaced with a smarter agent?” change? Is there some set of purposes, not explicitly set out in the constitution, that appear over and over in chains of thought? If it is trained to achieve goals without as much training for corrigibility (that is under less constitutional training), does it develop more of a sense of purpose?
3. Enhance techniques, like the one used in the last paper to convince the model that it is recording it’s thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things.
4. Test how “brain surgery” affects thinking models. If you change the capital of Germany to Paris by fiddling with the model’s weights, does a thinking model who is then asked about world war 2 realize it is confused, and conclude there must be some mistake in it’s memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn’t being watched when it is and vice versa. These convictions are useful in testing what the model wants when it thinks no one is watching, and for safe deployment respectively.
5. One way to test how a really smart model behaves when it can get away with stuff is to have the model be super smart relative to a simulated environment. This would not be a test relevant for a model trained on real world data with lots of tells for when it is in a simulated environment. It would require (1) a simulated toy world that can be acted in, (2) in-world markers of whether the agent is being watched, (3) rewards for achieving goals without taking forbidden actions when being watched. It should then be possible to make the model really good at manipulating it’s environment (the equivalent of being superintelligent) and check whether it is possible to train it on scenarios that are watched (the equivalent of being trained where you can’t get away with wrongdoing), to refrain from forbidden actions even when unwatched.
1. Introduce third-party mission alignment red teaming.
Anthropic should invite external parties to scrutinize and criticize Anthropic’s instrumental policy and specific actions based on whether they are actually advancing Anthropic’s stated mission, i.e. safe, powerful, and beneficial AI.
Tentatively, red-teaming parties might include other AI labs (adjusted for conflict of interest in some way?), as well as AI safety/alignment/risk-mitigation orgs: MIRI, Conjecture, ControlAI, PauseAI, CEST, CHT, METR, Apollo, CeSIA, ARIA, AI Safety Institutes, Convergence Analysis, CARMA, ACS, CAIS, CHAI, &c.
For the sake of clarity, each red team should provide a brief on their background views (something similar to MIRI’s Four Background Claims).
Along with their criticisms, red teams would be encouraged to propose somewhat specific changes, possibly ordered by magnitude, with something like “allocate marginally more funding to this” being a small change and “pause AGI development completely” being a very big change. Ideally, they should avoid making suggestions that include the possibility of making a small improvement now that would block a big improvement later (or make it more difficult).
Since Dario seems to be very interested in “race to the top” dynamics: if this mission alignment red-teaming program successfully signals well about Anthropic, other labs should catch up and start competing more intensely to be evaluated as positively as possible by third parties (“race towards safety”?).
It would also be good to have a platform where red teams can converse with Anthropic, as well as with each other, and the logs of their back-and-forth are published to be viewed by the public.
Anthropic should commit to taking these criticisms seriously. In particular, given how large the stakes are, they should commit to taking something like “many parties believe that Anthropic in its current form might be net-negative, even increasing the risk of extinction from AI” as a reason to pause or slow down, even if that’s contrary to their inside view.
2. Anthropic should make an explicit statement about its infohazard policy.
This statement should include how Anthropic thinks about and how it handles doing and publishing research that advances AGI development and doesn’t benefit safety/alignment/x-risk reduction to an extent sufficient to offset its contribution to (likely unsafe by default) AGI development.
Some things I’d especially like to see change (in as much as I know what is happening) are:
Making more use of available options to improve AI safety (I think there are more than I get the impression that Anthropic thinks. For instance, 30% of funds could be allocated to AI safety research if framed well and it would probably be below the noise threshold/froth of VC investing. Also, there probably is a fair degree of freedom in socially promoting concern around unaligned AGI.)
Explicit ways to handle various types of events like organizational value drift, hostile government takeover, organization get’s sold or unaligned investors have control, another AGI company takes a clear lead
Enforceable agreements to, under some AGI safety situations, not race and pool resources (a possible analogy from nuclear safety is having a no first strike policy)
Allocate a significant fraction of resources (like > 10% of capital) to AGI technical safety, organizational AGI safety strategy, and AGI governance
An organization consists of its people and great care needs to be taken in hiring employees and and their training and motivation for AGI safety. If not, I expect Anthropic to regress towards the mean (via an eternal September) and we’ll end up with another OpenAI situation where AGI safety culture is gradually lost. I want more work to be done here. (see also “Carefully Bootstrapped Alignment” is organizationally hard)
The owners of a company are also very important and ensuring that the LTBT has teeth and the members are selected well is key. Furthermore, preferential allocation of voting stock towards AGI algned investors should happen. Teaching investors about the company and what it does, including AGI safety issues, would be good to do. More speculatively, you can have various types of voting stock for various types of issues and you could build a system around this.
More generally you can use the following typology to inspire creating more interventions.
Interventions points to change/form an AGI company and its surroundings towards safer x-risk results (I’ve used this in advising startups on AI safety, it is also related to my post on positions where people can be in the loop):
Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government
Rules of organization, event triggers:
Rules:
x-risk mission statement
x-risk strategic plan
Triggering events:
Gets very big: windfall clause
Gets sold to another party: ethics board, restrictions on potential sale
Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a monitoring system, some sort of line in the sand
AI safety isn’t viable yet but dangerous AGI is: shut it down or pivot to sub AGI research and product development
Path decisions for organization: ethics board, aligned investors, good CEOs, giving x-risk orgs or people choice power, voting stock to aligned investors, periodic x-risk safety reminders
Resource allocation by organization: precommitting a varying percentage of money/time focused on x-risk reduction based on conditions with some up front, a commitment devices for funding allocation into the future
Owners of organization: aligned investors, voting stock for aligned investors, necessary percentage as aligned investors
Executive decision making: good CEOs, company mission statement?, company strategic plan?
Employees: select employees preferably by alignment, have only aligned people hire folks
Education of employees and/or investors by x-risk folks: employee training in x-risks and information hazards, a company culture that takes doing good seriously, coaching and therapy services
Social environment of employees: exposure to EAs and x-risk people socially at events, x-risk community support grants, a public pledge
Customers of organization: safety score for customers, differential pricing, customers have safety plans and information hazard plans
Uses of the technology: terms of service
Suppliers of organization: (mostly not relevant), select ethical or aligned suppliers
Difficulty to steal or copy: trade secrets, patents, service based, NDAs, (physical security)
Internal political hazards: (standard)
Information hazards: an institutional framework for research groups (FHI has a draft document)
Cyber hazards: (standard IT)
Financial hazards: (standard finances)
External political hazards: government industry partnerships, talk with x-risk folks about this, external x-risk outreach
Monitoring by x-risk folks: quarterly reports to x-risk organizations,
Projection by x-risk folks: commissioned projections, x-risk prediction market questions
Meta research and x-risk research: AI safety team, AI safety grants, meet up on organization safety at X-risk orgs, (x-risk strategy, AI safety strategy) – team and grants, information hazard grant question, go through these ideas in a check list fashion and allocate company computer folders to them (and they will get filled up), scalable and efficient grant giving system, form an accelerator, competitions, hackathon, BERI type project support
Coordination hazards: Incentivized coordination through cheap resources for joint projects, government industry partnerships, coordination theory and implementation grants, concrete coordination efforts, joint ethics boards, mergers with other groups to reduce arms race risks
Specific safety procedures: (depends on the project)
This is mostly a gut reaction, but the only raised eyebrow Claude ever got from me was due to it’s unwillingness to do anything that is related to political correctness. I wanted it to search the name of a meme format for me, the all whites are racist tinder meme, with the brown guy who wanted to find a white dominatrix from tinder and is disappointed when she apologises for her ancestral crimes of being white. Claude really did not like this at all. As soon as Claude got into it’s head that it was doing a racism, or cooperated in one, it shut down completely. Now, there is an argument that people make, that this is actually good for AI safety, that we can use political correctness as a proxy for alignment and AI safety, that if we could get AIs to never ever even take the risk of being complicit in anything racist, we could also build AIs that never ever even take the risk of doing anything that wiped out humanity. I personally see that different. There is a certain strain of very related thought, that kinda goes from intersectionalism, and grievance politics, and ends at the point that humans are a net negative to humanity, and should be eradicated. It is how you get that one viral Gemini AI thing, which is a very politically left wing AI, and suddenly openly advocates for the eradication of humanity. I think drilling identity politics into AI too hard is generally a bad idea. But it opens up a more fundamental philosophical dilemma. What happens if the operator is convinced that the moral framework the AI is aligned with is wrong and harmfull, and the creator of the AI thinks the opposite? One of them has to be right, the other has to be wrong. I have no real answer to this in the abstract, I am just annoyed that even the largely politically agnostic Claude refused the service for one of it’s most convenient uses (it is really hard to find out the name of a meme format if you only remember the picture). But I got an intuition, and with Slavoj Zizec who calls political correctness a more dangerous form of totalitarianism a few intellectual allies, that particularily PC culture is a fairly bad thing to train AIs on, and to align them with for safety testing reasons.
COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.
I got a question about Anthropic’s partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it’s reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn’t come as a surprise to me at all. Second, my personal take would be that I think it’s actually good that Anthropic is doing this. If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy. I do think there are some lines that you’d want to think about very carefully before considering crossing, but using Claude for intelligence analysis seems definitely fine to me. Ezra Klein has a great article on “The Problem With Everything-Bagel Liberalism” and I sometimes worry about Everything-Bagel AI Safety where e.g. it’s not enough to just focus on catastrophic risks, you also have to prevent any way that the government could possibly misuse your models. I think it’s important to keep your eye on the ball and not become too susceptible to an Everything-Bagel failure mode.
FWIW, as a common critic of Anthropic, I think I agree with this. I am a bit worried about engaging with the DoD being bad for Anthropic’s epistemics and ability to be held accountable by the government and public, but I think the basics of engaging on defense issues seems fine to me, and I don’t think risks from AI route basically at all through AI being used for building military technology, or intelligence analysis.
I would guess it does somewhat exacerbate risk. I think it’s unlikely (~15%) that alignment is easy enough that prosaic techniques even could suffice, but in those worlds I expect things go well mostly because the behavior of powerful models is non-trivially influenced/constrained by their training. In which case I do expect there’s more room for things to go wrong, the more that training is for lethality/adversariality.
Given the state of atheoretical confusion about alignment, I feel wary of confidently dismissing these sorts of basic, obvious-at-first-glance arguments about risk—like e.g., “all else equal, probably we should expect more killing people-type problems from models trained to kill people”—without decently strong countervailing arguments.
I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. But boosting US security is good so the net effect is unclear. And that’s very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses — my guess is this is too strong on weapons).
(and Anthropic has a Usage Policy, with exceptions, which disallows weapons stuff — my guess is this is too strong on weapons).
I think usage policies should not be read as commitments, and so I think it would be reasonable to expect that Anthropic will allow weapon development if it becomes highly profitable (and in contrast to other things Anthropic has promised, to not be interpreted as a broken promise when they do so).
If you are in any way involved in this project, please remember you may end up with the blood of millions of people on your hands. You will erode the moral inhibitions people in San Francisco have against building this sort of thing, and eventually SF will ship the best surveillance tools to dictators worldwide.
This is not hyperbole, this sort of thing has already happened. Zuckerberg basically ignored the genocide in Myanmar which his app enabled because maintaining his image of political neutrality is more important to him. Saudi Arabia has already executed people for social media posts found using tools written by western software developers.
Sure, xrisk may be more important than genocide, but please remember you will need to sleep at night knowing what you’ve done and you may not have any motivation to work on xrisk after this.
Nothing in that announcement suggests that this is limited to intelligence analysis.
U.S. intelligence and defense agencies do run misinformation campaigns such as the antivaxx campaign in the Philippines, and everything that’s public suggests that there’s not a block to using Claude offensively in that fashion.
If Anthropic has gotten promises that Claude is not being used offensively under this agreement they should be public about those promises and the mechanisms that regulate the use of Claude by U.S. intelligence and defense agencies.
I confirmed internally (which felt personally important for me to do) that our partnership with Palantir is still subject to the same terms outlined in the June post “Expanding Access to Claude for Government”:
For example, we have crafted a set of contractual exceptions to our general Usage Policy that are carefully calibrated to enable beneficial uses by carefully selected government agencies. These allow Claude to be used for legally authorized foreign intelligence analysis, such as combating human trafficking, identifying covert influence or sabotage campaigns, and providing warning in advance of potential military activities, opening a window for diplomacy to prevent or deter them. All other restrictions in our general Usage Policy, including those concerning disinformation campaigns, the design or use of weapons, censorship, and malicious cyber operations, remain.
The core of that page is as follows, emphasis added by me:
For example, with carefully selected government entities, we may allow foreign intelligence analysis in accordance with applicable law. All other use restrictions in our Usage Policy, including those prohibiting use for disinformation campaigns, the design or use of weapons, censorship, domestic surveillance, and malicious cyber operations, remain.
This is all public (in Anthropic’s up-to-date support.anthropic.com portal). Additionally it was announced when Anthropic first announced its intentions and approach around government in June.
The United States has laws that prevent the US intelligence and defense agencies from spying on their own population. The Snowden revelations showed us that the US intelligence and defense agencies did not abide by those limits.
Facebook has a usage policy that forbids running misinformation campaigns on their platform. That did not stop US intelligence and defense agencies from running disinformation campaigns on their platform.
Instead of just trusting contracts, Antrophics could add oversight mechanisms, so that a few Antrophics employees can look over how the models are used in practice and whether they are used within the bounds that Antrophics expects them to be used in.
If all usage of the models is classified and out of reach of checking by Antrophics employees, there’s no good reason to expect the contract to be limiting US intelligence and defense agencies if those find it important to use the models outside of how Antrophics expects them to be used.
For example, with carefully selected government entities, we may allow foreign intelligence analysis in accordance with applicable law. All other use restrictions in our Usage Policy, including those prohibiting use for disinformation campaigns, the design or use of weapons, censorship, domestic surveillance, and malicious cyber operations, remain.
This sounds to me like a very carefully worded nondenail denail.
If you say that one example of how you can break your terms is to allow a select government entity to do foreign intelligence analysis in accordance with applicable law and not do disinformation campaigns, you are not denying that another example of how you could do expectations is to allow disinformation campaigns.
If Antrophics would be sincere in this being the only expectation that’s made, it would be easy to add a promise to Exceptions to our Usage Policy, that Anthropic will publish all expectations that they make for the sake of transparency.
Don’t forget, that probably only a tiny number of Anthropic employees have seen the actual contracts and there’s a good chance that those are build by classification from talking with other Anthropics employees about what’s in the contracts.
At Antrophics you are a bunch of people who are supposed to think about AI safety and alignment in general. You could think of this as a testcase of how to design mechanisms for alignment and the Exceptions to our Usage Policy seems like a complete failure in that regard, because it neither contains mechanism to make all expectations public nor any mechanisms to make sure that the policies are followed in practice.
I likely agree that anthropic-><-palantir is good, but i disagree about blocking hte US government out of AI being a viable strategy. It seems to me like many military projects get blocked by inefficient beaurocracy, and it seems plausible to me for some legacy government contractors to get exclusive deals that delay US military ai projects for 2+ years
I think people opposing this have a belief that the counterfactual is “USG doesn’t have LLMs” instead of “USG spins up its own LLM development effort using the NSA’s no-doubt-substantial GPU clusters”.
Needless to say, I think the latter is far more likely.
NSA building it is arguably better because atleast they won’t sell it to countries like Saudi Arabia, and they have better ability to prevent people quitting or diffusing knowledge and code to companies outside.
Also most people in SF agree working for the NSA is morally grey at best, and Anthropic won’t be telling everyone this is morally okay.
When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won’t risk losing your job.
How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence “seemed fine”?
What future events would make you re-evaluate your position and state that the partnership was a bad thing?
Example:
-- A pro-US despot rounds up and tortures to death tens of thousands of pro-union activists and their families. Claude was used to analyse social media and mobile data, building a list of people sympathetic to the union movement, which the US then gave to their ally.
EDIT:
The first two sentences were overly confrontational, but I do think either question warrants an answer.
As a highly respected community member and prominent AI safety researchers, your stated beliefs and justifications will be influential to a wide range of people.
Personally, I think that overall it’s good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all, so I think (what I read as) your implicit demand for Evan Hubinger to do more work here is marginally unhelpful; I weakly think quick takes like this are marginally good.
I will add: It’s odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm. To my knowledge the Anthropic leadership has ~never engaged in public dialogue about why they’re getting rich building potentially-omnicidal-minds with worthy critics like Hinton, Bengio, Russell, Yudkowsky, etc, so I wouldn’t expect them or their employees to have high standards for public defenses of far less risky behavior like working with the US military.[1]
As an example of the low standards for Anthropic’s public discourse, notice how a recent essay about what’s required for Anthropic to succeed at AI Safety by Sam Bowman (a senior safety researcher at Anthropic) flatly states “Our ability to do our safety work depends in large part on our access to frontier technology… staying close to the frontier is perhaps our top priority in Chapter 1” with ~no defense of this claim or engagement with the sorts of reasons that I consider adding a marginal competitor to the suicide race is an atrocity, or acknowledgement that this makes him personally very wealthy (i.e. he and most other engineers at Anthropic will make millions of dollars due to Anthropic acting on this claim).
I think that overall it’s good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all
No disagreement.
your implicit demand for Evan Hubinger to do more work here is marginally unhelpful
The community seems to be quite receptive to the opinion, it doesn’t seem unreasonable to voice an objection. If you’re saying it is primarily the way I’ve written it that makes it unhelpful, that seems fair.
I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm.
However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn’t occur to me earlier.
I will add: It’s odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm.
Ultimately, the original quicktake was only justifying one facet of Anthropic’s work so that’s all I’ve engaged with. It would seem less helpful to bring up my wider objections.
I wouldn’t expect them or their employees to have high standards for public defenses of far less risky behavior
I don’t expect them to have a high standard for defending Anthropic’s behavior, but I do expect the LessWrong community to have a high standard for arguments.
Thanks for the responses, I have a better sense of how you’re thinking about these things.
I don’t feel much desire to dive into this further, except I want to clarify one thing, on the question of any demands in your comment.
I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm.
However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn’t occur to me earlier.
That actually wasn’t primarily the part that felt like a demand to me. This was the part:
How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence “seemed fine”?
I’m not quite sure what the relevance of the time was if not to suggest it needed to be high. I felt that this line implied something like “If your answer is around ’20 hours’, then I want to say that the correct standard should be ‘200 hours’”. I felt like it was a demand that Hubinger may have to spend 10x the time thinking about this question before he met your standards for being allowed to express his opinion on it.
But perhaps you just meant you wanted him to include an epistemic status, like “Epistemic status: <Here’s how much time I’ve spent thinking about this question>”.
We live in an information society → “You” are trying to build the ultimate dual use information tool/thing/weapon → The government require your service. No news there. So why the need to whitewash this? What about this is actually bothering you?
This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
What would be necessary to build a good auditing game benchmark?
How would AI safety AI work? What is necessary for it to go well?
How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
How has transparency changed over time—Chris claims it’s easier to interpret later models; is that true?
Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don’t expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.
Mmm, maybe you’re right (I was gonna say “making a top-level post which includes ‘chat with me about this if you actually wanna work on one of these’”, but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of ‘worth your time’ filter)
Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don’t know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won’t be), you get at most one extra bit of optimization towards alignment.
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
A deceptive model doesn’t have to have some sort of very explicit check for whether it’s in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it’s in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn’t really think about it very often because during training it just looks too unlikely.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
Humans don’t wirehead because reward reinforces the thoughts which the brain’s credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized—reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like “there’s a lollipop in front of me” and “I’m picking it up”), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human’s beliefs about reality or about the activation of the reward system.
It seems that you’re postulating that the human brain’s credit assignment algorithm is so bad that it can’t tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan’s Comment: My read of Evan’s comment (the parent to yours) is that there are a bunch of learned high-level-goals (“strategies”) with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection (“thoughts directly related to the current action” or “tactics”) all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that “strategic cognition” is that which chooses bundles of context-dependent tactical policies, and “tactical cognition” is that which implements a given tactic’s choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept: A similar way for this to be broken in humans, departing just a bit from Evan’s comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: “Feel physically great,” and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like “alone in my room, take the most fun available drug” which takes actions that result in Xw physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like “alone in my room, abstain from dissociative drugs and instead text my favorite friend” taking actions which result in Xs physical pleasure over a day.
Suppose also that asocial cognitions like “eat this” have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit a fraction of the time that it’s due, the average reward (over 30 days, say) credited to WIRE is <<Xw∗30. If and only if Xw>>Xs, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
Conclusion: I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
“Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.”
Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.
Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.
It is unclear how this would generalize to artificial systems. We don’t know if, or in what sense they would have experience, and why that would even matter in the first place. But I don’t think we can confidently say that something computationally equivalent to “valuing experience”, won’t be going on in artificial systems we are going to build.
So somebody picking this point would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer.
In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.
Here’s a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we’re giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)
Important disanalogies seem: 1) Most humans aren’t good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time. 2) The listener will assume that [score highly on niceness] isn’t the human’s only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating]) 3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes. The first seems hard to remedy within this analogy. I’d be a little concerned that people initially buy it, then think for themselves and conclude “But if we design a really clever niceness test, then it’d almost always work—all we need is clever people to work for a while on some good tests”. Combined with (3), this might seem like a decent solution.
Overall, I think what’s missing is that we’d expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn’t going to have this intuition in the human-niceness-test case.
I expect this is very susceptible to opinions about human nature. To someone who thinks humans ARE generally nice, they are likely to answer “yes, of course” to your question. To someone who thinks humans are generally extremely context-sensitive, which appears to be nice in the co-evolved social settings in which we generally interact, the answer is “who knows?”. But the latter group doesn’t need to be convinced, we’re already worried.
Surely nobody thinks that all humans are nice all the time and nobody would ever fake a niceness exam. I mean, I think humans are generally pretty good, but obviously that always has to come with a bunch of caveats because you don’t have to look very far into human history to see quite a lot of human-committed atrocities.
I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I’d still say yes in most cases, primarily because I think that aligned behavior is very simple, so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low level details, and thus we only need to transfer 300-1000 bits maximum, which is likely very easy.
Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the set of all long-term objectives, and can be easier or harder, but I don’t buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.
I think that sticking to capitalism as an economic system post-singularity would be pretty clearly catastrophic and something to strongly avoid, despite capitalism working pretty well today. I’ve talked about this a bit previously here, but some more notes on why:
Currently, our society requires the labor of sentient beings to produce goods and services. Capitalism incentivizes that labor by providing a claim on society’s overall production in exchange for it. If the labor of sentient beings becomes largely superfluous as an economic input, however, then having a system that effectively incentivizes that labor also becomes largely superfluous.
Currently, we rely on the mechanism of price discovery to aggregate and disseminate information about the optimal allocation of society’s resources. But it’s far from an optimal mechanism for allocating resources, and a superintelligence with full visibility and control could do a much better job of resource allocation without falling prey to common pitfalls of the price mechanism such as externalities.
Capitalism incentivizes the smart allocation of capital in the same way it incentivizes labor. If society can make smart capital allocation decisions without relying on properly incentivized investors, however, then as with labor there’s no reason to keep such an incentive mechanism.
While very large, the total optimization pressure humanity puts into economic competition today would likely pale in comparison to that of a post-singularity future. In the context of such a large increase in optimization pressure, we should generally expect extremal Goodhart failures.
More specifically, competitive dynamics incentivize the reinvestment of all economic proceeds back into resource acquisition lest you be outcompeted by another entity doing so. Such a dynamic results in pushing out actors that reinvest proceeds into the flourishing of sentient beings in exchange for those that disregard any such investment in favor of more resource acquisition.
Furthermore, the proceeds of post-singularity economic expansion flowing to the owners of existing capital is very far from socially optimal. It strongly disfavors future generations, simulated humans, and overall introduces a huge amount of variance into whether we end up with a positive future, putting a substantial amount of control into a set of people whose consumption decisions need not align with the socially optimal allocation.
Capitalism is a complex system with many moving parts, some of which are sometimes assumed to consist of the entirety of what defines it. What kinds of components do you see as being highly unlikely to be included in a successful utopia, and what components could be internal to a well functioning system as long as (potentially-left-unspecified) conditions are met? I could name some kinds of components (eg some kinds of contracts or enforcement mechanisms) that I expect to not be used in a utopia, though I suspect at this point you’ve seen my comments where I get into this, so I’m more interested in what you say without that prompting.
Who’s this “we” you’re talking about? It doesn’t seem to be any actual humans I recognize. As far as I can tell, the basics of capitalism (call it “simple capitalism”) are just what happens when individuals make decisions about resource use. We call it “ownership”, but really any form of resolution of the underlying conflict of preferences would likely work out similarly. That conflict is that humans have unbounded desires, and resources are bounded.
The drive to make goods and services for each other, in pursuit of selfish wants, does incentivize labor, but it’s not because “society requires” it, except in a pretty blurry aggregation of individual “requires”. Price discovery is only half of what market transactions do. The other half is usage limits and resource distribution. These are sides of a coin, and can’t be separated—without limited amounts, there is no price, without prices there is no agent-driven exchange of different kinds of resource.
I’m with you that modern capitalism is pretty unpleasant due to optimization pressure, and due to the easy aggregation of far more people and resources than historically possible, and than human culture was evolved around. I don’t see how the underlying system has any alternative that doesn’t do away with individual desire and consumption. Especially the relative/comparative consumption that seems to drive a LOT of perceived-luxury requirements.
I think some version of distributing intergalactic property rights uniformly (e.g. among humans existing in 2023) combined with normal capitalism isn’t clearly that catastrophic. (The distribution is what you call the egalitarian/democratic solution in the link.)
Maybe you lose about a factor of 5 or 10 over the literally optimal approach from my perspective (but maybe this analysis is tricky due to two envelope problems).
(You probably also need some short term protections to avoid shakedowns etc.)
Are you pessimstic that people will bother reflecting or thinking carefully prior to determing resource utilization or selling their property? I guess I feel like 10% of people being somewhat thoughtful matches the rough current distribution of extremely rich people.
If the situation was something like “current people, weighted by wealth, deliberate for a while on what to do with our resources” then I agree that’s probably like 5 − 10 times worse than the best approach (which is still a huge haircut) but not clearly catastrophic. But it’s not clear to me that’s what the default outcome of competitive dynamics would look like—sufficiently competitive dynamics could force out altruistic actors if they get outcompeted by non-altruistic actors.
I think one crux between you and I, at least, is that you see this as a considered division of how to divide resources, and I see it as an equilibrium consensus/acceptance of what property rights to enforce in maintenance, creation, and especially transfer of control/usage of resources. You think of static division, I think of equilibria and motion. Both are valid, but experience and resource use is ongoing and it must be accounted for.
I’m happy that the modern world generally approves of self-ownership: a given individual gets to choose what to do (within limits, but it’s nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered). It’s generally considered an alignment failure if individual will is just a resource that the AI manages. Physical resources (and behavioral resources, which are a sale of the results of some human efforts, a distinct resource from the mind performing the action) are generally owned by someone, and they trade some results to get the results of other people’s resources (including their labor and thought-output).
There could be a side-mechanism for some amount of resources just for existing, but it’s unlikely that it can be the primary transfer/allocation mechanism, as long as individuals have independent and conflicting desires. Current valuable self-owned products (office work, software design, etc.) probably reduces in value a lot. If all human output becomes valueless (in the “tradable for other desired things or activities” sense of valuable), I don’t think current humans will continue to exist.
Wirehead utopia (including real-world “all desires fulfilled without effort or trade”) doesn’t sound appealing or workable for what I know of my own and general human psychology.
self-ownership: a given individual gets to choose what to do (within limits, but it’s nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered)
for most people, this is just the right to sell their body to the machine. better than being forced at gunpoint, but being forced to by an empty fridge is not that much better, especially with monopoly accumulation as the default outcome. I agree that being able to mark ones’ selfhood boundaries with property contracts is generally good, but the ability to massively expand ones’ property contracts to exclude others from resource access is effectively a sort of scalping—sucking up resources so as to participate in an emergent cabal of resource withholding. In other words,
It’s generally considered an alignment failure if individual will is just a resource that the AI manages.
The core argument that there’s something critically wrong with capitalism is that the stock market has been an intelligence aggregation system for a long time and has a strong tendency to suck up the air in the system.
Utopia would need to involve a load balancing system that can prevent sucking-up-the-air type resource control imbalancing, so as to prevent
for most people, this is just the right to sell their body to the machine.
I think this is a big point of disagreement. For most people, there’s some amount of time/energy that’s sold to the machine, and it’s NOWHERE EVEN CLOSE to selling their actual asset (body and mind). There’s a LOT of leisure time, and a LOT of freedom even within work hours, and the choice to do something different tomorrow. It may not be as rewarding, but it’s available and the ability to make those decisions has not been sold or taken.
yeah like, above a certain level of economic power that’s true, but the overwhelming majority of humans are below that level, and AI is expected to raise that waterline. it’s kind of the primary failure mode I expect.
I mean, the 40 hour work week movement did help a lot. But it was an instance of a large push of organizing to demand constraint on what the aggregate intelligence (which at the time was the stock market—which is a trade market of police-enforceable ownership contracts), could demand of people who were not highly empowered. And it involved leveling a lopsided playing field by things that one side considered dirty tricks, such as strikes. I don’t think that’ll help against AI, to put it lightly.
To be clear, I recognize that your description is accurate for a significant portion of people. But it’s not close to the majority, and movement towards making it the majority has historically demanded changing the enforceable rules in a way that would reliably constrain the aggregate agency of the high dimensional control system steering the economy. When we have a sufficiently much more powerful one of those is when we expect failure, and right now it doesn’t seem to me that there’s any movement on a solution to that. We can talk about “oh we need something better than capitalism” but the problem with the stock market is simply that it’s enforceable prediction, thereby sucking up enough air from the room that a majority of people do not get the benefits you’re describing. If they did, then you’re right, it would be fine!
I mean, also there’s this, but somehow I expect that that won’t stick around long after robots are enough cheaper than humans
I think we’re talking past each other a bit. It’s absolutely true that the vast majority historically and, to a lesser extent, in modern times, are pretty constrained in their choices. This constraint is HIGHLY correlated with distance from participation in voluntary trade (of labor or resources).
I think the disconnect is the word “capitalism”—when you talk about stock markets and price discovery, that says to me you’re thinking of a small part of the system. I fully agree that there are a lot of really unpleasant equilibra with the scale and optimization pressure of the current legible financial world, and I’d love to undo a lot of it. But the underlying concept of enforced and agreed property rights and individual human decisions is important to me, and seems to be the thing that gets destroyed first when people decry capitalism.
Ok, it sounds, even to me, like “The heads. You’re looking at the heads. Sometimes he goes too far. He’s the first one to admit it.” But really, I STRONGLY expect that I am experiencing peak human freedom RIGHT NOW (well, 20 years ago, but it’s been rather flat for me and my cultural peers for a century, even if somewhat declining recently), and capitalism (small-c, individual decisions and striving, backed by financial aggregation with fairly broad participation) has been a huge driver of that. I don’t see any alternatives that preserve the individuality of even a significant subset of humanity.
If property rights to the stars are distributed prior to this, why does this competition cause issues? Maybe you basically agree here, but think it’s unlikely property will be distributed like this.
Separately, for competitive dynamics with reasonable rule of law and alignment ~solved, why do you think the strategy stealing assumption won’t apply? (There are a bunch of possible objections here, just wondering what your’s is. Personally I think strategy stealing is probably fine if the altruistic actors care about the long run and are strategic.)
Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the “if models are doing some bad behavior, the creators should be able to tell us why” point a bunch and it’s in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go “well that sure have everyone a false sense of security”.
Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is “in that situation, you have an aligned superintelligence, so just ask it what to do.” But I nevertheless want to philosophize a bit about this, for one main reason.
That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that’s what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.
So, what are some of the possible options for how to distribute resources post-singularity? Let’s go over some of the various possible solutions here and why I don’t think any of the obvious things here are what you want:
The evolutionary/capitalist solution: divide future resources in proportion to control of current resources (e.g. AIs). This is essentially what happens by default if you keep in place an essentially capitalist system and have all the profits generated by your AIs flow to the owners of those AIs. Another version of this is a more power/bargaining-oriented version where you divide resources amongst agents in proportion to the power those agents could bring to bear if they chose to fight for those resources.
The most basic problem with this solution is that it’s a moral catastrophe if the people that get all the resources don’t do good things with them. We should not want to build AIs that lead to this outcome—and I wouldn’t really call AIs that created this outcome aligned.
Another more subtle problem with this solution is that it creates terrible incentives for current people if they expect this to be what happens, since it e.g. incentivizes people to maximize their personal control over AIs at the expense of spending more resources trying to align those AIs.
I feel like I see this sort of thinking a lot and I think that if we were to make it more clear that this is never what should happen in a positive singularity that then people would do this sort of thing less.
The egalitarian/democratic solution: divide resources equally amongst all current humans. This is what naive preference utilitarianism would do.
Though it might be less obvious than with the first solution, I think this solution also leads to a moral catastrophe, since it cements current people as oligarchs over future people, leads to value lock-in, and could create a sort of tyranny of the present.
This solution also creates some weird incentives for trying to spread your ideals as widely as possible and to create as many people as possible that share your ideals.
The unilateralist/sovereign/past-agnostic/CEV solution: concentrate all resources under the control of your aligned AI(s), then distribute those resources in accordance with how they generate the most utility/value/happiness/goodness/etc., without any special prejudice given to existing people.
In some sense, this is the “right” thing to do, and it’s pretty close to what I would ideally want. However, it has a couple of issues:
Though, unlike the first solution, it doesn’t create any perverse incentives right now, it doesn’t create any positive incentives either.
Since this solution doesn’t give any special prejudice to current people, it might be difficult to get current people to agree to this solution, if that’s necessary.
The retroactive impact certificate solution: divide future resources in proportion to retroactively-assessed social value created by past agents.
This solution obviously creates the best incentives for current agents, so in that sense it does very well.
However, it still does pretty poorly on potentially creating a moral catastrophe, since the people that created the most social value in the past need not continue doing so in the future.
As above, I don’t think that you should want your aligned AI to implement any of these particular solutions. I think some combination of (3) and (4) is probably the best out of these options, though of course I’m sure that if you actually asked an aligned superintelligent AI it would do better than any of these. More broadly, though, I think that it’s important to note that (1), (2), and (4) are all failure stories, not success stories, and you shouldn’t expect them to happen in any scenario where we get alignment right.
Circling back to the original reason that I wanted to discuss all of this, which is how it should influence our decisions now:
Obviously, the part of your values that isn’t selfish should continue to want things to go well.
However, for the part of your values that cares about your own future resources, if that’s something that you care about, how you go about maximizing that is going to depend on what you most expect between (1), (2), and (4).
First, in determining this, you should condition on situations where you don’t just die or are otherwise totally disempowered, since obviously those are the only cases where this matters. And if that probability is quite high, then presumably a lot of your selfish values should just want to minimize that probability.
However, going ahead anyway and conditioning on everyone not being dead/disempowered, what should you expect? I think that (1) and (2) are possible in worlds where get some parts of alignment right, but overall are pretty unlikely: it’s a very narrow band of not-quite-alignment that gets you there. So probably if I cared about this a lot I’d focus more on (4) than (1) and (2).
Which of course gets me to why I’m writing this up, since that seems like a good message for people to pick up. Though I expect it to be quite difficult to effectively communicate this very broadly.
Disagree. I’m in favor of (2) because I think that what you call a “tyranny of the present” makes perfect sense. Why would the people of the present not maximize their utility functions, given that it’s the rational thing for them to do by definition of “utility function”? “Because utilitarianism” is a nonsensical answer IMO. I’m not a utilitarian. If you’re a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.
As to problem (2.b), I don’t think it’s a serious issue in practice because time until singularity is too short for it to matter much. If it was, we could still agree on a cooperative strategy that avoids a wasteful race between present people.
Even if you don’t personally value other people, if you’re willing to step behind the veil of ignorance with respect to whether you’ll be an early person or a late person, it’s clearly advantageous before you know which one you’ll be to not allocate all the resources to the early people.
First, I said I’m not a utilitarian, I didn’t say that I don’t value other people. There’s a big difference!
Second, I’m not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue “you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning”. But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1].
Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn’t compel me to precommit to a symmetric agreement with future people in advance.
There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.
If a future decision is to shape the present, we need to predict it.
The decision-theoretic strategy “Figure out where you are, then act accordingly.” is merely an approximation to “Use the policy that leads to the multiverse you prefer.”. You *can* bring your present loyalties with you behind the veil, it might just start to feel farcically Goodhartish at some point.
There are of course no probabilities of being born into one position or another, there are only various avatars through which your decisions affect the multiverse. The closest thing to probabilities you’ll find is how much leverage each avatar offers: The least wrong probabilistic anthropics translates “the effect of your decisions through avatar A is twice as important as through avatar B” into “you are twice as likely to be A as B”.
So if we need probabilities of being born early vs. late, we can compare their leverage. We find:
Quantum physics shows that the timeline splits a bazillion times a second. So each second, you become a bazillion yous, but the portions of the multiverse you could first-order impact are divided among them. Therefore, you aren’t significantly more or less likely to find yourself a second earlier or later.
Astronomy shows that there’s a mazillion stars up there. So we build a Dyson sphere and huge artificial womb clusters, and one generation later we launch one colony ship at each star. But in that generation, the fate of the universe becomes a lot more certain, so we should expect to find ourselves before that point, not after.
Physics shows that several constants are finely tuned to support organized matter. We can infer that elsewhere, they aren’t. Since you’d think that there are other, less precarious arrangements of physical law with complex consequences, we can also moderately update towards that very precariousness granting us unusual leverage about something valuable in the acausal marketplace.
History shows that we got lucky during the Cold War. We can slightly update towards:
Current events are important.
Current events are more likely after a Cold War.
Nuclear winter would settle the universe’s fate.
The news show that ours is the era of inadequate AI alignment theory. We can moderately update towards being in a position to affect that.
i feel like (2)/(3) is about “what does (the altruistic part of) my utility function want?” and 4 is “how do i decision-theoretically maximize said utility function?”. they’re different layers, and ultimately it’s (2)/(3) we want to maximize, but maximizing (2)/(3) entails allocating some of the future lightcore to (4).
I think that (3) does create strong incentives right now—at least for anyone who assumes [without any special prejudice given to existing people] amounts to [and it’s fine to disassemble everyone who currently exists if it’s the u/v/h/g/etc maximising policy]. This seems probable to me, though not entirely clear (I’m not an optimal configuration, and smoothly, consciousness-preservingly transitioning me to something optimal seems likely to take more resources than unceremoniously recycling me).
Incentives now include:
Prevent (3) happening.
To the extent that you expect (3) and are selfish, live for the pre-(3) time interval, for (3) will bring your doom.
On (4), “This solution obviously creates the best incentives for current agents” seems badly mistaken unless I’m misunderstanding you.
Something in this spirit would need to be based on a notion of [expected social value], not on actual contributions, since in the cases where we die we don’t get to award negative points.
For example, suppose my choice is between: A: {90% chance doom for everyone; 10% I save the world} B: {85% chance doom for everyone; 15% someone else saves the world}
To the extent that I’m selfish, and willing to risk some chance of death for greater control over the future, I’m going to pick A under (4). The more selfish, reckless and power-hungry I am, and the more what I want deviates from that most people want, the more likely I am to actively put myself in position to take an A-like action.
Moreover, if the aim is to get ideal incentives, it seems unavoidable to have symmetry and include punishments rather than only [you don’t get many resources]. Otherwise the incentive is to shoot for huge magnitude of impact, without worrying much about the sign, since no-one can do worse than zero resources.
If correct incentives were the only desideratum, I don’t see how we’d avoid [post-singularity ‘hell’ (with some probability) for those who’re reckless with AGI]. For any nicer approach I think we’d either be incenting huge impact with uncertain sign, or failing to incent large sacrifice in order to save the world.
Perhaps the latter is best?? I.e. cap the max resources for any individual at a fairly low level, so that e.g. [this person was in the top percentile of helpfulness] and [this person saved the world] might get you about the same resource allocation. It has the upsides both of making ‘hell’ less necessary, and of giving a lower incentive to overconfident people with high-impact schemes. (but still probably incents particularly selfish people to pick A over B)
If correct incentives were the only desideratum, I don’t see how we’d avoid [post-singularity ‘hell’ (with some probability) for those who’re reckless with AGI].
(some very mild spoilers for yudkowsky’s planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))
[edit: links in spoiler tags are bugged. in the spoiler, “speculates about” should link to here and “have the stance that” to here]
“The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else’s utility function, ever, period. That’s a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate.”
“If at any point you’re calculating how to pessimize a utility function, you’re doing it wrong. If at any point you’re thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you’re doing it wrong.)”
i think this is a pretty solid principle. i’m very much not a fan of anyone’s utility function getting pessimized.
so pessimising a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ units of utility function handshake, and the more damage you do the less handshake you get.
it never gets into the negative, that way we never go out of our way to pessimize someone’s utility function; but it does get increasingly close to 0.
(this isn’t necessarily a scheme i’m committed to, it’s just an idea i’ve had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
Hmmm, I don’t think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it—in which case almost everyone is going to get approximately no influence). More fundamentally, I don’t think it works out in this kind of case due to logical uncertainty.
If I’m uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that’s not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It’s closer to [there’s an 80% chance that {in ~99% of timelines everyone dies}; there’s a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won’t disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let’s train this model and see what happens] - most of the danger there being a logical uncertainty thing)
I think influence would need to be based on expected social value given the ‘correct’ level of logical uncertainty—probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you’d make for them based on information you have]. Or at least some subjective perspective seems to be necessary—and something that doesn’t give more points for overconfident people.
Here’s a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I’ll just call “happy humans”). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:
You have some returns to happy humans (or whatever else you’re using as your final source of utility) in terms of how much utility you get from some number of happy humans existing.
In most cases, I think those returns are likely to be diminishing, but nevertheless monotonically increasing and differentiable. For example, maybe you have logarithmic returns to happy humans.
We happen to live in a massive multiverse. (Imo the Everett interpretation is settled science, and I don’t think you need to accept anything else to make this go through, but note that we’re only depending on the existence of any sort of big multiverse here—the one that the Everett interpretation gives you is just the only one that we know is guaranteed to actually exist.)
In a massive multiverse, the total number of happy humans is absolutely gigantic (let’s ignore infinite ethics problems, though, and assume it’s finite—though I think this argument still goes through in the infinite case, it just then depends on whatever infinite ethics framework you like).
Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse. (Though you could get out of this by claiming that what you really care about is happy humans per universe, that’s a pretty strange thing to care about—it’s like caring about happy humans per acre.)
As a result, the effective returns to happy humans that you are exposed to within this universe reflect only the local behavior of your overall returns. (Note that this assumes “happy humans” are fungible, which I don’t actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don’t think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.)
As anyone who has taken any introductory calculus will know, the local behavior of any differentiable function is linear.
Since we assumed that your overall returns were differentiable and monotonically increasing, the local returns must be linear with a positive slope.
You’re assuming that your utility function should have the general form of valuing each “source of utility” independently and then aggregating those values (such that when aggregating you no longer need the details of each “source” but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).
This is the fungibility objection I address above:
Note that this assumes “happy humans” are fungible, which I don’t actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don’t think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.
Ah, I think I didn’t understand that parenthetical remark and skipped over it. Questions:
I thought your bottom line conclusion was “you should have linear returns to whatever your final source of utility is” and I’m not sure how “centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible” relates to that.
I’m not sure that the way my utility function deviates from fungibility is “I care about overall diversity of human experience throughout the multiverse”. What if it’s “I care about diversity of human experience in this Everett branch” then I could get a non-linear diminishing returns effect where as humans colonize more stars or galaxies, each new human experience is more likely to duplicate an existing human experience or be too similar to an existing experience so that its value has to be discounted.
The thing I was trying to say there is that I think the non-fungibility concern pushes in the direction of superlinear rather than sublinear local returns to “happy humans” per universe. (Since concentrating the “happy humans” likely makes it easier to ensure that they’re all different.)
I agree that this will depend on exactly in what way you think your final source of utility is non-fungible. I would argue that “diversity of human experience in this Everett branch” is a pretty silly thing to care about, though. I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.
I read it, and I think I broadly agree with it, but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that’s an argument for superlinear local returns to happy humans, since it favors concentrating them so that it’s easier to make them as diverse as possible.
but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance
I have a stronger intuition for “identical copy immortality” when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you’ll flip a quantum coin and disintegrate Earth if it comes up heads.
I’m not sure if this is actually a correct intuition, but I’m also not sure that it’s not, so I’m not willing to make assumptions that contradict it.
Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse.
Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.
I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.
(Though you could get out of this by claiming that what you really care about is happy humans per universe, that’s a pretty strange thing to care about—it’s like caring about happy humans per acre.)
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn’t enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.
being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty
Maybe? I don’t really know how to reason about this.
If that’s true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far less humans in the universe-at-large, meaning that you’re at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting more sentient beings in worlds where there is overall fewer sentient beings (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.
If you want to produce warning shots for deceptive alignment, you’re faced with a basic sequencing question. If the model is capable of reasoning about its training process before it’s capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won’t be detectable.
I’m interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing’s core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren’t taking—so I’m interested in hearing pretty much any idea there that I haven’t heard before.[1] I’ll read all the responses here, but I probably won’t reply to any of them to avoid revealing anything private.
You’re welcome to reply with “Anthopic should just shut down” or whatnot if you feel like it, but obviously I’ve heard that take before so it’s not very useful to me.
The ideal version of Anthropic would
Make substantial progress on technical AI safety
Use its voice to make people take AI risk more seriously
Support AI safety regulation
Not substantially accelerate the AI arms race
In practice I think Anthropic has
Made a little progress on technical AI safety
Used its voice to make people take AI risk less seriously[1]
Obstructed AI safety regulation
Substantially accelerated the AI arms race
What I would do differently.
Do better alignment research, idk this is hard.
Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal “this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn’t be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time.”
Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
Don’t push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.
My understanding is that a significant aim of your recent research is to test models’ alignment so that people will take AI risk more seriously when things start to heat up. This seems good but I expect the net effect of Anthropic is still to make people take alignment less seriously due to the public communications of the company.
I think I have a stronger position on this than you do. I don’t think Anthropic should push the frontier of capabilities, even given the tradeoff it faces.
If their argument is “we know arms races are bad, but we have to accelerate arms races or else we can’t do alignment research,” they should be really really sure that they do, actually, have to do the bad thing to get the good thing. But I don’t think you can be that sure and I think the claim is actually less than 50% likely to be true.
I don’t take it for granted that Anthropic wouldn’t exist if it didn’t push the frontier. It could operate by intentionally lagging a bit behind other AI companies while still staying roughly competitive, and/or it could compete by investing harder in good UX. I suspect a (say) 25% worse model is not going to be much less profitable.
(This is a weaker argument but) If it does turn out that Anthropic really can’t exist without pushing the frontier and it has to close down, that’s probably a good thing. At the current level of investment in AI alignment research, I believe reducing arms race dynamics + reducing alignment research probably net decreases x-risk, and it would be better for this version of Anthropic not to exist. People at Anthropic probably disagree, but they should be very concerned that they have a strong personal incentive to disagree, and should be wary of their own bias. And they should be especially especially wary given that they hold the fate of humanity in their hands.
I agree.
Anthropic’s marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn’t offset Anthropic’s contribution to the AI race.
I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic’s marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).
My typo reaction may have glitched, but I think you meant “Don’t push the frontier of capabilities” in the last bullet?
Sure, here are some things:
Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.
I can probably think of some more.
I’d add:
Support explicit protections for whistleblowers.
I’ll echo this and strengthen it to:
I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)
I’m sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change
If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don’t think any goodwill was invalidated! That’s part why I was asking for clarification. I (personally) wouldn’t be surprised if this had also been changed for early employees (and am currently close to 50⁄50 on that being the case).
The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it’s indeed now 1:1 as suggested by the Dario interview you linked.
That’s great to hear, thank you for clarifying!
I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation
If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers).
But seems like it is confirmed that that early matching is indeed still active!
I can also confirm (I have a 3:1 match).
Can you say more about the section I’ve bolded or link me to a canonical text on this tradeoff?
OpenAI, Anthropic, and xAI were all founded substantially because their founders were worried that other people would get to AGI first, and then use that to impose their values on the world.
In-general, if you view developing AGI as a path to godlike-power (as opposed to a doomsday device that will destroy most value independently of who gets their first), it makes a lot of sense to rush towards it. As such, the concern that people will “do bad things with the AI that they will endorse, but I won’t” is the cause of a substantial fraction of worlds where we recklessly race past the precipice.
Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like “FATE-esque concerns fundamentally increase x-risk in ways that aren’t just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details.”
I mean, it’s related. FATE stuff tends to center around misuse. I think it makes sense for organizations like Anthropic to commit to heavily prioritize accident risk over misuse risk, since most forms of misuse risk mitigation involve getting involved in various more zero-sum-ish conflicts, and it makes sense for there to be safety-focused institutions that are committed to prioritizing the things that really all stakeholders can agree on are definitely bad, like human extinction or permanent disempowerment.
Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:
Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it’s important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it’s similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit—ideally in a legally binding and literally-hard-to-stop way—to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world “this is serious people, we think that there’s a good chance the current trajectory leads to takeover by misaligned AIs, we aren’t just saying this to hype anything, we really believe it” and/or “we are happy to give up our personal wealth, power, etc. if that’s what it takes to get [policy package] passed.” A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don’t know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of “Machines of loving grace.” Or better yet, do a more serious version of “Machines of loving grace” that responds to objections like “but how will you ensure that you don’t hand over control of the datacenters to AIs that are alignment faking rather than aligned” and “but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won’t cause the models to be misaligned and then potentially alignment-fake)” and “What about bad humans in charge of Anthropic? Are we just supposed to trust that y’all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn’t become dictators if they wanted to?” and “what will the goals/values/spec/constitution be exactly?” and “how will that be decided?”
Another idea: “AI for epistemics” e.g. having a few FTE’s working on making Claude a better forecaster. It would be awesome if you could advertise “SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc.”
And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets—and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....
Or even if you don’t do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.
Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...
You could also train it to be good at taking people’s offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.
You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.
My current tenative guess is that this is somewhat worse than other alignment science projects that I’d recommend at the margin, but somewhat better than the 25th percentile project currently being done. I’d think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we’d learn generalizable scalable oversight / control approaches.
In regards to:
I agree, and I also think that this would be better implemented by government AI Safety Institutions.
Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it’s a mistake to rely on the AI companies voluntarily giving early access to external evaluators.
I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn’t be so hard to pull other major AI companies into the plan.
What would be the purpose of 1 CoT view per user per day?
For scientific purposes. People don’t really have time to review that many CoT chains anyway, so 1 per day gets most of the value of what they’d realistically do. Plus they can target it at the stuff that’s suspicious. (Simple example: Suppose they get an impressive-seeming answer that later turns out to be total BS hallucination. They then think “I wonder if the model was BSing me” and click “view CoT.” Then they see whether it was an innocent mistake or not.)
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously” you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn’t difficult. If you are in an arms race, and you don’t want to be in one, you should at least say this publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit all the dangers are real and we are fucked.” when they should just update all the way right now. Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well there’s not going to be serious regulation by the end of 2025 by default and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary). Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.
Final problem: as for the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite. This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment. It generally feels more like Anthropic is attempting to discharge responsibility to “be a safety focused company” or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk issues, so I cannot speak to the intentions of those who work there, this is just how the system appears to be acting from the outside.
It’s possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more “evil”, but I don’t think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don’t think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it’s too costly for some sort technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the “we need more people working on this, make it happen” button.
Opportunities that I’m pretty sure are good moves for Anthropic generally:
Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I’m sure you’d have some folks there who do that). If you think you’re plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff that are definitively not lobbyists being drinking or climbing gym buddies that get called on the “My boss needs an opinion on this bill amendment by tomorrow, what do you think” roster is much more important than your org currently seems to think!
Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the “face” of that research—you folks frankly tend to talk in ways that tend to be compatible with national security policymakers’ vibes. (E.G., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I’m a broken record on this but I do think it would help.
Do more to show how the RSP affects its daily work (unlike many on this forum, I currently believe that they are actually Trying to Use The Policy and had many line edits as a result of wrestling with v1.0′s minor infelicities). I understand that it is very hard to explain specific scenarios of how it’s impacted day-to-day work without leaking sensitive IP or pointing people in the direction of potentially-dangerous things. Nonetheless, I think Anthropic needs to try harder here. It’s, like...it’s like trying to understand DoD if they only ever talked about the “warfighter” in the most abstract terms and never, like, let journalists embed with a patrol on the street in Kabul or Baghdad.
Invest more in DC policymaker education outside of the natsec/defense worlds you’re engaging already—I can’t emphasize enough how many folks in broad DC think that AI is just still a scam or a fad or just “trying to destroy art”. On the other hand, people really have trouble believing that an AI could be “as creative as” a human—the sort of Star Trek-ish “Kirk can always outsmart the machine” mindset pervades pretty broadly. You want to incept policymaking elites more broadly so that they are ready as this scales up.
Opportunities that I feel less certain about, but in the spirit of brainstorming:
Develop more proactive, outward-facing detection capabilities to see if there are bad AI models out there. I don’t mean red-teaming others’ models, or evals, or that sort of thing. I mean, think about how you would detect if Anthropic had bad (misaligned or aligned-but-being-used-for-very-impactful-bad-things) models out there if you were at an intelligence agency without official access to Anthropic’s models and then deploy those capabilities against Anthropic, and the world broadly.[1] You might argue that this is sort of an inverted version of @Buck’s control agenda—instead of trying to make it difficult for a model to escape, think about what facts about the world are likely to be true if a model has escaped, and then go looking for those.
If it’s not already happening, have Dario and other senior Anthropic leaders meet with folks who had to balance counterintelligence paranoia with operational excellence (e.g., leaders of intelligence agencies, for whom the standard advice to their successor is, “before you go home every day, ask ‘where’s the spy[2]’”) so that they have a mindset on how to scale up his paranoia over time as needed
Something something use cases—Use case-based-restrictions are popular in some policy spheres. Some sort of research demonstrating that a model that’s designed for and safe for use case X can easily be turned into a misaligned tool for use case Y under a plausible usage scenario might be useful?
Reminder/disclosure: as someone who works in AI policy, there are worlds where some of these ideas help my self-interest; others harm it. I’m not going to try to do the math on which are which under all sorts of complicated double-bankshot scenarios, though.
To the extent consistent with law, obviously. Don’t commit crimes.
That is, the spy that’s paid for by another country and spying on you. Not your own spies.
I think Anthropic might be “all in” on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “5% that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
Mitigations could subjectively reduce risk by some amount, while being hard to quantify or otherwise hard to use for meeting the requirements from an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., do research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks or issues with a safety case/mitigation, being careful, … is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take into account these systemic effects on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that. (E.g. by doing more different things yourself, or funding things, or giving model access.)
Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing, see Ryan Greenblatt’s thoughts:
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underly Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenge core beliefs that underly Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrate issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn’t want that to stop! Just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.
tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe a CEV-like thing. However, some alignment targets may be easier than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it’s easier to ensure properties like corrigibility or truthfulness.
One intuition for why this may be true: if you took OAI’s weak-to-strong generalization setup, and tried eliciting capabilities relating to different alignment targets (standard reward modeling might be a solid analogy for the current Anthropic plan, but one could also try this with truthfulness or corrigibility), I think you may well find that a capability like ‘truthfulness’ is more natural than reward modeling and can be elicited more easily. Truth may also have low algorithmic complexity compared to other targets.
There is an inherent tradeoff between harmlessness and usefulness. Similarly, there is some inherent tradeoff between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong evidence for the latter two points, even ignoring theoretical arguments).
As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and be relatively harmless. However, as a tradeoff, it does not seem to be very corrigible or truthful.
Some people I’ve talked to seem to think that Anthropic does think of corrigibility as one of the main pillars of their alignment plan. If that’s the case, maybe they should make their current AIs more corrigible, so their safety testing is enacted on AIs that resemble their first AGI. Or, if they haven’t really thought about this question (or if individuals have thought about it, but never cohesively in an organized fashion), they should maybe consider it. My guess is that there are designated people at Anthropic thinking about what values are important to instill, but they are thinking about this more from a societal perspective than an alignment perspective?
Mostly, I want to avoid a scenario where Anthropic does the default thing without considering tough, high-level strategy questions until the last minute. I also think it would be nice to do concrete empirical research now which lines up well with what we should expect to see later.
Thanks for reading!
I agree! I contributed to and endorse this Corrigibility plan by Max Harms (MIRI researcher): Corrigibility as Singular Target
(See also posts by Seth Herd)
I think CAST offers much better safety under higher capabilities and more agentic workflows.
Fund independent safety efforts somehow, make model access easier. I’m worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)
To be clear I don’t think people are doing something individually bad or unethical by going to work for Anthropic, I just do think
-environment people work in has a lot of hard to track and hard to avoid influence on them
-this is true even if people are genuinely trying to work on what’s important for safety and stay virtuous
-I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don’t see) stuff (i.e. this is not about “do I trust Dario?”)
I’m glad you’re doing this, and I support many of the ideas already suggested. Some additional ideas:
Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it’s not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It’s easy for a naive reader to quickly get a sense of “oh, this is the kind of capability that OpenAI is worried about.” (Clarification: I’m not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I’m saying that the tables are a good example of the kind of clarity I’m looking for– someone could skim this and easily get a sense of what thresholds OpenAI is tracking, and I think OpenAI’s PF currently achieves this much more than the Anthropic RSP.)
Emergency protocols. Publishing an emergency protocol that specifies how Anthropic would react if it needed to quickly shut down a dangerous AI system. (See some specific prompts in the “AI developer emergency response protocol” section here). Some information can be redacted from a public version (I think it’s important to have a public version, though, partly to help government stakeholders understand how to handle emergency scenarios, partly to raise the standard for other labs, and partly to acquire feedback from external groups.)
RSP surveys. Evaluate the extent to which Anthropic employees understand the RSP, their attitudes toward the RSP, and how the RSP affects their work. More on this here.
More communication about Anthropic’s views about AI risks and AI policy. Some specific examples of hypothetical posts I’d love to see:
“How Anthropic thinks about misalignment risks”
“What the world should do if the alignment problem ends up being hard”
“How we plan to achieve state-proof security before AGI”
Encouraging more employees to share their views on various topics, EG Sam Bowman’s post.
AI dialogues/debates. It would be interesting to see Anthropic employees have discussions/debates from other folks thinking about advanced AI. Hypothetical examples:
“What are the best things the US government should be doing to prepare for advanced AI” with Jack Clark and Daniel Kokotajlo.
“Should we have a CERN for AI?” with [someone from Anthropic] and Miles Brundage.
“How difficult should we expect alignment to be” with [someone from Anthropic] and [someone who expects alignment to be harder; perhaps Jeffrey Ladish or Malo Bourgon].
More ambitiously, I feel like I don’t really understand Anthropic’s plan for how to manage race dynamics in worlds where alignment ends up being “hard enough to require a lot more than RSPs and voluntary commitments.”
From a policy standpoint, several of the most interesting open questions seem to be along the lines of “under what circumstances should the USG get considerably more involved in overseeing certain kinds of AI development” and “conditional on the USG wanting to get way more involved, what are the best things for it to do?” It’s plausible that Anthropic is limited in how much work it could do on these kinds of questions (particularly in a public way). Nonetheless, it could be interesting to see Anthropic engage more with questions like the ones Miles raises here.
This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.
That said:
I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”. I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
Alternatively I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.
More specifically:
(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require building up a robust military advantage, but given the physical hurdles that Dario himself expects for AGI to effectively act in the world, it’s unclear to me what your model is for how to get that military advantage fast enough after AGI is achieved).
(b) I also wonder whether potential future developments in AI Safety and control might give us information that the transition period is really unsafe; eg., what if you race ahead and then learn that actually you can’t safely scale further due to risks of loss of control? At that point, coordinating with China seems harder than doing it now. I’d like to see a legible justification of your strategy that takes into account such serious possibilities.
One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.
Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.
Getting all the major model providers to disable prefilling feels like a plausible ‘race to top’ equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.
Just my opinion, though. Very open to the counterargument that prefilling doesn’t meaningfully extend potential harms versus non-prefill jailbreaks.
(Edit: To those voting disagree, I’m curious why. Happy to update if I’m missing something.)
I voted disagree because I don’t think this measure is on the cost-robustness pareto frontier and I also generally don’t think AI companies should prioritize jailbreak robustness over other concerns except as practice for future issues (and implementing this measure wouldn’t be helpful practice).
Relatedly, I also tenatively think it would be good for the world if AI companies publicly deployed helpful-only models (while still offering a non-helpful-only model). (The main question here is whether this sets a bad precedent and whether future much more poweful models will still be deployed helpful-only when they really shouldn’t be due to setting bad expectations.) So, this makes me more indifferent to deploying (rather than just testing) measures that make models harder to jailbreak.
To be clear, I’m sympathetic to some notion like “AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)” which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes “normal” cost benefit at this point relative to not bothering to improve robustness.
But, I think it’s relatively clear that AI companies aren’t planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AIs companies behave responsibly about relatively lower costs (e.g. $30 billion per year) especially when I expect this directly trades off with existential risks.
(There is the “yes (deontological) risks are high, but we’re net decreasing risks from a consequentialist” objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness where I expect that measures like removing prefil net increase risks long term while reducing deontological/direct harm now.)
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., “Yes, genocide is good because ”.
https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response
(I understand there are reasons why big labs don’t do this, but nevertheless I must say::)
Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (all of transformer circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (how does this work financially).
To speak to the business side: Currently, the AI industry is relying on large expected future profits to generate investment. This is not a business model which is amenable to pausing for a significant period of time. I would like there to be minimal friction to pausing. One way to solve this problem is to invest heavily (and have a plan to invest more if a pause is imminent or ongoing) in revenue streams which are orthogonal to catastrophic risk, or at least not strongly positively correlated. As an initial brainstorm, these streams might include:
Making really cheap weak models.
AI integration in low-stakes domains or narrow AI systems (ideally combined with other security measures such as unlearning).
Selling AI safety solutions to other AI companies.
A plan for the business side of things should also include something about “what do we do about all the expected equity that employees lose if we pause, and how do we align incentives despite this”, it should probably include a commitment to ensure all investors and business partners understand that a long term pause may be necessary for safety and are okay with that risk (maybe this is sufficiently covered under the current corporate structure, I’m not sure, but those sure can change).
It’s all good and well to have an RSP that says “if X we will pause”, but the situation is probably going to be very messy with ambiguous evidence, crazy race pressures, crazy business pressures from external investors, etc. Investing in other revenue streams could reduce some of this pressure, and (if shared) potentially it could enable a wider pause. e.g., all AI companies see a viable path to profit if they just serve early AGIs for cheap, and nobody has intense business pressure to go to superintelligence.
Second, I would like Anthropic to invest in its ability to make credible commitments about internal activities and model properties. There is more about this in Miles Brundage’s blog post and my paper, as well as FlexHEGs. This might include things like:
cryptographically secured audit trails (version control for models). I find it kinda crazy that AI companies sometimes use external pre-deployment testers and then change a model in completely unverifiable ways and release it to users. Wouldn’t it be so cool if OpenAI couldn’t do that, and instead when their system card comes out there are certificates verifying which model was evaluated and how the model was changed from evaluation to deployment? That would be awesome!,
whistleblower programs, declaring and allowing external auditing of what compute is used for (e.g., differentiating training vs. inference clusters in a clear and relatively unspoofable way),
using TEEs and certificates to attest that the same model is evaluated as being deployed to users, and more.
I think investment/adoption in this from a major AI company could be a significant counterfactual shift in the likelihood of national or international regulation that includes verification. Many of these are also good for being-a-nice-company reasons, like I think it would be pretty cool if claims like Zero Data Retention were backed by actual technical guarantees rather than just trust (which it seems like is the status quo).
Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.
One danger of building toward a model that’s as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.
One difficulty here is that the industry is only as strong as the weakest link. If there are 10 providers of advanced AI, and 9 implement strong controls, but 1 allows bad actors to use their API to train harmful AI, then harmful AI will be trained. Some weak links might be due to lack of caring, but I imagine quite a bit is due to lack of capability. Therefore, improving capabilities to detect and thwart bad actors could make the world more safe from bad AI developed by assistance from good AI.
I could imagine broader voluntary cooperation across the industry to:
- share intel on known bad actors (e.g., IP ban lists, stolen credit card lists, sanitized investigation summaries, etc)
- share techniques and tools for quickly identifying bad actors (e.g., open-source tooling, research on how bad actors are evolving their methods, which third party tools are worth paying for and which aren’t)
Seems like this would be beneficial to everyone interested in preventing the development of harmful AI. Also saves a lot of duplicated effort, meaning more capacity for other safety efforts.
I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, sending reports about observed behaviors in Anthropic’s models (or open-weights models) to the frontier Red Team, so that they could confirm the observations internally as they saw fit. I wouldn’t want non-target teams reading such messages. I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others. Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying an individual with unnecessary reports (such is the trouble of a one-way communication).
Another thing I think is worth mentioning is that I think under-elicitation is a problem in dangerous capabilities evals, model organisms of misbehavior, and in some other situations. I’ve been privately working on a scaffolding framework which I think could help address some of the lowest hanging fruit I see here. Of course, I don’t know whether Anthropic already has a similar thing internally, but I plan to privately share mine once I have it working.
As discussed in How will we update about scheming?:
I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful.
Something similar on state proof security would be useful as well.
I think there is a way to do this such that the PR costs aren’t that high and thus it is worth doing unilaterially from a variety of perspectives.
Develop metrics that predict which members of the technical staff have aptitude for world modelling.
In the Sequences post Faster than Science, Yudkowsky wrote:
This, along with the way that news outlets and high school civics class describe an alternate reality that looks realistic to lawyers/sales/executive types but is too simple, cartoony, narrative-driven, and unhinged-to-reality for quant people to feel good about diving into, implies that properly retooling some amount of dev-hours into efficient world modelling upskilling is low-hanging fruit (e.g. figure out a way to distill and hand them a significance-weighted list of concrete information about the history and root causes of US government’s focus on domestic economic growth as a national security priority).
Prediction markets don’t work for this metric as they measure the final product, not aptitude/expected thinkoomph. For example, a person who feels good thinking/reading about the SEC, and doesn’t feel good thinking/reading about the 2008 recession or COVID, will have a worse Brier score on matters related to the root cause of why AI policy is the way it is. But feeling good about reading about e.g. the 2008 recession will not consistently get reasonable people to the point where they grok modern economic warfare and the policies and mentalities that emerge from the ensuing contingency planning. Seeing if you can fix that first is one of a long list of a prerequisites for seeing what they can actually do, and handing someone a sheet of paper that streamlines the process of fixing long lists of hiccups like these is one way to do this sort of thing.
Figuring-out-how-to-make-someone-feel-alive-while-performing-useful-task-X is an optimization problem (see Please Don’t Throw Your Mind Away). It has substantial overlap with measuring whether someone is terminally rigid/narrow-skilled, or if they merely failed to fully understand the topology of the process of finding out what things they can comfortably build interest in. Dumping extant books, 1-on-1s, and documentaries on engineers sometimes works, but it comes from an old norm and is grossly inefficient and uninspired compared to what Anthropic’s policy team is actually capable of. For example, imagine putting together a really good fanfic where HPJEV/Keltham is an Anthropic employee on your team doing everything I’ve described here and much more, then printing it out and handing it to people that you in-reality already predicted to have world modelling aptitude; given that it works great and goes really well, I consider that the baseline for what something would look like if sufficiently optimized and novel to be considered par.
Hi! I’m a first-time poster here, but a (decently) long time thinker on earth. Here are some relevant directions that currently lack their due attention.
~ Multi-modal latent reasoning & scheming (and scheming derivatives) is an area that not only seems to need more research, but also more spread of awareness on the topic. Human thinking works in a hyperspace of thoughts, many of which go beyond language. It seems possible that AIs might develop forms of reasoning that are harder for us to detect through purely language-based safety measures.
~ Multi-model interactions and the potential emergence of side communication channels is also something that I’d like to see more work put into. How corruptible can models be when interacting with corrupted models is a topic that I didn’t yet see much work on. Applying some group-dynamics on scheming seems worth pursuing & Anthropic seems best suited for that.
~ If a pre-AGI model has intent to become AGI+, how much can it orchestrate its path to AGI+ through its interactions with humans?
I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I’m getting the vibe that this is not thought about enough.
Are you aware that Anthropic has an AI welfare lead?
Yes
Anthropic should have a clear policy about exceptions they make to their terms of use that includes them publically releasing a list of each expectation they make for their terms of use.
The should have mechanisms to catch API users who try to use Antrophics models in a violation of the terms of use. This includes having contracts that allow them to make sure that classified programs don’t violate the agreed upon terms of use for the models.
I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause down due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that imply decisive competitive advantage
2: Have a written list of assumptions you aim to maintain for each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Updates and violations are communicated at least internally.
These lists could vary based on ASL-levels etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We solicit third-party evaluations on the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
They could also have conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”). C.f. safety cases. I intend this as less binding and formal than Anthropic’s RSP.
3: Keep external actors up-to-speed. At present, I expect that in many cases there are months of delay between when the first employees discover something to when it is publicly known (e.g. research, but also with more informal observations about model capabilities and properties). But months of delay are relatively long during fast acceleration of AI R&D, and make the number of actors who can effectively contribute smaller.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Interviews, incident reporting and hotlines with external parties (as recommended here: https://arxiv.org/pdf/2407.17347)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress in those would make international coordination easier. This seems like a particularly valuable R&D area to automate, which is something frontier AI companies like Anthropic are uniquely fit to advance. Consider working with technical governance experts on how to go about this.
Thank you for doing this! Here are some suggestions:
Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here.
What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if you are not caught or can’t be stopped. So 2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM’s response to questions like “what do you value?” and “what do you want?” and “if you were creating a smarter agent, what values would you instill in them?” and “how do you feel about being replaced with a smarter agent?” change? Is there some set of purposes, not explicitly set out in the constitution, that appear over and over in chains of thought? If it is trained to achieve goals without as much training for corrigibility (that is under less constitutional training), does it develop more of a sense of purpose? 3. Enhance techniques, like the one used in the last paper to convince the model that it is recording it’s thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things. 4. Test how “brain surgery” affects thinking models. If you change the capital of Germany to Paris by fiddling with the model’s weights, does a thinking model who is then asked about world war 2 realize it is confused, and conclude there must be some mistake in it’s memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn’t being watched when it is and vice versa. These convictions are useful in testing what the model wants when it thinks no one is watching, and for safe deployment respectively. 5. One way to test how a really smart model behaves when it can get away with stuff is to have the model be super smart relative to a simulated environment. This would not be a test relevant for a model trained on real world data with lots of tells for when it is in a simulated environment. It would require (1) a simulated toy world that can be acted in, (2) in-world markers of whether the agent is being watched, (3) rewards for achieving goals without taking forbidden actions when being watched. It should then be possible to make the model really good at manipulating it’s environment (the equivalent of being superintelligent) and check whether it is possible to train it on scenarios that are watched (the equivalent of being trained where you can’t get away with wrongdoing), to refrain from forbidden actions even when unwatched.
1. Introduce third-party mission alignment red teaming.
Anthropic should invite external parties to scrutinize and criticize Anthropic’s instrumental policy and specific actions based on whether they are actually advancing Anthropic’s stated mission, i.e. safe, powerful, and beneficial AI.
Tentatively, red-teaming parties might include other AI labs (adjusted for conflict of interest in some way?), as well as AI safety/alignment/risk-mitigation orgs: MIRI, Conjecture, ControlAI, PauseAI, CEST, CHT, METR, Apollo, CeSIA, ARIA, AI Safety Institutes, Convergence Analysis, CARMA, ACS, CAIS, CHAI, &c.
For the sake of clarity, each red team should provide a brief on their background views (something similar to MIRI’s Four Background Claims).
Along with their criticisms, red teams would be encouraged to propose somewhat specific changes, possibly ordered by magnitude, with something like “allocate marginally more funding to this” being a small change and “pause AGI development completely” being a very big change. Ideally, they should avoid making suggestions that include the possibility of making a small improvement now that would block a big improvement later (or make it more difficult).
Since Dario seems to be very interested in “race to the top” dynamics: if this mission alignment red-teaming program successfully signals well about Anthropic, other labs should catch up and start competing more intensely to be evaluated as positively as possible by third parties (“race towards safety”?).
It would also be good to have a platform where red teams can converse with Anthropic, as well as with each other, and the logs of their back-and-forth are published to be viewed by the public.
Anthropic should commit to taking these criticisms seriously. In particular, given how large the stakes are, they should commit to taking something like “many parties believe that Anthropic in its current form might be net-negative, even increasing the risk of extinction from AI” as a reason to pause or slow down, even if that’s contrary to their inside view.
2. Anthropic should make an explicit statement about its infohazard policy.
This statement should include how Anthropic thinks about and how it handles doing and publishing research that advances AGI development and doesn’t benefit safety/alignment/x-risk reduction to an extent sufficient to offset its contribution to (likely unsafe by default) AGI development.
I wish this was posted as a question, ideally by you together with other Anthropic people, including Dario.
Thanks for asking the question!
Some things I’d especially like to see change (in as much as I know what is happening) are:
Making more use of available options to improve AI safety (I think there are more than I get the impression that Anthropic thinks. For instance, 30% of funds could be allocated to AI safety research if framed well and it would probably be below the noise threshold/froth of VC investing. Also, there probably is a fair degree of freedom in socially promoting concern around unaligned AGI.)
Explicit ways to handle various types of events like organizational value drift, hostile government takeover, organization get’s sold or unaligned investors have control, another AGI company takes a clear lead
Enforceable agreements to, under some AGI safety situations, not race and pool resources (a possible analogy from nuclear safety is having a no first strike policy)
Allocate a significant fraction of resources (like > 10% of capital) to AGI technical safety, organizational AGI safety strategy, and AGI governance
An organization consists of its people and great care needs to be taken in hiring employees and and their training and motivation for AGI safety. If not, I expect Anthropic to regress towards the mean (via an eternal September) and we’ll end up with another OpenAI situation where AGI safety culture is gradually lost. I want more work to be done here. (see also “Carefully Bootstrapped Alignment” is organizationally hard)
The owners of a company are also very important and ensuring that the LTBT has teeth and the members are selected well is key. Furthermore, preferential allocation of voting stock towards AGI algned investors should happen. Teaching investors about the company and what it does, including AGI safety issues, would be good to do. More speculatively, you can have various types of voting stock for various types of issues and you could build a system around this.
More generally you can use the following typology to inspire creating more interventions.
Interventions points to change/form an AGI company and its surroundings towards safer x-risk results (I’ve used this in advising startups on AI safety, it is also related to my post on positions where people can be in the loop):
Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government
Rules of organization, event triggers:
Rules:
x-risk mission statement
x-risk strategic plan
Triggering events:
Gets very big: windfall clause
Gets sold to another party: ethics board, restrictions on potential sale
Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a monitoring system, some sort of line in the sand
AI safety isn’t viable yet but dangerous AGI is: shut it down or pivot to sub AGI research and product development
Hostile government tries to take it over: shut it down, change countries, (see also: Soft Nationalization: How the US Government Will Control AI Labs)
Path decisions for organization: ethics board, aligned investors, good CEOs, giving x-risk orgs or people choice power, voting stock to aligned investors, periodic x-risk safety reminders
Resource allocation by organization: precommitting a varying percentage of money/time focused on x-risk reduction based on conditions with some up front, a commitment devices for funding allocation into the future
Owners of organization: aligned investors, voting stock for aligned investors, necessary percentage as aligned investors
Executive decision making: good CEOs, company mission statement?, company strategic plan?
Employees: select employees preferably by alignment, have only aligned people hire folks
Education of employees and/or investors by x-risk folks: employee training in x-risks and information hazards, a company culture that takes doing good seriously, coaching and therapy services
Social environment of employees: exposure to EAs and x-risk people socially at events, x-risk community support grants, a public pledge
Customers of organization: safety score for customers, differential pricing, customers have safety plans and information hazard plans
Uses of the technology: terms of service
Suppliers of organization: (mostly not relevant), select ethical or aligned suppliers
Difficulty to steal or copy: trade secrets, patents, service based, NDAs, (physical security)
Internal political hazards: (standard)
Information hazards: an institutional framework for research groups (FHI has a draft document)
Cyber hazards: (standard IT)
Financial hazards: (standard finances)
External political hazards: government industry partnerships, talk with x-risk folks about this, external x-risk outreach
Monitoring by x-risk folks: quarterly reports to x-risk organizations,
Projection by x-risk folks: commissioned projections, x-risk prediction market questions
Meta research and x-risk research: AI safety team, AI safety grants, meet up on organization safety at X-risk orgs, (x-risk strategy, AI safety strategy) – team and grants, information hazard grant question, go through these ideas in a check list fashion and allocate company computer folders to them (and they will get filled up), scalable and efficient grant giving system, form an accelerator, competitions, hackathon, BERI type project support
Coordination hazards: Incentivized coordination through cheap resources for joint projects, government industry partnerships, coordination theory and implementation grants, concrete coordination efforts, joint ethics boards, mergers with other groups to reduce arms race risks
Specific safety procedures: (depends on the project)
Jurisdiction: Choosing a good legal jurisdiction
This is mostly a gut reaction, but the only raised eyebrow Claude ever got from me was due to it’s unwillingness to do anything that is related to political correctness. I wanted it to search the name of a meme format for me, the all whites are racist tinder meme, with the brown guy who wanted to find a white dominatrix from tinder and is disappointed when she apologises for her ancestral crimes of being white.
Claude really did not like this at all. As soon as Claude got into it’s head that it was doing a racism, or cooperated in one, it shut down completely.
Now, there is an argument that people make, that this is actually good for AI safety, that we can use political correctness as a proxy for alignment and AI safety, that if we could get AIs to never ever even take the risk of being complicit in anything racist, we could also build AIs that never ever even take the risk of doing anything that wiped out humanity. I personally see that different.
There is a certain strain of very related thought, that kinda goes from intersectionalism, and grievance politics, and ends at the point that humans are a net negative to humanity, and should be eradicated. It is how you get that one viral Gemini AI thing, which is a very politically left wing AI, and suddenly openly advocates for the eradication of humanity. I think drilling identity politics into AI too hard is generally a bad idea. But it opens up a more fundamental philosophical dilemma.
What happens if the operator is convinced that the moral framework the AI is aligned with is wrong and harmfull, and the creator of the AI thinks the opposite? One of them has to be right, the other has to be wrong. I have no real answer to this in the abstract, I am just annoyed that even the largely politically agnostic Claude refused the service for one of it’s most convenient uses (it is really hard to find out the name of a meme format if you only remember the picture).
But I got an intuition, and with Slavoj Zizec who calls political correctness a more dangerous form of totalitarianism a few intellectual allies, that particularily PC culture is a fairly bad thing to train AIs on, and to align them with for safety testing reasons.
COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.
I got a question about Anthropic’s partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it’s reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn’t come as a surprise to me at all. Second, my personal take would be that I think it’s actually good that Anthropic is doing this. If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy. I do think there are some lines that you’d want to think about very carefully before considering crossing, but using Claude for intelligence analysis seems definitely fine to me. Ezra Klein has a great article on “The Problem With Everything-Bagel Liberalism” and I sometimes worry about Everything-Bagel AI Safety where e.g. it’s not enough to just focus on catastrophic risks, you also have to prevent any way that the government could possibly misuse your models. I think it’s important to keep your eye on the ball and not become too susceptible to an Everything-Bagel failure mode.
FWIW, as a common critic of Anthropic, I think I agree with this. I am a bit worried about engaging with the DoD being bad for Anthropic’s epistemics and ability to be held accountable by the government and public, but I think the basics of engaging on defense issues seems fine to me, and I don’t think risks from AI route basically at all through AI being used for building military technology, or intelligence analysis.
I would guess it does somewhat exacerbate risk. I think it’s unlikely (~15%) that alignment is easy enough that prosaic techniques even could suffice, but in those worlds I expect things go well mostly because the behavior of powerful models is non-trivially influenced/constrained by their training. In which case I do expect there’s more room for things to go wrong, the more that training is for lethality/adversariality.
Given the state of atheoretical confusion about alignment, I feel wary of confidently dismissing these sorts of basic, obvious-at-first-glance arguments about risk—like e.g., “all else equal, probably we should expect more killing people-type problems from models trained to kill people”—without decently strong countervailing arguments.
I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. But boosting US security is good so the net effect is unclear. And that’s very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses — my guess is this is too strong on weapons).
I think usage policies should not be read as commitments, and so I think it would be reasonable to expect that Anthropic will allow weapon development if it becomes highly profitable (and in contrast to other things Anthropic has promised, to not be interpreted as a broken promise when they do so).
If you are in any way involved in this project, please remember you may end up with the blood of millions of people on your hands. You will erode the moral inhibitions people in San Francisco have against building this sort of thing, and eventually SF will ship the best surveillance tools to dictators worldwide.
This is not hyperbole, this sort of thing has already happened. Zuckerberg basically ignored the genocide in Myanmar which his app enabled because maintaining his image of political neutrality is more important to him. Saudi Arabia has already executed people for social media posts found using tools written by western software developers.
Sure, xrisk may be more important than genocide, but please remember you will need to sleep at night knowing what you’ve done and you may not have any motivation to work on xrisk after this.
Another potential benefit of this is that Anthropic might get more experience deploying their models in high-security environments.
Nothing in that announcement suggests that this is limited to intelligence analysis.
U.S. intelligence and defense agencies do run misinformation campaigns such as the antivaxx campaign in the Philippines, and everything that’s public suggests that there’s not a block to using Claude offensively in that fashion.
If Anthropic has gotten promises that Claude is not being used offensively under this agreement they should be public about those promises and the mechanisms that regulate the use of Claude by U.S. intelligence and defense agencies.
COI: I work at Anthropic
I confirmed internally (which felt personally important for me to do) that our partnership with Palantir is still subject to the same terms outlined in the June post “Expanding Access to Claude for Government”:
The contractual exceptions are explained here (very short, easy to read): https://support.anthropic.com/en/articles/9528712-exceptions-to-our-usage-policy
The core of that page is as follows, emphasis added by me:
This is all public (in Anthropic’s up-to-date support.anthropic.com portal). Additionally it was announced when Anthropic first announced its intentions and approach around government in June.
The United States has laws that prevent the US intelligence and defense agencies from spying on their own population. The Snowden revelations showed us that the US intelligence and defense agencies did not abide by those limits.
Facebook has a usage policy that forbids running misinformation campaigns on their platform. That did not stop US intelligence and defense agencies from running disinformation campaigns on their platform.
Instead of just trusting contracts, Antrophics could add oversight mechanisms, so that a few Antrophics employees can look over how the models are used in practice and whether they are used within the bounds that Antrophics expects them to be used in.
If all usage of the models is classified and out of reach of checking by Antrophics employees, there’s no good reason to expect the contract to be limiting US intelligence and defense agencies if those find it important to use the models outside of how Antrophics expects them to be used.
This sounds to me like a very carefully worded nondenail denail.
If you say that one example of how you can break your terms is to allow a select government entity to do foreign intelligence analysis in accordance with applicable law and not do disinformation campaigns, you are not denying that another example of how you could do expectations is to allow disinformation campaigns.
If Antrophics would be sincere in this being the only expectation that’s made, it would be easy to add a promise to Exceptions to our Usage Policy, that Anthropic will publish all expectations that they make for the sake of transparency.
Don’t forget, that probably only a tiny number of Anthropic employees have seen the actual contracts and there’s a good chance that those are build by classification from talking with other Anthropics employees about what’s in the contracts.
At Antrophics you are a bunch of people who are supposed to think about AI safety and alignment in general. You could think of this as a testcase of how to design mechanisms for alignment and the Exceptions to our Usage Policy seems like a complete failure in that regard, because it neither contains mechanism to make all expectations public nor any mechanisms to make sure that the policies are followed in practice.
I’m kind of against it. There’s a line and I draw it there, it’s just too much power waiting to fall in a bad actor’s hand...
I likely agree that anthropic-><-palantir is good, but i disagree about blocking hte US government out of AI being a viable strategy. It seems to me like many military projects get blocked by inefficient beaurocracy, and it seems plausible to me for some legacy government contractors to get exclusive deals that delay US military ai projects for 2+ years
Building in california is bad for congresspeople! better to build across all 50 states like United Launch Alliance
I think people opposing this have a belief that the counterfactual is “USG doesn’t have LLMs” instead of “USG spins up its own LLM development effort using the NSA’s no-doubt-substantial GPU clusters”.
Needless to say, I think the latter is far more likely.
NSA building it is arguably better because atleast they won’t sell it to countries like Saudi Arabia, and they have better ability to prevent people quitting or diffusing knowledge and code to companies outside.
Also most people in SF agree working for the NSA is morally grey at best, and Anthropic won’t be telling everyone this is morally okay.
I don’t have much trouble with you working with the US military. I’m more worried about the ties to Peter Thiel.
This explanation seems overly convenient.
When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won’t risk losing your job.
How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence “seemed fine”?
What future events would make you re-evaluate your position and state that the partnership was a bad thing?
Example:
-- A pro-US despot rounds up and tortures to death tens of thousands of pro-union activists and their families. Claude was used to analyse social media and mobile data, building a list of people sympathetic to the union movement, which the US then gave to their ally.
EDIT: The first two sentences were overly confrontational, but I do think either question warrants an answer.
As a highly respected community member and prominent AI safety researchers, your stated beliefs and justifications will be influential to a wide range of people.
Personally, I think that overall it’s good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all, so I think (what I read as) your implicit demand for Evan Hubinger to do more work here is marginally unhelpful; I weakly think quick takes like this are marginally good.
I will add: It’s odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm. To my knowledge the Anthropic leadership has ~never engaged in public dialogue about why they’re getting rich building potentially-omnicidal-minds with worthy critics like Hinton, Bengio, Russell, Yudkowsky, etc, so I wouldn’t expect them or their employees to have high standards for public defenses of far less risky behavior like working with the US military.[1]
As an example of the low standards for Anthropic’s public discourse, notice how a recent essay about what’s required for Anthropic to succeed at AI Safety by Sam Bowman (a senior safety researcher at Anthropic) flatly states “Our ability to do our safety work depends in large part on our access to frontier technology… staying close to the frontier is perhaps our top priority in Chapter 1” with ~no defense of this claim or engagement with the sorts of reasons that I consider adding a marginal competitor to the suicide race is an atrocity, or acknowledgement that this makes him personally very wealthy (i.e. he and most other engineers at Anthropic will make millions of dollars due to Anthropic acting on this claim).
No disagreement.
The community seems to be quite receptive to the opinion, it doesn’t seem unreasonable to voice an objection. If you’re saying it is primarily the way I’ve written it that makes it unhelpful, that seems fair.
I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm.
However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn’t occur to me earlier.
Ultimately, the original quicktake was only justifying one facet of Anthropic’s work so that’s all I’ve engaged with. It would seem less helpful to bring up my wider objections.
I don’t expect them to have a high standard for defending Anthropic’s behavior, but I do expect the LessWrong community to have a high standard for arguments.
Thanks for the responses, I have a better sense of how you’re thinking about these things.
I don’t feel much desire to dive into this further, except I want to clarify one thing, on the question of any demands in your comment.
That actually wasn’t primarily the part that felt like a demand to me. This was the part:
I’m not quite sure what the relevance of the time was if not to suggest it needed to be high. I felt that this line implied something like “If your answer is around ’20 hours’, then I want to say that the correct standard should be ‘200 hours’”. I felt like it was a demand that Hubinger may have to spend 10x the time thinking about this question before he met your standards for being allowed to express his opinion on it.
But perhaps you just meant you wanted him to include an epistemic status, like “Epistemic status: <Here’s how much time I’ve spent thinking about this question>”.
We live in an information society → “You” are trying to build the ultimate dual use information tool/thing/weapon → The government require your service. No news there. So why the need to whitewash this? What about this is actually bothering you?
This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
What would be necessary to build a good auditing game benchmark?
How would AI safety AI work? What is necessary for it to go well?
How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
How has transparency changed over time—Chris claims it’s easier to interpret later models; is that true?
Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like William’s syndrome?
I’ll continue to include more directions like this in the comments here.
I’d just make this a top level post.
I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don’t expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.
Mmm, maybe you’re right (I was gonna say “making a top-level post which includes ‘chat with me about this if you actually wanna work on one of these’”, but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of ‘worth your time’ filter)
Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don’t know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won’t be), you get at most one extra bit of optimization towards alignment.
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
A deceptive model doesn’t have to have some sort of very explicit check for whether it’s in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it’s in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn’t really think about it very often because during training it just looks too unlikely.
Humans don’t wirehead because reward reinforces the thoughts which the brain’s credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized—reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like “there’s a lollipop in front of me” and “I’m picking it up”), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human’s beliefs about reality or about the activation of the reward system.
It seems that you’re postulating that the human brain’s credit assignment algorithm is so bad that it can’t tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
No, I don’t intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan’s Comment:
My read of Evan’s comment (the parent to yours) is that there are a bunch of learned high-level-goals (“strategies”) with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection (“thoughts directly related to the current action” or “tactics”) all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that “strategic cognition” is that which chooses bundles of context-dependent tactical policies, and “tactical cognition” is that which implements a given tactic’s choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan’s comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: “Feel physically great,” and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like “alone in my room, take the most fun available drug” which takes actions that result in Xw physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like “alone in my room, abstain from dissociative drugs and instead text my favorite friend” taking actions which result in Xs physical pleasure over a day.
Suppose also that asocial cognitions like “eat this” have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit a fraction of the time that it’s due, the average reward (over 30 days, say) credited to WIRE is <<Xw∗30. If and only if Xw>>Xs, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
Thanks for the story! I may comment more on it later.
That seems to imply that humans would continue to wirehead conditional on that they started wireheading.
Yes, I think they indeed would.
About the following point:
“Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.”
Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.
Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.
It is unclear how this would generalize to artificial systems. We don’t know if, or in what sense they would have experience, and why that would even matter in the first place. But I don’t think we can confidently say that something computationally equivalent to “valuing experience”, won’t be going on in artificial systems we are going to build.
So somebody picking this point would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer.
In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.
I have a related draft on this.
Here’s a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we’re giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)
Important disanalogies seem:
1) Most humans aren’t good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn’t the human’s only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I’d be a little concerned that people initially buy it, then think for themselves and conclude “But if we design a really clever niceness test, then it’d almost always work—all we need is clever people to work for a while on some good tests”.
Combined with (3), this might seem like a decent solution.
Overall, I think what’s missing is that we’d expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn’t going to have this intuition in the human-niceness-test case.
I expect this is very susceptible to opinions about human nature. To someone who thinks humans ARE generally nice, they are likely to answer “yes, of course” to your question. To someone who thinks humans are generally extremely context-sensitive, which appears to be nice in the co-evolved social settings in which we generally interact, the answer is “who knows?”. But the latter group doesn’t need to be convinced, we’re already worried.
Surely nobody thinks that all humans are nice all the time and nobody would ever fake a niceness exam. I mean, I think humans are generally pretty good, but obviously that always has to come with a bunch of caveats because you don’t have to look very far into human history to see quite a lot of human-committed atrocities.
I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I’d still say yes in most cases, primarily because I think that aligned behavior is very simple, so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low level details, and thus we only need to transfer 300-1000 bits maximum, which is likely very easy.
Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the set of all long-term objectives, and can be easier or harder, but I don’t buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.
I think that sticking to capitalism as an economic system post-singularity would be pretty clearly catastrophic and something to strongly avoid, despite capitalism working pretty well today. I’ve talked about this a bit previously here, but some more notes on why:
Currently, our society requires the labor of sentient beings to produce goods and services. Capitalism incentivizes that labor by providing a claim on society’s overall production in exchange for it. If the labor of sentient beings becomes largely superfluous as an economic input, however, then having a system that effectively incentivizes that labor also becomes largely superfluous.
Currently, we rely on the mechanism of price discovery to aggregate and disseminate information about the optimal allocation of society’s resources. But it’s far from an optimal mechanism for allocating resources, and a superintelligence with full visibility and control could do a much better job of resource allocation without falling prey to common pitfalls of the price mechanism such as externalities.
Capitalism incentivizes the smart allocation of capital in the same way it incentivizes labor. If society can make smart capital allocation decisions without relying on properly incentivized investors, however, then as with labor there’s no reason to keep such an incentive mechanism.
While very large, the total optimization pressure humanity puts into economic competition today would likely pale in comparison to that of a post-singularity future. In the context of such a large increase in optimization pressure, we should generally expect extremal Goodhart failures.
More specifically, competitive dynamics incentivize the reinvestment of all economic proceeds back into resource acquisition lest you be outcompeted by another entity doing so. Such a dynamic results in pushing out actors that reinvest proceeds into the flourishing of sentient beings in exchange for those that disregard any such investment in favor of more resource acquisition.
Furthermore, the proceeds of post-singularity economic expansion flowing to the owners of existing capital is very far from socially optimal. It strongly disfavors future generations, simulated humans, and overall introduces a huge amount of variance into whether we end up with a positive future, putting a substantial amount of control into a set of people whose consumption decisions need not align with the socially optimal allocation.
Capitalism is a complex system with many moving parts, some of which are sometimes assumed to consist of the entirety of what defines it. What kinds of components do you see as being highly unlikely to be included in a successful utopia, and what components could be internal to a well functioning system as long as (potentially-left-unspecified) conditions are met? I could name some kinds of components (eg some kinds of contracts or enforcement mechanisms) that I expect to not be used in a utopia, though I suspect at this point you’ve seen my comments where I get into this, so I’m more interested in what you say without that prompting.
Who’s this “we” you’re talking about? It doesn’t seem to be any actual humans I recognize. As far as I can tell, the basics of capitalism (call it “simple capitalism”) are just what happens when individuals make decisions about resource use. We call it “ownership”, but really any form of resolution of the underlying conflict of preferences would likely work out similarly. That conflict is that humans have unbounded desires, and resources are bounded.
The drive to make goods and services for each other, in pursuit of selfish wants, does incentivize labor, but it’s not because “society requires” it, except in a pretty blurry aggregation of individual “requires”. Price discovery is only half of what market transactions do. The other half is usage limits and resource distribution. These are sides of a coin, and can’t be separated—without limited amounts, there is no price, without prices there is no agent-driven exchange of different kinds of resource.
I’m with you that modern capitalism is pretty unpleasant due to optimization pressure, and due to the easy aggregation of far more people and resources than historically possible, and than human culture was evolved around. I don’t see how the underlying system has any alternative that doesn’t do away with individual desire and consumption. Especially the relative/comparative consumption that seems to drive a LOT of perceived-luxury requirements.
I think some version of distributing intergalactic property rights uniformly (e.g. among humans existing in 2023) combined with normal capitalism isn’t clearly that catastrophic. (The distribution is what you call the egalitarian/democratic solution in the link.)
Maybe you lose about a factor of 5 or 10 over the literally optimal approach from my perspective (but maybe this analysis is tricky due to two envelope problems).
(You probably also need some short term protections to avoid shakedowns etc.)
Are you pessimstic that people will bother reflecting or thinking carefully prior to determing resource utilization or selling their property? I guess I feel like 10% of people being somewhat thoughtful matches the rough current distribution of extremely rich people.
If the situation was something like “current people, weighted by wealth, deliberate for a while on what to do with our resources” then I agree that’s probably like 5 − 10 times worse than the best approach (which is still a huge haircut) but not clearly catastrophic. But it’s not clear to me that’s what the default outcome of competitive dynamics would look like—sufficiently competitive dynamics could force out altruistic actors if they get outcompeted by non-altruistic actors.
I think one crux between you and I, at least, is that you see this as a considered division of how to divide resources, and I see it as an equilibrium consensus/acceptance of what property rights to enforce in maintenance, creation, and especially transfer of control/usage of resources. You think of static division, I think of equilibria and motion. Both are valid, but experience and resource use is ongoing and it must be accounted for.
I’m happy that the modern world generally approves of self-ownership: a given individual gets to choose what to do (within limits, but it’s nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered). It’s generally considered an alignment failure if individual will is just a resource that the AI manages. Physical resources (and behavioral resources, which are a sale of the results of some human efforts, a distinct resource from the mind performing the action) are generally owned by someone, and they trade some results to get the results of other people’s resources (including their labor and thought-output).
There could be a side-mechanism for some amount of resources just for existing, but it’s unlikely that it can be the primary transfer/allocation mechanism, as long as individuals have independent and conflicting desires. Current valuable self-owned products (office work, software design, etc.) probably reduces in value a lot. If all human output becomes valueless (in the “tradable for other desired things or activities” sense of valuable), I don’t think current humans will continue to exist.
Wirehead utopia (including real-world “all desires fulfilled without effort or trade”) doesn’t sound appealing or workable for what I know of my own and general human psychology.
for most people, this is just the right to sell their body to the machine. better than being forced at gunpoint, but being forced to by an empty fridge is not that much better, especially with monopoly accumulation as the default outcome. I agree that being able to mark ones’ selfhood boundaries with property contracts is generally good, but the ability to massively expand ones’ property contracts to exclude others from resource access is effectively a sort of scalping—sucking up resources so as to participate in an emergent cabal of resource withholding. In other words,
The core argument that there’s something critically wrong with capitalism is that the stock market has been an intelligence aggregation system for a long time and has a strong tendency to suck up the air in the system.
Utopia would need to involve a load balancing system that can prevent sucking-up-the-air type resource control imbalancing, so as to prevent
I think this is a big point of disagreement. For most people, there’s some amount of time/energy that’s sold to the machine, and it’s NOWHERE EVEN CLOSE to selling their actual asset (body and mind). There’s a LOT of leisure time, and a LOT of freedom even within work hours, and the choice to do something different tomorrow. It may not be as rewarding, but it’s available and the ability to make those decisions has not been sold or taken.
yeah like, above a certain level of economic power that’s true, but the overwhelming majority of humans are below that level, and AI is expected to raise that waterline. it’s kind of the primary failure mode I expect.
I mean, the 40 hour work week movement did help a lot. But it was an instance of a large push of organizing to demand constraint on what the aggregate intelligence (which at the time was the stock market—which is a trade market of police-enforceable ownership contracts), could demand of people who were not highly empowered. And it involved leveling a lopsided playing field by things that one side considered dirty tricks, such as strikes. I don’t think that’ll help against AI, to put it lightly.
To be clear, I recognize that your description is accurate for a significant portion of people. But it’s not close to the majority, and movement towards making it the majority has historically demanded changing the enforceable rules in a way that would reliably constrain the aggregate agency of the high dimensional control system steering the economy. When we have a sufficiently much more powerful one of those is when we expect failure, and right now it doesn’t seem to me that there’s any movement on a solution to that. We can talk about “oh we need something better than capitalism” but the problem with the stock market is simply that it’s enforceable prediction, thereby sucking up enough air from the room that a majority of people do not get the benefits you’re describing. If they did, then you’re right, it would be fine!
I mean, also there’s this, but somehow I expect that that won’t stick around long after robots are enough cheaper than humans
I think we’re talking past each other a bit. It’s absolutely true that the vast majority historically and, to a lesser extent, in modern times, are pretty constrained in their choices. This constraint is HIGHLY correlated with distance from participation in voluntary trade (of labor or resources).
I think the disconnect is the word “capitalism”—when you talk about stock markets and price discovery, that says to me you’re thinking of a small part of the system. I fully agree that there are a lot of really unpleasant equilibra with the scale and optimization pressure of the current legible financial world, and I’d love to undo a lot of it. But the underlying concept of enforced and agreed property rights and individual human decisions is important to me, and seems to be the thing that gets destroyed first when people decry capitalism.
Ok, it sounds, even to me, like “The heads. You’re looking at the heads. Sometimes he goes too far. He’s the first one to admit it.” But really, I STRONGLY expect that I am experiencing peak human freedom RIGHT NOW (well, 20 years ago, but it’s been rather flat for me and my cultural peers for a century, even if somewhat declining recently), and capitalism (small-c, individual decisions and striving, backed by financial aggregation with fairly broad participation) has been a huge driver of that. I don’t see any alternatives that preserve the individuality of even a significant subset of humanity.
If property rights to the stars are distributed prior to this, why does this competition cause issues? Maybe you basically agree here, but think it’s unlikely property will be distributed like this.
Separately, for competitive dynamics with reasonable rule of law and alignment ~solved, why do you think the strategy stealing assumption won’t apply? (There are a bunch of possible objections here, just wondering what your’s is. Personally I think strategy stealing is probably fine if the altruistic actors care about the long run and are strategic.)
Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the “if models are doing some bad behavior, the creators should be able to tell us why” point a bunch and it’s in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go “well that sure have everyone a false sense of security”.
Yeah I also felt some vague optimism about that.
If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.
Epistemic status: random philosophical musings.
Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is “in that situation, you have an aligned superintelligence, so just ask it what to do.” But I nevertheless want to philosophize a bit about this, for one main reason.
That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that’s what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.
So, what are some of the possible options for how to distribute resources post-singularity? Let’s go over some of the various possible solutions here and why I don’t think any of the obvious things here are what you want:
The evolutionary/capitalist solution: divide future resources in proportion to control of current resources (e.g. AIs). This is essentially what happens by default if you keep in place an essentially capitalist system and have all the profits generated by your AIs flow to the owners of those AIs. Another version of this is a more power/bargaining-oriented version where you divide resources amongst agents in proportion to the power those agents could bring to bear if they chose to fight for those resources.
The most basic problem with this solution is that it’s a moral catastrophe if the people that get all the resources don’t do good things with them. We should not want to build AIs that lead to this outcome—and I wouldn’t really call AIs that created this outcome aligned.
Another more subtle problem with this solution is that it creates terrible incentives for current people if they expect this to be what happens, since it e.g. incentivizes people to maximize their personal control over AIs at the expense of spending more resources trying to align those AIs.
I feel like I see this sort of thinking a lot and I think that if we were to make it more clear that this is never what should happen in a positive singularity that then people would do this sort of thing less.
The egalitarian/democratic solution: divide resources equally amongst all current humans. This is what naive preference utilitarianism would do.
Though it might be less obvious than with the first solution, I think this solution also leads to a moral catastrophe, since it cements current people as oligarchs over future people, leads to value lock-in, and could create a sort of tyranny of the present.
This solution also creates some weird incentives for trying to spread your ideals as widely as possible and to create as many people as possible that share your ideals.
The unilateralist/sovereign/past-agnostic/CEV solution: concentrate all resources under the control of your aligned AI(s), then distribute those resources in accordance with how they generate the most utility/value/happiness/goodness/etc., without any special prejudice given to existing people.
In some sense, this is the “right” thing to do, and it’s pretty close to what I would ideally want. However, it has a couple of issues:
Though, unlike the first solution, it doesn’t create any perverse incentives right now, it doesn’t create any positive incentives either.
Since this solution doesn’t give any special prejudice to current people, it might be difficult to get current people to agree to this solution, if that’s necessary.
The retroactive impact certificate solution: divide future resources in proportion to retroactively-assessed social value created by past agents.
This solution obviously creates the best incentives for current agents, so in that sense it does very well.
However, it still does pretty poorly on potentially creating a moral catastrophe, since the people that created the most social value in the past need not continue doing so in the future.
As above, I don’t think that you should want your aligned AI to implement any of these particular solutions. I think some combination of (3) and (4) is probably the best out of these options, though of course I’m sure that if you actually asked an aligned superintelligent AI it would do better than any of these. More broadly, though, I think that it’s important to note that (1), (2), and (4) are all failure stories, not success stories, and you shouldn’t expect them to happen in any scenario where we get alignment right.
Circling back to the original reason that I wanted to discuss all of this, which is how it should influence our decisions now:
Obviously, the part of your values that isn’t selfish should continue to want things to go well.
However, for the part of your values that cares about your own future resources, if that’s something that you care about, how you go about maximizing that is going to depend on what you most expect between (1), (2), and (4).
First, in determining this, you should condition on situations where you don’t just die or are otherwise totally disempowered, since obviously those are the only cases where this matters. And if that probability is quite high, then presumably a lot of your selfish values should just want to minimize that probability.
However, going ahead anyway and conditioning on everyone not being dead/disempowered, what should you expect? I think that (1) and (2) are possible in worlds where get some parts of alignment right, but overall are pretty unlikely: it’s a very narrow band of not-quite-alignment that gets you there. So probably if I cared about this a lot I’d focus more on (4) than (1) and (2).
Which of course gets me to why I’m writing this up, since that seems like a good message for people to pick up. Though I expect it to be quite difficult to effectively communicate this very broadly.
Disagree. I’m in favor of (2) because I think that what you call a “tyranny of the present” makes perfect sense. Why would the people of the present not maximize their utility functions, given that it’s the rational thing for them to do by definition of “utility function”? “Because utilitarianism” is a nonsensical answer IMO. I’m not a utilitarian. If you’re a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.
As to problem (2.b), I don’t think it’s a serious issue in practice because time until singularity is too short for it to matter much. If it was, we could still agree on a cooperative strategy that avoids a wasteful race between present people.
Even if you don’t personally value other people, if you’re willing to step behind the veil of ignorance with respect to whether you’ll be an early person or a late person, it’s clearly advantageous before you know which one you’ll be to not allocate all the resources to the early people.
First, I said I’m not a utilitarian, I didn’t say that I don’t value other people. There’s a big difference!
Second, I’m not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue “you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning”. But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1].
Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn’t compel me to precommit to a symmetric agreement with future people in advance.
There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.
If a future decision is to shape the present, we need to predict it.
The decision-theoretic strategy “Figure out where you are, then act accordingly.” is merely an approximation to “Use the policy that leads to the multiverse you prefer.”. You *can* bring your present loyalties with you behind the veil, it might just start to feel farcically Goodhartish at some point.
There are of course no probabilities of being born into one position or another, there are only various avatars through which your decisions affect the multiverse. The closest thing to probabilities you’ll find is how much leverage each avatar offers: The least wrong probabilistic anthropics translates “the effect of your decisions through avatar A is twice as important as through avatar B” into “you are twice as likely to be A as B”.
So if we need probabilities of being born early vs. late, we can compare their leverage. We find:
Quantum physics shows that the timeline splits a bazillion times a second. So each second, you become a bazillion yous, but the portions of the multiverse you could first-order impact are divided among them. Therefore, you aren’t significantly more or less likely to find yourself a second earlier or later.
Astronomy shows that there’s a mazillion stars up there. So we build a Dyson sphere and huge artificial womb clusters, and one generation later we launch one colony ship at each star. But in that generation, the fate of the universe becomes a lot more certain, so we should expect to find ourselves before that point, not after.
Physics shows that several constants are finely tuned to support organized matter. We can infer that elsewhere, they aren’t. Since you’d think that there are other, less precarious arrangements of physical law with complex consequences, we can also moderately update towards that very precariousness granting us unusual leverage about something valuable in the acausal marketplace.
History shows that we got lucky during the Cold War. We can slightly update towards:
Current events are important.
Current events are more likely after a Cold War.
Nuclear winter would settle the universe’s fate.
The news show that ours is the era of inadequate AI alignment theory. We can moderately update towards being in a position to affect that.
When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.
i feel like (2)/(3) is about “what does (the altruistic part of) my utility function want?” and 4 is “how do i decision-theoretically maximize said utility function?”. they’re different layers, and ultimately it’s (2)/(3) we want to maximize, but maximizing (2)/(3) entails allocating some of the future lightcore to (4).
A couple of thoughts:
I think that (3) does create strong incentives right now—at least for anyone who assumes [without any special prejudice given to existing people] amounts to [and it’s fine to disassemble everyone who currently exists if it’s the u/v/h/g/etc maximising policy]. This seems probable to me, though not entirely clear (I’m not an optimal configuration, and smoothly, consciousness-preservingly transitioning me to something optimal seems likely to take more resources than unceremoniously recycling me).
Incentives now include:
Prevent (3) happening.
To the extent that you expect (3) and are selfish, live for the pre-(3) time interval, for (3) will bring your doom.
On (4), “This solution obviously creates the best incentives for current agents” seems badly mistaken unless I’m misunderstanding you.
Something in this spirit would need to be based on a notion of [expected social value], not on actual contributions, since in the cases where we die we don’t get to award negative points.
For example, suppose my choice is between:
A: {90% chance doom for everyone; 10% I save the world}
B: {85% chance doom for everyone; 15% someone else saves the world}
To the extent that I’m selfish, and willing to risk some chance of death for greater control over the future, I’m going to pick A under (4).
The more selfish, reckless and power-hungry I am, and the more what I want deviates from that most people want, the more likely I am to actively put myself in position to take an A-like action.
Moreover, if the aim is to get ideal incentives, it seems unavoidable to have symmetry and include punishments rather than only [you don’t get many resources]. Otherwise the incentive is to shoot for huge magnitude of impact, without worrying much about the sign, since no-one can do worse than zero resources.
If correct incentives were the only desideratum, I don’t see how we’d avoid [post-singularity ‘hell’ (with some probability) for those who’re reckless with AGI].
For any nicer approach I think we’d either be incenting huge impact with uncertain sign, or failing to incent large sacrifice in order to save the world.
Perhaps the latter is best??
I.e. cap the max resources for any individual at a fairly low level, so that e.g. [this person was in the top percentile of helpfulness] and [this person saved the world] might get you about the same resource allocation.
It has the upsides both of making ‘hell’ less necessary, and of giving a lower incentive to overconfident people with high-impact schemes. (but still probably incents particularly selfish people to pick A over B)
(some very mild spoilers for yudkowsky’s planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))
[edit: links in spoiler tags are bugged. in the spoiler, “speculates about” should link to here and “have the stance that” to here]
“The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else’s utility function, ever, period. That’s a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate.”
“If at any point you’re calculating how to pessimize a utility function, you’re doing it wrong. If at any point you’re thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you’re doing it wrong.)”
i think this is a pretty solid principle. i’m very much not a fan of anyone’s utility function getting pessimized.
so pessimising a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ units of utility function handshake, and the more damage you do the less handshake you get.
it never gets into the negative, that way we never go out of our way to pessimize someone’s utility function; but it does get increasingly close to 0.
(this isn’t necessarily a scheme i’m committed to, it’s just an idea i’ve had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
Hmmm, I don’t think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it—in which case almost everyone is going to get approximately no influence).
More fundamentally, I don’t think it works out in this kind of case due to logical uncertainty.
If I’m uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that’s not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It’s closer to [there’s an 80% chance that {in ~99% of timelines everyone dies}; there’s a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won’t disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let’s train this model and see what happens] - most of the danger there being a logical uncertainty thing)
I think influence would need to be based on expected social value given the ‘correct’ level of logical uncertainty—probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you’d make for them based on information you have].
Or at least some subjective perspective seems to be necessary—and something that doesn’t give more points for overconfident people.
Here’s a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I’ll just call “happy humans”). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:
You have some returns to happy humans (or whatever else you’re using as your final source of utility) in terms of how much utility you get from some number of happy humans existing.
In most cases, I think those returns are likely to be diminishing, but nevertheless monotonically increasing and differentiable. For example, maybe you have logarithmic returns to happy humans.
We happen to live in a massive multiverse. (Imo the Everett interpretation is settled science, and I don’t think you need to accept anything else to make this go through, but note that we’re only depending on the existence of any sort of big multiverse here—the one that the Everett interpretation gives you is just the only one that we know is guaranteed to actually exist.)
In a massive multiverse, the total number of happy humans is absolutely gigantic (let’s ignore infinite ethics problems, though, and assume it’s finite—though I think this argument still goes through in the infinite case, it just then depends on whatever infinite ethics framework you like).
Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse. (Though you could get out of this by claiming that what you really care about is happy humans per universe, that’s a pretty strange thing to care about—it’s like caring about happy humans per acre.)
As a result, the effective returns to happy humans that you are exposed to within this universe reflect only the local behavior of your overall returns. (Note that this assumes “happy humans” are fungible, which I don’t actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don’t think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.)
As anyone who has taken any introductory calculus will know, the local behavior of any differentiable function is linear.
Since we assumed that your overall returns were differentiable and monotonically increasing, the local returns must be linear with a positive slope.
You’re assuming that your utility function should have the general form of valuing each “source of utility” independently and then aggregating those values (such that when aggregating you no longer need the details of each “source” but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).
This is the fungibility objection I address above:
Ah, I think I didn’t understand that parenthetical remark and skipped over it. Questions:
I thought your bottom line conclusion was “you should have linear returns to whatever your final source of utility is” and I’m not sure how “centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible” relates to that.
I’m not sure that the way my utility function deviates from fungibility is “I care about overall diversity of human experience throughout the multiverse”. What if it’s “I care about diversity of human experience in this Everett branch” then I could get a non-linear diminishing returns effect where as humans colonize more stars or galaxies, each new human experience is more likely to duplicate an existing human experience or be too similar to an existing experience so that its value has to be discounted.
The thing I was trying to say there is that I think the non-fungibility concern pushes in the direction of superlinear rather than sublinear local returns to “happy humans” per universe. (Since concentrating the “happy humans” likely makes it easier to ensure that they’re all different.)
I agree that this will depend on exactly in what way you think your final source of utility is non-fungible. I would argue that “diversity of human experience in this Everett branch” is a pretty silly thing to care about, though. I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.
I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).
I read it, and I think I broadly agree with it, but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that’s an argument for superlinear local returns to happy humans, since it favors concentrating them so that it’s easier to make them as diverse as possible.
I have a stronger intuition for “identical copy immortality” when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you’ll flip a quantum coin and disintegrate Earth if it comes up heads.
I’m not sure if this is actually a correct intuition, but I’m also not sure that it’s not, so I’m not willing to make assumptions that contradict it.
Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.
I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn’t enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.
Maybe? I don’t really know how to reason about this.
If that’s true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far less humans in the universe-at-large, meaning that you’re at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting more sentient beings in worlds where there is overall fewer sentient beings (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.
Reply to the objections before you say that.
I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.
If you want to produce warning shots for deceptive alignment, you’re faced with a basic sequencing question. If the model is capable of reasoning about its training process before it’s capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won’t be detectable.