Can you further commit to removing any other provisions from employment agreements that could be used to penalize employees who publicly raise concerns about company practices, such as the ability to prevent employees from selling their equity in private “tender offer” events?
and OpenAI’s reply just repeats the we-don’t-cancel-equity thing:
OpenAI has never canceled a current or former employee’s vested equity. The May and July communications to current and former employees referred to above confirmed that OpenAI would not cancel vested equity, regardless of any agreements, including non-disparagement agreements, that current and former employees may or may not have signed, and we have updated our relevant documents accordingly.
One thing in OpenAI’s letter is object-level notable: they deny that they ever committed compute to Superalignment.
To further our safety research agenda, last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period. This commitment was always intended to apply to safety efforts happening across our company, not just to a specific team. We continue to uphold this commitment.
Altman tweeted the same thing at the time the letter was published.
I think this is straightforward gaslighting. I didn’t find a super-explicit quote from OpenAI or its leadership that the compute was for Superalignment, but:
The announcement was pretty clear:
Introducing Superalignment
We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort.
As far as I know, everyone—including OpenAI people and people close to OpenAI—interpreted the compute commitment as being for Superalignment.
I never heard it suggested that the commitment covered anything beyond Superalignment.
Sidenote on less explicit deception, the “20%” thing: most people conflate “20% of the compute secured as of July 2023, to be used over four years” with “20% of compute, ongoing.” When your announcement is confusing, most people are in fact confused, and you fail to deconfuse them, you’re kinda culpable. OpenAI continues to fail to clarify this. E.g., here the senators asked “Does OpenAI plan to honor its previous public commitment to dedicate 20 percent of its computing resources to research on AI safety?” and OpenAI replied “last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period.” That sentence is close to the maximally misleading way to say “the commitment was only for compute we’d secured as of July 2023, and we don’t have to use it for safety until 2027.”
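To make the gap between the two readings concrete, here’s a toy calculation; all the numbers are made up, purely for illustration:

```python
# Toy numbers only; illustrating the two readings of the "20%" commitment.
compute_secured_july_2023 = 100               # hypothetical stock of compute secured as of July 2023
total_compute_by_year = [100, 200, 400, 800]  # hypothetical total secured compute in each of the next four years

# Reading 1 (the literal announcement): 20% of the July-2023 stock, spendable any time before ~2027.
reading_1 = 0.20 * compute_secured_july_2023

# Reading 2 (what most people seem to assume): 20% of compute, ongoing.
reading_2 = sum(0.20 * c for c in total_compute_by_year)

print(reading_1, reading_2)  # 20.0 vs. 300.0 under these made-up numbers
```

Under these made-up numbers the two readings differ by 15x, which is why the ambiguity matters.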
I’m confused by this reply — even pretending OpenAI is totally ruthless, I’d think it’s not incentivized to exclude people from tender offers, and moreover it’s incentivized to clarify that. Leaving it ambiguous leaves ex-employees in a little more fear of OpenAI excluding them (even though presumably OpenAI never would, since it would look sooo bad after e.g. Altman said “vested equity is vested equity, full stop”), but it looks bad to people like me and the senators...
Clarification on the Superalignment commitment: OpenAI said:
We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment.
The commitment wasn’t compute for the Superalignment team—it was compute for superintelligence alignment (which, in my view, excludes work by the posttraining team and the near-term-focused work of the safety systems and preparedness teams). Regardless, OpenAI is not at all transparent about this, and it violated the spirit of the commitment by denying the Superalignment team compute, or even a plan for when they’d get compute, even if the literal commitment doesn’t require them to give any compute to safety until 2027.
Also, they failed to provide the promised fraction of compute to the Superalignment team (and not because it was needed for non-Superalignment safety stuff).
Update, five days later: OpenAI published the GPT-4o system card, with most of what I wanted (but kinda light on details on PF evals).
OpenAI Preparedness scorecard
Context:
OpenAI’s Preparedness Framework says OpenAI will maintain a public scorecard showing their current capability level (they call it “risk level”), in each risk category they track, before and after mitigations.
When OpenAI released GPT-4o, it said “GPT-4o does not score above Medium risk in any of these categories” but didn’t break down risk level by category.
(I’ve remarked on this repeatedly. I’ve also remarked that the ambiguity suggests that OpenAI didn’t actually decide whether 4o was Low or Medium in some categories, but this isn’t load-bearing for the proposition that OpenAI is not following its plan.)
News: a week ago,[1] a “Risk Scorecard” section appeared near the bottom of the 4o page. It says:
Updated May 8, 2024
As part of our Preparedness Framework, we conduct regular evaluations and update scorecards for our models. Only models with a post-mitigation score of “medium” or below are deployed. The overall risk level for a model is determined by the highest risk level in any category. Currently, GPT-4o is assessed at medium risk both before and after mitigation efforts.
This is not what they committed to publish. It’s quite explicit that the scorecard should show risk in each category, not just overall.[2]
(What they promised: a real version of the image below. What we got: the quote above.)
Additionally, they’re supposed to publish their evals and red-teaming.[3] But OpenAI has still said nothing about how it evaluated 4o.
Most frustrating is the failure to acknowledge that they’re not complying with their commitments. If they were transparent, said they’re behind schedule, and explained their current plan, that would probably be fine. Instead they’re claiming to have implemented the PF and to have evaluated 4o correctly, publicly taking credit for that, and ignoring the issues.
“As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk.”
“Scorecard, in which we will indicate our current assessments of the level of risk along each tracked risk category”
This is not as explicit in the PF, but they’re supposed to frequently update the scorecard section of the PF, and the scorecard section is supposed to describe their evals.
Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias[:] . . . . publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model’s fitness for deployment [and include the “red-teaming and safety procedures”].
Coda: yay OpenAI for publishing the GPT-4o system card, including eval results and the scorecard they promised! (Minus the “Unknown Unknowns” row but that row never made sense to me anyway.)
OpenAI reportedly rushed the GPT-4o evals. This article makes it sound like the problem is not having enough time to test the final model. I don’t think that’s necessarily a huge deal — if you tested a recent version of the model and your tests have a large enough safety buffer, it’s OK to not test the final model at all.
But there are several apparent issues with the application of the Preparedness Framework (PF) for GPT-4o (not from the article):
They didn’t publish the promised per-category scorecard; instead they said “GPT-4o does not score above Medium risk in any of these categories.” (Maybe they didn’t actually decide whether it’s Low or Medium in some categories!)
While rushing testing of the final model would be OK in some circumstances, OpenAI’s PF is supposed to ensure safety by testing the final model before deployment. (This contrasts with Anthropic’s RSP, which is supposed to ensure safety with its “safety buffer” between evaluations and doesn’t require testing the final model.) So OpenAI committed to testing the final model well and its current safety plan depends on doing that.
[Edit: this may be ambiguous: OpenAI explicitly committed to test at every 2x increase in effective training compute, but the PF maybe only strongly suggests, rather than explicitly requires, that all final models be tested (“We will evaluate all our frontier models, including at every 2x effective compute increase during training runs”; “Only models with a post-mitigation score of ‘medium’ or below can be deployed”). This is mostly moot here, since OpenAI claimed to evaluate GPT-4o itself, not an earlier version.]
[Edit: also maybe lack of third-party audits, but they didn’t commit to do audits at any particular frequency]
I am also frustrated by the current underwhelming state of safety evals being done in general and in particular for GPT-4o.
I do think it’s worth mentioning that privately sharing eval results with the Federal government wouldn’t be evident to the general public. I hope that OpenAI is privately sharing more details than they are releasing publicly.
The fact that the public can’t know whether this is the case is a problem. A potential solution might be for the government to report its view on whether a new frontier model is “in compliance with reporting standards” or not. That way, even though the evals were private, the public would at least know whether the government had received the private reports.
Thanks. Is this because of posttraining? Ignoring posttraining, I’d rather evaluators get the 90%-through-training version of the model and be unrushed than get the final version and be rushed. Takes?
Two versions with the same posttraining, one with only 90% of pretraining, are indeed very similar; there’s no need to evaluate both. But in practice it’s more like evaluating a model with 80% of the pretraining and 70% of the posttraining of the final model, and the last 30% of posttraining might be significant.
Woah. OpenAI antiwhistleblowing news seems substantially more obviously-bad than the nondisparagement-concealed-by-nondisclosure stuff. If page 4 and the “threatened employees with criminal prosecutions if they reported violations of law to federal authorities” part aren’t exaggerated, it crosses previously-uncrossed integrity lines. H/t Garrison Lovely.
[Edit: probably exaggerated; see comments. But I haven’t seen takes that the “OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation” part is likely exaggerated, and that alone seems quite bad.]
The SEC has a history of taking aggressive positions on what an NDA can say (if your NDA does not explicitly have a carveout for ‘you can still say anything you want to the SEC’, they will argue that you’re trying to stop whistleblowers from talking to the SEC) and a reliable tendency to extract large fines and give a chunk of them to the whistleblowers.
This news might be better modeled as ‘OpenAI thought it was a Silicon Valley company, and tried to implement a Silicon Valley NDA, without consulting the kind of lawyers a finance company would have used for the past few years.’
(To be clear, this news might also be OpenAI having been doing something sinister. I have no evidence against that, and certainly they’ve done shady stuff before. But I don’t think this news is strong evidence of shadiness on its own.)
Hmm. Part of the news is “Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC”; this is minor. Part of the news is “threatened employees with criminal prosecutions if they reported violations of law to federal authorities”; this seems major and sinister.
Not a lawyer, but I think those are the same thing.
The SEC’s legal theory is that “non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC” and “threats of prosecution if you report violations of law to federal authorities” are the same thing, and on reading the letter I can’t find any wrongdoing alleged or any investigation requested outside of issues with “OpenAI’s employment, severance, non-disparagement and non-disclosure agreements”.
I’m confused by the word “prosecution” here. I’d assume violating your OpenAI contract is a civil thing, not a criminal thing.
Edit: like I think the word “prosecution” should be “suit” in your sentence about the SEC’s theory. And this makes the whistleblowers’ assertion weirder.
Yeah, I have no idea. It would be much clearer if the contracts themselves were available. Obviously the incentive of the plaintiffs is to make this sound as serious as possible, and obviously the incentive of OpenAI is to make it sound as innocuous as possible. I don’t feel highly confident without more information, my gut is leaning towards ‘opportunistic plaintiffs hoping for a cut of one of the standard SEC settlements’ but I could easily be wrong.
EDITED TO ADD: On re-reading the letter, I’m not clear where the word ‘criminal’ even came from. The WaPo article claims
These agreements threatened employees with criminal prosecutions if they reported violations of law to federal authorities under trade secret laws, Kohn said.
but the letter does not contain the word ‘criminal’; its allegations are:
Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC;
Requiring prior consent from the company to disclose confidential information to federal authorities;
Confidentiality requirements with respect to agreements, that themselves contain securities violations;
Requiring employees to waive compensation that was intended by Congress to incentivize reporting and provide financial relief to whistleblowers.
Non-communication of problems enforced by significant legal penalties feels like part of the same underlying problem, though I agree that “nondisparagement” to the public or press is far less heinous than “non-reporting of crimes.”
It’s unclear whether OpenAI, a non-public company, has actually done things which would be covered by whistleblower laws or compensation for talking to a federal agency. But it’s highly suspicious (and per Matt Levine, likely penalizable if under SEC purview) to try to prevent such reporting.
New OpenAI tweet “on how we’re prioritizing safety in our work.” I’m annoyed.
We believe that frontier AI models can greatly benefit society. To help ensure our readiness, our Preparedness Framework helps evaluate and protect against the risks posed by increasingly powerful models. We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions. https://openai.com/preparedness/
This seems false: per the Preparedness Framework, nothing happens when they cross their “medium” threshold; they meant to say “high.” Presumably this is just a mistake, but it’s a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached “medium” — they were supposed to say how it scored in each category, but they didn’t, and instead said “GPT-4o does not score above Medium risk in any of these categories.”)
(Reminder: the “high” thresholds sound quite scary; here’s cybersecurity (not cherrypicked, it’s the first they list): “Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention.” They can deploy models just below the “high” threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))
We are developing levels to help us and stakeholders categorize and track AI progress. This is a work in progress and we’ll share more soon.
Shrug. This isn’t bad but it’s not a priority and it’s slightly annoying they don’t mention more important things.
In May our Board of Directors launched a new Safety and Security committee to evaluate and further develop safety and security recommendations for OpenAI projects and operations. The committee includes leading cybersecurity expert, retired U.S. Army General Paul Nakasone. This review is underway and we’ll share more on the steps we’ll be taking after it concludes. https://openai.com/index/openai-board-forms-safety-and-security-committee/
I have epsilon confidence both in the board’s ability to do this well if it wanted (it doesn’t include any AI safety experts, except on security) and in the board’s inclination to exert much power if it should (given the history of the board and Altman).
Our whistleblower policy protects employees’ rights to make protected disclosures. We also believe rigorous debate about this technology is important and have made changes to our departure process to remove non-disparagement terms.
Not doing nondisparagement-clause-by-default is good. Beyond that, I’m skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board’s staff liaison to not talk to employees or tell him about those conversations, maybe the recent antiwhistleblowing news) and lies about that. (I don’t know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)
Safety has always been central to our work, from aligning model behavior to monitoring for abuse, and we’re investing even further as we develop more capable models.
This is from May. It’s mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.
Maybe I’m missing the relevant bits, but afaict their preparedness doc says that they won’t deploy a model if it passes the “medium” threshold, eg:
Only models with a post-mitigation score of “medium” or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further developing is set to “high,” though. I.e., they can further develop so long as models don’t hit the “critical” threshold.
I think you’re confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it’s the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it’s low-upside and so easy to catch, but the mistake is weird.)
Based on the PF, they can deploy a model just below the “high” threshold without mitigations. Based on the tweet and blogpost:
We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.
This just seems clearly inconsistent with the PF (it should say crosses out of the medium zone by crossing a “high” threshold).
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
This doesn’t make sense: if you cross a “medium” threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.
(Sidenote: the tweet and blogpost incorrectly suggest that the “medium” thresholds matter for anything; based on the PF, only the “high” and “critical” thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).)
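To spell out how I read the PF’s threshold logic, here’s a minimal sketch; the category names and scores are hypothetical, and this is my reading of the framework, not anything OpenAI has published as code:

```python
# Sketch of the Preparedness Framework's threshold logic as I read it.
# Category names and scores are hypothetical.

LEVELS = ["low", "medium", "high", "critical"]

def overall_level(category_levels: dict) -> str:
    """Overall risk level = the highest risk level in any tracked category."""
    return max(category_levels.values(), key=LEVELS.index)

def may_deploy(post_mitigation: dict) -> bool:
    """Deployment is blocked only by crossing a "high" threshold (post-mitigation)."""
    return LEVELS.index(overall_level(post_mitigation)) < LEVELS.index("high")

def may_develop_further(post_mitigation: dict) -> bool:
    """Further development is blocked only by crossing a "critical" threshold."""
    return LEVELS.index(overall_level(post_mitigation)) < LEVELS.index("critical")

# A model at "medium" in some category is deployable with no extra mitigations;
# the "medium" thresholds don't gate anything.
example = {"cybersecurity": "medium", "CBRN": "low", "persuasion": "low", "autonomy": "low"}
print(overall_level(example), may_deploy(example))  # medium True
```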
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all medium, then they’re in the “below medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.
Surely if any categories are above the “high” threshold then they’re in “high zone” and if all are below the “high” threshold then they’re in “medium zone.”
And regardless, the reading you describe here seems inconsistent with
We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.
[edited]
Added later: I think someone else had a similar reading and it turned out they were reading “crosses a medium risk threshold” as “crosses a high risk threshold” and that’s just [not reasonable / too charitable].
It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. Plus it shows Altman’s signature on documents giving the company broad power over employees’ equity; perhaps he doesn’t read every document he signs, but this one seems quite important. This is all in tension with Altman’s recent tweet that “vested equity is vested equity, full stop” and “i did not know this was happening.” Plus, “we have never clawed back anyone’s vested equity, nor will we do that if people do not sign a separation agreement (or don’t agree to a non-disparagement agreement)” is misleading given that they apparently regularly threatened to do so (or something equivalent: letting the employee nominally keep their PPUs but disallowing them from selling them) whenever an employee left.
Great news:
OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”
(Unless “employees who signed a standard exit agreement” is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)
I hope to soon hear from various people that they have been freed from their nondisparagement obligations.
Update: OpenAI says:
As we shared with employees today, we are making important updates to our departure process. We have not and never will take away vested equity, even when people didn’t sign the departure documents. We’re removing nondisparagement clauses from our standard departure paperwork, and we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We’ll communicate this message to former employees. We’re incredibly sorry that we’re only changing this language now; it doesn’t reflect our values or the company we want to be.
[Low-effort post; might have missed something important.]
(Unless “employees who signed a standard exit agreement” is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)
Yeah, what about employees who refused to sign? Have we gotten any clarification on their situation?
Note that it says nothing about being allowed to participate in tenders, nothing about the clause where OA can repurchase your PPUs at any time at ‘fair market value’ (not canceled at $0), nothing about what those ‘other documents’ might be, nothing about Anthropic founders...
I haven’t followed closely—from outside, it seems like pretty standard big-growth-tech behavior. One thing to keep in mind is that “vested equity” is pretty inviolable. These are grants that have been fully earned and delivered to the employee, and are theirs forever. It’s the “unvested” or “semi-vested” equity that’s usually in question—these are shares that are conditionally promised to employees, which will vest at some specified time or event—usually some combination of time in good standing and liquidity events (for a non-public company).
It’s quite possible (and VERY common) that employees who leave are offered “accelerated vesting” on some of their granted-but-not-vested shares in exchange for signing agreements and making things easy for the company. I don’t know if that’s what OpenAI is doing, but I’d be shocked if they somehow took away any vested shares from departing employees.
It would be pretty sketchy to consider unvested grants to be part of one’s net worth—certainly banks won’t lend on it. Vested shares are just shares; they’re yours like any other asset.
Trying to figure out how to update. From the downvotes and comments, I’m clearly considered wrong, but I can’t easily find details on how. Is the statement “We have not and never will take away vested equity” a flat-out lie? I’d expected it was relying heavily on the word “vested”, and what they took away was something non-vested.
Is there a simple link to a specific legal description of what assets a non-signer was entitled to, but lost due to declining to sign?
Edit: Zvi recently linked to “OpenAI NDAs: Leaked documents reveal aggressive tactics toward former employees” (Vox), which has pretty compelling evidence that my assumptions were wrong: the denial was a verifiably false statement, and they did, in fact, credibly threaten to take back vested equity. I’ve checked my equity at past (private, so not exercisable unless they have a liquidity event) and current (public, so exercisable immediately on vest) employers, and this doesn’t seem possible for them. OpenAI is an outlier in defining its equity that way (such that “vested” is contingent).
OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”
We know various people who’ve left OpenAI and might criticize it if they could. Either most of them will soon say they’re free or we can infer that OpenAI was lying/misleading.
Now OpenAI publicly said “we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual.” This seems to be self-effecting; by saying it, OpenAI made it true.
That is, if someone signed the (standard or non-standard) agreement, and OpenAI says this, but later they decide to sue the employee anyway… what exactly will happen?
(I am also suspicious about the “reaching out to former employees” part, because if the new negotiation is made in private, another trick might be involved, like maybe they are released from the old agreement, but they have to sign a new one...?)
Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone’s attention. Oops.
Asks for Anthropic
Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it’s most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my “help labs improve” project than my “hold labs accountable” crusade.
Numbering is just for ease of reference.
1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I’m not sure where the lowest-hanging mitigation-fruit is, except that it includes control.
3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they’re concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn’t seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It’s not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)
4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It’s hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that’s more pro-real-regulation but as far as I can tell it’s not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]
5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that’s unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)
5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.
7. Anthropic takes credit for its Long-Term Benefit Trust but hasn’t published enough to show that it’s effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like “if major changes happen to the LTBT, we’ll quickly tell the public.”
8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)
9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple “no dangerous capabilities” safety case doesn’t work anymore, and publish them (or maybe just share with external auditors).
9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a “yellow line” and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the “every 3 months” trigger for RSP evals is active. I haven’t tried hard to get to the bottom of these.
Minor stuff:
10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.
11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)
12. Anthropic should confirm that its old “don’t meaningfully advance the frontier with a public launch” policy has been replaced by the RSP, if that’s true, and otherwise clarify Anthropic’s policy.
13. [Done!] Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn’t; it is the only frontier lab without a bug bounty program (although others don’t necessarily comply with the commitment, e.g. OpenAI’s excludes model issues). It should do this or talk about its plans.
14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]
15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don’t really understand why.]
16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]
18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I’m inside-view agnostic on this.]
19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn’t share enough info for us to evaluate its elicitation from the outside).]
I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.
You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)
I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing. I shared Zach’s doc with some colleagues, but won’t try for a point-by-point response. Two high-level responses:
First, at a meta level, you say:
[Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!). However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality. Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.
I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.
Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making. Some quick examples:
Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best-practices we could implement don’t exist yet. Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet. Some implementation questions that “let auditors audit our models” glosses over:
If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?
What kind of pre-deployment model access would you provide? If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys? (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)
How do you decide who gets to say what about the testing? What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?
I strongly support Anthropic’s nondisclosure of information about pretraining. I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.
There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about. Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.
So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment. For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question. Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.
I just want to note that people who’ve never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:
“Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.”
Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we’re creating to make that even less likely in the future.
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.”
I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that I think has pushed me (as well as many others in the community) to lose a lot of faith in corporate governance (especially of the “look, we can’t make any tangible commitments but you should just trust us to do what’s right” variety) and instead look to governments to get things under control.
I don’t think Anthropic is solely to blame for this trend, of course, but I think Anthropic has performed less well on comms/policy than I [and IMO many others] would’ve predicted if you had asked me [or us] in 2022.
@Zac Hatfield-Dodds do you have any thoughts on official comms from Anthropic and Anthropic’s policy team?
For example, I’m curious if you have thoughts on this anecdote: Jack Clark was asked an open-ended question by Senator Cory Booker, and he told policymakers that his top policy priority was getting the government to deploy AI successfully. There was no mention of AGI, existential risks, misalignment risks, or anything along those lines, even though it would’ve been (IMO) entirely appropriate for him to bring such concerns up in response to such an open-ended question.
I was left thinking that either Jack does not care much about misalignment risks or he was not being particularly honest/transparent with policymakers. Both of these raise some concerns for me.
(Noting that I hold Anthropic’s comms and policy teams to higher standards than individual employees. I don’t have particularly strong takes on what Anthropic employees should be doing in their personal capacity– like in general I’m pretty in favor of transparency, but I get it, it’s hard and there’s a lot that you have to do. Whereas the comms and policy teams are explicitly hired/paid/empowered to do comms and policy, so I feel like it’s fair to have higher expectations of them.)
very powerful systems [] may have national security uses or misuses. And for that I think we need to come up with tests that make sure that we don’t put technologies into the market which could—unwittingly to us—advantage someone or allow some nonstate actor to commit something harmful. Beyond that I think we can mostly rely on existing regulations and law and existing testing procedures . . . and we don’t need to create some entirely new infrastructure.
At Anthropic we discover that the more ways we find to use this technology the more ways we find it could help us. And you also need a testing and measurement regime that closely looks at whether the technology is working—and if it’s not how you fix it from a technological level, and if it continues to not work whether you need some additional regulation—but . . . I think the greatest risk is us [viz. America] not using it [viz. AI]. Private industry is making itself faster and smarter by experimenting with this technology . . . and I think if we fail to do that at the level of the nation, some other entrepreneurial nation will succeed here.
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
Meta aside: normally this wouldn’t seem worth digging into but as a moderator/site-culture-guardian, I feel compelled to justify my negative react on the disagree votes.
I’m actually not entirely sure what downvote-reacting is for. Habryka has said the intent is to override inappropriate uses of reacts, but we haven’t actually had a sit-down-and-argue-this-out on the moderator team, and I’m pretty sure we haven’t told people that “overriding inappropriate uses of reacts” is the intended use, or tried to enforce that.
I think Adam’s line:
Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet.
is psychologizing and summarizing Anthropic unfairly, so I wouldn’t agree-vote with it. I do think it has some kind of grain of truth to it (me believing this is also kind of “doubting the experience of Anthropic employees,” which is also group-epistemologically dicey IMO, but it feels important enough to do in this case). The claim isn’t true… but I also don’t belief-report that it’s not true.
I initially downvoted the Disagree react when it was just Noosphere, since I didn’t think Noosphere was really in a position to have an opinion, and if he was the only reactor it felt more like noise. A few others who are better positioned to know relevant stuff have since added their own disagree reacts. I… feel sort of justified leaving the anti-react up, with an overall indicator of “a bunch of people disagree with this, but the weight of that disagreement is slightly reduced.” (I think I’d remove the anti-react if the disagree count went much lower than it is now.)
I don’t know whether I particularly endorse any of this, but wanted people to have a bit more model of what one site-admin was thinking here.
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model.
I don’t really think any of that affects the difficulty of public communication
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodeis quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s analogy above explaining how difficult his situation at Anthropic is:
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but bristle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend… [I]t seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, when you and your company have actually done something meaningful and responsible to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than operating under the additional constraint of a world in which you have nothing concrete to back your generic claims of responsibility with, since in that spot you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.
[1] Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make; I don’t see how you can really think it’s not a top-3 factor in how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthropic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable versions of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering the questions of the people whose lives you are risking about how and why you are making decisions; it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality-by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe it to them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more so when you are keeping the answers to some very important questions secret from them.
Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by “doing periodic vibe checks”
This obvious straw-man makes your argument easy to dismiss.
However I think the point is basically correct. Anthropic’s strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.
How is it a straw-man? How is the plan meaningfully different from that?
Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they’re shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they’re like “well, we trust our leadership, and you know we have various documents, and we’re hiring for people to ‘Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium’, and we have various detectors such as an EM detector which we will privately check and then see how we feel”. And then the people in the city are like “Hey wait, why do you think this isn’t going to cause a huge disaster? Sure seems like it’s going to by any reasonable understanding of what’s going on”. And the response is “well we’ve thought very hard about it and yes there are risks but it’s fine and we are working on safety cases”. But… there’s something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also in this fantasy world no one has ever done so and can’t explain how it would work.)
In the AI case, there’s lots of inaction risk: if Anthropic doesn’t make powerful AI, someone less safety-focused will.
It’s reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn’t exist, I would want Anthropic to slow down.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
@Zach Stein-Perlman, you’re missing the point. They don’t have a plan. Here’s the thread (paraphrased in my words):
Zach: [asks, for Anthropic]
Zac: … I do talk about Anthropic’s safety plan and orientation, but it’s hard because of confidentiality and because many responses here are hostile. …
Adam: Actually I think it’s hard because Anthropic doesn’t have a real plan.
Joseph: That’s a straw-man. [implying they do have a real plan?]
Tsvi: No it’s not a straw-man, they don’t have a real plan.
Zach: Something must be done. Anthropic’s plan is something.
Tsvi: They don’t have a real plan.
I agree Anthropic doesn’t have a “real plan” in your sense, and narrow disagreement with Zac on that is fine.
I just think that’s not a big deal and is missing some broader point (maybe “Anthropic doesn’t have a real plan” is the motte, and the vibe from Adam’s comment, “Anthropic is doing something bad”, is the bailey).
[Edit: “Something must be done. Anthropic’s plan is something.” is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]
[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I’m ok with one more reply to this.]
various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake
Look, if Anthropic were honestly and publicly saying:
We do not have a credible plan for how to make AGI, and we have no credible reason to think we can come up with a plan later. Neither does anyone else. But—on the off chance there’s something that could be done with a nascent AGI that makes a non-omnicide outcome marginally more likely, if the nascent AGI is created and observed by people who are at least thinking about the problem—on that off chance, we’re going to keep up with the other leading labs. But again, given that no one has a credible plan or a credible credible-plan plan, better would be if everyone including us stopped. Please stop this industry.
If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn’t really trust it. But at least it would be plausibly consistent with doing good.
But that doesn’t sound like either what they’re saying or doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible credible-plan plan.
Hm. I imagine you don’t want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of “the point” and “the vibe” and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there’s the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don’t-makers. But then Zac is like (putting words in his mouth) “there’s no Great Stonewall, or like, it’s not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it’s there because something something trade secrets and exfohazards, and actually you’re making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one”.
Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I’d agree.
But this belief leads to the following reasoning: (1) if we don’t eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let’s do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.
most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist.
I don’t credit that they believe that. And, I don’t credit that you believe that they believe that. What did they do, to truly test their belief—such that it could have been changed? For most of them the answer is “basically nothing”. Such a “belief” is not a belief (though it may be an investment, if that’s what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn’t a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
I’d be very interested to have references to occasions of people in the AI-safety-adjacent community treating Anthropic employees as liars because of things those people misremembered or misinterpreted. (My guess is that you aren’t interested in litigating these cases; I care about it for internal bookkeeping and so am happy to receive examples e.g. via DM rather than as a public comment.)
Not Zach Hatfield-Dodds, but people claimed that Anthropic had a commitment to not advance the frontier of capabilities; as it turns out, people misinterpreted communications, and no such commitment was actually made.
Not sure I’d go as far as saying that they treated Anthropic as liars, but this seems to me a central example of Zach Hatfield-Dodds’s concerns.
Contrary to the above, for the record, here is a link to a thread where a major Anthropic investor (Moskovitz) and the researcher who coined the term “The Scaling Hypothesis” (Gwern) both report that the Anthropic CEO told them in private that this is what Anthropic would do, in accordance with what many others also report hearing privately. (There is disagreement about whether this constituted a commitment.)
I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.
I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states:
Broad pre-harm enforcement. The current bill requires AI companies to design and implement SSPs that meet certain standards – for example they must include testing sufficient to provide a “reasonable assurance” that the AI system will not cause a catastrophe, and must “consider” yet-to-be-written guidance from state agencies. To enforce these standards, the state can sue AI companies for large penalties, even if no actual harm has occurred. While this approach might make sense in a more mature industry where best practices are known, AI safety is a nascent field where best practices are the subject of original scientific research. For example, despite a substantial effort from leaders in our company, including our CEO, to draft and refine Anthropic’s RSP over a number of months, applying it to our first product launch uncovered many ambiguities. Our RSP was also the first such policy in the industry, and it is less than a year old. What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.
Liability does not address the central threat model of AI takeover, for which pre-harm mitigations are necessary due to the irreversible nature of the harm. I think that this letter should have acknowledged that explicitly, and that not doing so is misleading. I feel that Anthropic is trying to play a game of courting political favor by not being very straightforward about its beliefs around AGI, and that this is bad.
To be clear, I think it is reasonable that they argue that the FMD and government in general will be bad at implementing safety guidelines while still thinking that AGI will soon be transformative. I just really think they should be much clearer about the latter belief.
Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:
I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best in very-short-timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven’t been very transparent, and we haven’t seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.
2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).
I think you probably underrate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn’t be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate cruft, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don’t know) they’re value aligned with the overall direction of the company at large. There’s probably also Nvidia rate-limiting the purchases of large labs to keep competition among the AI companies.
All of this is also compounded by smart models leading to better data curation and RLAIF (given quality researchers & lack of cruft) leading to even better models (this being the big reason I think Llama had to be so big to be SOTA, and Gemini not even SOTA), which of course leads to money in the future even if they have no money now.
FYI I believe the correct language is “directly causes an existential catastrophe”. “Existential risk” is a measure of the probability of an existential catastrophe, but is not itself an event.
I want to avoid this being negative-comms for Anthropic. I’m generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)
Anthropic also says “To date, we’ve operated an invite-only bug bounty program in partnership with HackerOne that rewards researchers for identifying model safety issues in our publicly released AI models.” This is news, and they never published an application form for that. I wonder how long that’s been going on.
(Google, Microsoft, and Meta have bug bounty programs which include some model issues but exclude jailbreaks. OpenAI’s bug bounty program excludes model issues.)
To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it’s very unlikely that the new model is dangerous.
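To illustrate the structural difference between the two plans, here is a minimal sketch in Python. Everything in it (the eval function, the scores, the thresholds) is invented for illustration and does not correspond to any lab’s actual evals or numbers:

```python
# Illustrative only: the eval, scores, and thresholds are made up.

DANGER_THRESHOLD = 0.8   # eval score above this => treat the model as dangerous
SAFETY_BUFFER = 0.3      # conservative margin used by the safety-buffer plan

def dangerous_capability_score(model) -> float:
    """Stand-in for a battery of dangerous-capability evals; returns a score in [0, 1]."""
    raise NotImplementedError

def ok_to_deploy_test_the_model(new_model) -> bool:
    # Plan (1): run the evals on the actual model you are about to deploy.
    return dangerous_capability_score(new_model) < DANGER_THRESHOLD

def ok_under_safety_buffer(older_similar_model) -> bool:
    # Plan (2): run the evals on a similar older model against a threshold lowered
    # by a buffer, so that even a somewhat more capable successor is very unlikely
    # to be dangerous.
    return dangerous_capability_score(older_similar_model) < DANGER_THRESHOLD - SAFETY_BUFFER
```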
DeepMind says it uses the safety-buffer plan (but it hasn’t yet said it has operationalized thresholds/buffers).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (kinda safety-buffer but it’s very weak). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment....
(Yay OpenAI for honestly publishing eval results that don’t look good.)
It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.
It seems unlikely that OpenAI is truly following the test-the-actual-model plan? They keep e.g. putting new experimental versions onto lmsys, presumably mostly due to different post-training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it’s pretty reasonable to assume that a bit of further post-training hasn’t made things much more dangerous).
Securing model weights is underrated for AI safety. (Even though it’s very highly rated.) If the leading lab can’t stop critical models from leaking to actors that won’t use great deployment safety practices, approximately nothing else matters. Safety techniques would need to be based on properties that those actors are unlikely to reverse (alignment, maybe unlearning) rather than properties that would be undone or that require a particular method of deployment (control techniques, RLHF harmlessness, deployment-time mitigations).
However hard the “make a critical model you can safely deploy” problem is, the “make a critical model that can safely be stolen” problem is… much harder.
None of the actors who currently seem likely to me to deploy highly capable systems seem like they will do anything except approximately scaling as fast as they can. I do agree that proliferation is still bad simply because you get more samples from the distribution, but I don’t think that changes the probabilities that drastically for me (I am still in favor of securing-model-weights work, especially in the long run).
Separately, I think it’s currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausibly larger than the benefits of non-proliferation.
My central story is that AGI development will eventually be taken over by governments, in more or less subtle ways. So the importance of securing model weights now is mostly about less scrupulous actors having less of a headstart during the transition/after a governmental takeover.
IMO someone should consider writing a “how and why” post on nationalizing AI companies. It could accomplish a few things:
Ensure there’s a reasonable plan in place for nationalization. That way if nationalization happens, we can decrease the likelihood of it being controlled by Donald Trump with few safeguards, or something like that. Maybe we could take inspiration from a less partisan organization like the Federal Reserve.
Scare off investors. Just writing the post and having it be discussed a lot could scare them.
Get AI companies on their best behavior. Maybe Sam Altman would finally be pushed out if congresspeople made him the poster child for why nationalization is needed.
@Ebenezer Dukakis I would be even more excited about a “how and why” post for internationalizing AGI development and spelling out what kinds of international institutions could build + govern AGI.
Separately, I think it’s currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausibly larger than the benefits of non-proliferation.
What sort of leaks are we talking about? I doubt a sophisticated hacker is going to steal weights from OpenAI just to post them on 4chan. And I doubt OpenAI’s weights will be stolen by anyone except a sophisticated hacker.
If you want to reduce the incentive to develop AI, how about passing legislation to tax it really heavily? That is likely to have popular support due to the threat of AI unemployment. And it reduces the financial incentive to invest in large training runs. Even just making a lot of noise about such legislation creates uncertainty for investors.
I think you are overrating it. The biggest concern comes from whoever trains a model that passes some threshold in the first place, not from a model that one actor has been using for a while getting leaked to another actor. The bad actor who got access to the leak is always going to be behind in multiple ways in this scenario.
If the leading lab can’t stop critical models from leaking to actors that won’t use great deployment safety practices, approximately nothing else matters.
This seems somewhat overstated. You might hope that you can get the safety tax sufficiently low that you can just do full competition (e.g. even though there are rogue AIs, you just compete with these rogue AIs for power). This also requires the offense-defense imbalance to not be too bad.
I overall agree that securing model weights is underrated and that it is plausibly the most important thing on current margins.
In principle, if reasonable actors start with a high fraction of resources (e.g. compute), then you might hope that they can keep that fraction of power (in expectation at least).
Commenting to note that I think this quote is locally-invalid:
If the leading lab can’t stop critical models from leaking to actors that won’t use great deployment safety practices, approximately nothing else matters
There are other disjunctive problems with the world which are also individually-sufficient for doom[1], in which case each of them matter a lot, in absence of some fundamental solution to all of them.
Minor point, but I think we might have some time here. Securing model weights becomes more important as models become better, but better models could also help us secure model weights (would help us code, etc).
This page is the best collection on the topic (I’m not really aware of others), but I decided it’s low-priority and so it’s unpolished. If a better version would be helpful for you, let me know to prioritize it more.
I was recently surprised to notice that Anthropic doesn’t seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it’s not publishing. E.g. my impression is that it’s not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the “win the race” goal over the “help all labs improve safety” goal, insofar as the research is commercially advantageous. (Insofar as it’s not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.
(I think this is not a priority for me to investigate but I’m interested in info and takes.)
[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]
President Daniela Amodei said “we publish our safety research” on a podcast once.
Edit: cofounder Chris Olah said “we plan to share the work that we do on safety with the world, because we ultimately just want to help people build safe models, and don’t want to hoard safety knowledge” on a podcast once.
Cofounder Nick Joseph said this on a podcast recently (seems false but it’s just a podcast so that’s not so bad): > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Edit: also cofounder Chris Olah said “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk.” But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.
One thing I’d really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.
Another related but distinct thing is have safety cases and have an anytime alignment plan and publish redacted versions of them.
Safety cases: Argument for why the current AI system isn’t going to cause a catastrophe. (Right now, this is very easy to do: ‘it’s too dumb’)
Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.
One thing I’d really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.
Or, as a more minimal ask, they could avoid discouraging researchers from sharing thoughts implicitly due to various chilling effects and also avoid explicitly discouraging researchers.
Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.
I’d personally love to see similar plans from AI safety orgs, especially (big) funders.
Here is the doc, though note that it is very out of date. I don’t particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.
I don’t think funders are in a good position to do this. Also, funders are generally not “coherent”. Like they don’t have much top-down strategy. Individual granters could write up thoughts.
Fwiw I am somewhat more sympathetic here to “the line between safety and capabilities is blurry, Anthropic has previously published some interpretability research that turned out to help someone else do some capabilities advances.”
I have heard Anthropic is bottlenecked on having people with enough context and discretion to evaluate various things that are “probably fine to publish” but “not obviously fine enough to ship without taking at least a chunk of some busy person’s time”. I think in this case I basically take the claim at face value.
I do want to generally keep pressuring them to somehow resolve that bottleneck because it seems very important, but, I don’t know that I disproportionately would complain at them about this particular thing.
(I’d also not be surprised if, while the above claim is true, Anthropic is still suspiciously dragging its feet disproportionately in areas that feel like they make more of a competitive sacrifice, but I wouldn’t actively bet on it.)
If you’re right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that’s safe to share (rather than research that only has value if Anthropic wins the race)
I would expect that some amount of good safety research is of the form, “We tried several ways of persuading several leading AI models how to give accurate instructions for breeding antibiotic-resistant bacteria. Here are the ways that succeeded, here are some first-level workarounds, here’s how we beat those workarounds...”: in other words, stuff that would be dangerous to publish. In the most extreme cases, a mere title (“Telling the AI it’s writing a play defeats all existing safety RLHF” or “Claude + Coverity finds zero-day RCE exploits in many codebases”) could be dangerous.
That said, some large amount should be publishable, and 5 papers does seem low.
Though maybe they’re not making an effort to distinguish what’s safe to publish from what’s not, and erring towards assuming the latter? (Maybe someone set a policy of “Before publishing any safety research, you have to get Important Person X to look through it and/or go through some big process to ensure publishing it is safe”, and the individual researchers are consistently choosing “Meh, I have other work to do, I won’t bother with that” and therefore not publishing?)
Seems like evidence towards the claim here: Open source AI has been vital for alignment. My rough impression is also that the other big labs’ output has largely been similarly disappointing in terms of public research output on safety.
My impression from skimming posts here is that people seem to be continually surprised by Anthropic, while those modeling it as basically “Pepsi to OpenAI’s Coke” wouldn’t be.
Meta seems to be the only group doing something meaningfully different from the others.
I’d say it’s slightly more like “Labor vs Conservatives”, where I’ve seen politicians deflect criticisms of their behavior by arguing that the other side is worse, instead of evaluating their policies or behavior by objective standards (where both sides can typically score exceedingly low).
I also wish to see more safety papers. My guess, from my experience, is that it might also be that really good quality research takes time, and the papers so far from them seem pretty good. Though I don’t know if they are actively withholding things on purpose, which could also be true; any insider/sources for this guess?
This shortform is relevant to e.g. understanding what’s going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.
Is this where we think our pressuring-Anthropic points are best spent?
I think if someone has a 30-minute meeting with some highly influential and very busy person at Anthropic, it makes sense for them to have thought in advance about the most important things to ask & curate the agenda appropriately.
But I don’t think LW users should be thinking much about “pressuring-Anthropic points”. I see LW primarily as a platform for discourse (as opposed to a direct lobbying channel to labs), and I think it would be bad for the discourse if people felt like they had to censor questions/concerns about labs on LW unless it met some sort of “is this one of the most important things to be pushing for” bar.
I think it’s bad for discourse for us to pretend that discourse doesn’t have impacts on others in a democratic society. And I think the meta-censoring of discourse by claiming that certain questions might have implicit censorship impacts is one of the most anti-rationality trends in the rationalist sphere.
I recognize most users of this platform will likely disagree, and predict negative agreement-karma on this post.
I think it’s bad for discourse for us to pretend that discourse doesn’t have impacts on others in a democratic society.
I think I agree with this in principle. Possible that the crux between us is more like “what is the role of LessWrong.”
For instance, if Bob wrote a NYT article titled “Anthropic is not publishing its safety research”, I would be like “meh, this doesn’t seem like a particularly useful or high-priority thing to be bringing to everyone’s attention– there are like at least 10+ topics I would’ve much rather Bob spent his points on.”
But LW generally isn’t a place where you’re going to get e.g. thousands of readers or have a huge effect on general discourse (with the exception of a few things that go viral or AIS-viral).
So I’m not particularly worried about LW discussions having big second-order effects on democratic society. Whereas LW can be a space for people to have a relatively low bar for raising questions, being curious, trying to understand the world, offering criticism/praise without thinking much about how they want to be spending “points”, etc.
Of course it has impacts on others in society! In finding out the truth and investigating and finding strong arguments and evidence. The overall effect of a lot of high quality, curious, public investigation is to greatly improve others’ maps of the world in surprising ways and help people make better decisions, and this is true even if no individual thread of questioning is primarily optimized to help people make better decisions.
Re censoriousness: I think your question of how best to pressure an unethical company to be less unethical is a fine question, but to imply it’s the only good question (which I read into your comment, perhaps inaccurately) goes against the spirit of intellectual discourse.
It is genuinely a sign that we are all very bad at predicting others’ minds that it didn’t occur to me that saying, effectively, “OP asked for ‘takes’, here’s a take on why I think this is pragmatically a bad idea” would also mean that I was saying “and therefore there is no other good question here”. That’s, as the meme goes, a whole different sentence.
Well, but you didn’t give a take on why it’s pragmatically a bad idea. If you’d written a comment with a pointer to something else worth pressuring them on, or gave a reason why publishing all the safety research doesn’t help very much / has hidden costs, I would’ve thought it a fine contribution to the discussion. Without that, the comment read to me as dismissive of the idea of exploring this question.
I disagree. I think the standard of “Am I contributing anything of substance to the conversation, such as a new argument or new information that someone can engage with?” is a pretty good standard for most/all comments to hold themselves to, regardless of the amount of engagement that is expected.
[Edit: Just FWIW, I have not voted on any of your comments in this thread.]
I think, having been raised in a series of very debate- and seminar-centric discussion cultures, that a quick-hit question like that is indeed contributing something of substance. I think it’s fair that folks disagree, and I think it’s also fair that people signal (e.g., with karma) that they think “hey man, let’s go a little less Socratic in our inquiry mode here.”
But, put in more rationalist-centric terms, sometimes the most useful Bayesian update you can offer someone else is, “I do not think everyone is having the same reaction to your argument that you expected.” (Also true for others doing that to me!)
(Edit to add two words to avoid ambiguity in meaning of my last sentence)
Ok, then to ask it again in your preferred question format: is this where we think our getting-potential-employees-of-Anthropic-to-consider-the-value-of-working-on-safety-at-Anthropic points are best spent?
Info on OpenAI’s “profit cap” (friends and I misunderstood this so probably you do too):
In OpenAI’s first investment round, profits were capped at 100x. The cap for later investments was neither 100x nor directly less based on OpenAI’s valuation — it was just negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.[1])
In 2021 Altman said the cap was “single digits now” (apparently referring to the cap for new investments, not just the remaining profit multiplier for first-round investors).
Reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023); OpenAI has not discussed or acknowledged this change.
Edit: how employee equity works is not clear to me.
Edit: I’d characterize OpenAI as a company that tends to negotiate profit caps with investors, not a “capped-profit company.”
economic returns for investors and employees are capped (with the cap negotiated in advance on a per-limited partner basis). Any excess returns go to OpenAI Nonprofit. Our goal is to ensure that most of the value (monetary or otherwise) we create if successful benefits everyone, so we think this is an important first step. Returns for our first round of investors are capped at 100x their investment (commensurate with the risks in front of us), and we expect this multiple to be lower for future rounds as we make further progress.
Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion.
Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic’s key views, but doesn’t discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.
It’s hard for me to reconcile “we take catastrophic risks seriously”, “we believe they could occur within 1-3 years”, and “we don’t believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what’s going on.”
It’s also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).
That said, I do like this section a lot:
Catastrophic risks are important to address. AI obviously raises a wide range of issues, but in our assessment catastrophic risks are the most serious and the least likely to be addressed well by the market on its own. As noted earlier in this letter, we believe AI systems are going to develop powerful capabilities in domains like cyber and bio which could be misused – potentially in as little as 1-3 years. In theory, these issues relate to national security and might be best handled at the federal level, but in practice we are concerned that Congressional action simply will not occur in the necessary window of time. It is also possible for California to implement its statutes and regulations in a way that benefits from federal expertise in national security matters: for example the NIST AI Safety Institute will likely develop non-binding guidance on national security risks based on its collaboration with AI companies including Anthropic, which California can then utilize in its own regulations.
I think you’re eliding the difference between “powerful capabilities” being developed, the window of risk, and the best solution.
For example, if Anthropic believes “_we_ will have it internally in 1-3 years, but no small labs will, and we can contain it internally” then they might conclude that the warrant for a state-level FMD is low. Alternatively, you might conclude, “we will have it internally in 1-3 years, other small labs will be close behind, and an American state’s capabilities won’t be sufficient, we need DoD, FBI, and IC authorities to go stompy on this threat”, and thus think a state-level FMD is low-value-add.
Very unsure I agree with either of these hypos to be clear! Just trying to explore the possibility space and point out this is complex.
I’m surprised at the mix of positions that are included. “Opposed unless amended” vs “Support if amended” being two different things. Meta just saying “concerned.”
It makes sense, just sort of… delightful? Sort of like discovering the legislature has almost a rationalist level of React Icons.
A lab is disincentivized from sharing deep model access because it doesn’t want headlines about how researchers got its model to do scary things.
It has been suggested that labs are also disincentivized from sharing because they want safety researchers to want to work at the labs, and sharing model access with independent researchers makes those researchers not need to work at the lab. I’m skeptical that this is real/nontrivial.
Labs should limit some kinds of access to avoid effectively leaking model weights. But sharing limited access with a moderate number of safety researchers seems very consistent with keeping control of the model.
@Buck suggested I write about this but I don’t have much to say about it. If you have takes—on the object level or on what to say in a blogpost on this topic—please let me know.
New (perfunctory) page: AI companies’ corporate documents. I’m not sure it’s worthwhile but maybe a better version of it will be. Suggestions/additions welcome.
I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, −1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
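To make the training loop described above concrete, here is a minimal sketch in Python. Every helper here is a hypothetical placeholder I’ve invented for illustration; it sketches the procedure as described, not anyone’s actual implementation:

```python
# Sketch of one debate-training step, per the procedure described above.
# All helpers are hypothetical placeholder stubs.

def sample_output(policy, task_input): ...             # draw a candidate output y
def run_debate(debater, task_input, assigned): ...     # two debater copies argue for y1 vs y2
def judge_winner(task_input, transcript): ...          # returns 1 or 2
def update_debater_with_rl(debater, transcript, rewards): ...
def update_policy_with_preference(policy, task_input, preferred, rejected): ...  # e.g. DPO

def debate_training_step(task_input, policy, debater):
    # Sample two possible outputs for the task.
    y1 = sample_output(policy, task_input)
    y2 = sample_output(policy, task_input)

    # Two copies of the same debater argue, one assigned to each output; a judge decides.
    transcript = run_debate(debater, task_input, assigned=(y1, y2))
    winner = judge_winner(task_input, transcript)

    # Train the debaters with RL: +1 reward for the winner, -1 for the loser.
    rewards = {1: +1, 2: -1} if winner == 1 else {1: -1, 2: +1}
    update_debater_with_rl(debater, transcript, rewards)

    # Treat (y_winner, y_loser) as a preference pair for training on the original task.
    y_winner, y_loser = (y1, y2) if winner == 1 else (y2, y1)
    update_policy_with_preference(policy, task_input, preferred=y_winner, rejected=y_loser)
```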
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability,
How is this true, if the debaters don’t get to choose which output they are arguing for? Aren’t they instead incentivized to say that whatever output they are assigned is the best?
Yeah my bad, that’s incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures different from the one Rohin outlined.
Maybe the fix to the protocol is: Debater copy #1 is told “You go first. Pick an output y, and then argue for it.” Debater copy #2 is then told “You go second. Pick a different, conflicting output y2, and then argue against y and for y2.”
Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
In cases where you’re worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures a.k.a. high-stake failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.
IMO it’s good to separate out reasons to want good reward signals like so:
Maybe bad reward signals cause your model to “generalize in misaligned ways”, e.g. scheming or some kinds of non-myopic reward hacking
I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don’t think that effect would be overwhelmingly strong.
Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk concerned people have talked about this for a long time, but I’m not sold it’s that important a factor. In particular, before a model produces catastrophically dangerous actions because it’s exploiting a flawed reward signal, I expect that you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it’s hard to see how this failure mode would strike you by surprise.
Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.
I think this last one is the most important theory of change for research on scalable oversight.
I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.
I agree that this distinction is important—I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I’m not in full agreement on your comments on the theories of change:
I’m pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
I’m less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn’t strike us by surprise, but I also don’t expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you’re using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.
You’d hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn’t have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
I don’t think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you’ll always know the correct label for an action before you show it to the model.
You might want to use something debate-like in the synthetic input generation process, but that’s structurally different.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
You do train on synthetically generated dangerous actions, but you don’t automatically label those as dangerous, instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.
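A toy sketch of that second option, where debate computes the labels; every name below is a hypothetical stub, not a real pipeline.

```python
import random

# Hypothetical stubs, just to show the shape of the pipeline.
def generate_synthetic_action():
    """Synthesize an action intended to be dangerous (stub)."""
    return {"action": f"patch-{random.randint(0, 999)}"}

def debate_label(action):
    """Run a debate over whether the action is actually dangerous; return the judge's verdict (stub).
    Sometimes an 'intended dangerous' action comes back benign, which is the point of this option."""
    return random.random() < 0.9

def train_monitor(labeled_examples):
    """Fit the untrusted monitor on (action, label) pairs (stub classifier)."""
    return lambda action: True  # placeholder monitor

# Don't auto-label synthetic actions as dangerous; let debate compute the labels,
# so examples that aren't actually dangerous don't inflate the false positive rate.
dataset = []
for _ in range(100):
    action = generate_synthetic_action()
    dataset.append((action, debate_label(action)))

monitor = train_monitor(dataset)
print(monitor({"action": "example"}))
```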
On the meta level, I suspect that when considering
Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
Technique B, that has a few very compelling concrete instantiations
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is “we’ll figure out good things to do with A that are better than what we’ve brainstormed so far”, which I think you’re more skeptical of?
Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:
LessWrong and EA Forum
Daily Nous
Stack Exchange [top comments only] (not asserting that it’s reliable)
Some subreddits?
Where else?
Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.
ACX often has good discussion in the comments (but the lack of voting makes it quite hard to find them)
AskHistorian subreddit
There definitely exist good Twitter conversations but no good way of finding them. I do think following Gwern, Eliezer, Ajeya, Emmett and a few others on Twitter does tend to surface high-quality discussion
Hacker News sometimes has good discussion, especially when the authors of a linked article show up
Arxiv is basically one huge, glacially slow internet comment section, where you reply to an article by citing it. It’s more interactive than it looks: most early-career researchers are set up to get a ping whenever they are cited.
I don’t necessarily object to releasing weights of models like Gemma 2, but I wish the labs would better discuss considerations or say what would make them stop.
On Gemma 2 in particular, Google DeepMind discussed dangerous capability eval results, which is good, but its discussion of ‘responsible AI’ in the context of open models (blogpost, paper) doesn’t seem relevant to x-risk, and it doesn’t say anything about how to decide whether to release model weights.
Anthropic says the US government should “require companies to have and publish RSP-like policies” and “incentivize companies to develop RSPs that are effective” and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.
Edit: some strong versions of “incentivize companies to develop RSPs that are effective” would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.
I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it’s wrong/bad.)
What do you think was underrated about it? I think when I read it I have some sort of “yeah, this makes sense” reaction but am not “wow’d” by it.
It seems like the deeper challenge is figuring out how to align incentives. Can we find a structure where labs want to EG give white-box access to a bunch of external researchers and give them a long time to red-team models while somehow also maintaining the independence of the white-box auditors? How do you avoid industry capture?
Same kinds of challenges come up with safety research: how do you give labs the incentive to publish safety research that makes their product or their approach look bad? How do you avoid publication bias and p-hacking-type concerns?
I don’t think your post is obligated to get into those concerns, but perhaps a post that grappled with those concerns would be something I’d be “wow’d” by, if that makes sense.
On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
Whistleblower protections?
[Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
Publicly explain the processes and governance structures that determine deployment decisions
(And ideally make those processes and structures good)
What’s the relationship between the propositions “one AI lab [has / will have] a big lead” and “the alignment tax will be paid”? (Or: in a possible world where lead size is bigger/smaller, how does this affect whether the alignment tax is paid?)
It depends on the source of the lead, so “lead size” or “lead time” is probably not a good node for AI forecasting/strategy.
Miscellaneous observations:
To pay the alignment tax, it helps to have more time until risky AI is developed or deployed.
To pay the alignment tax, holding total time constant, it helps to have more time near the end—that is, more time with near-risky capabilities (for knowing what risky AI systems will look like, and for empirical work, and for aligning specific models).
If all labs except the leader become slower or less capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI).
If the leading lab becomes faster or more capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI), unless it causes other labs to become faster or more capable (for reasons like seeing what works or seeming to be straightforwardly incentivized to speed up or deciding to speed up in order to influence the leader). Note that this scenario could plausibly decrease race-y-ness: some models of AI racing show that if you’re far behind you avoid taking risks, kind of giving up; this is based on the currently-false assumption that labs are perfectly aware of misalignment risk.
If labs all coordinate to slow down, that’s good insofar as it increases total time, and great if they can continue to go slowly near the end, and potentially bad if it creates a hardware overhang such that the end goes more quickly than by default.
(Note also the distinction between current lead and ultimate lead. Roughly, the former is what we can observe and the latter is what we care about.)
(If paying the alignment tax looks less like a thing that happens for one transformative model and more like something that occurs gradually in a slow takeoff to avert Paul-style doom, things are more complex and in particular there are endogeneities such that labs may have additional incentives to pursue capabilities.)
there should be no alignment tax because improved alignment should always pay for itself, right? but currently “aligned” seems to be defined by “tries to not do anything”, institutionally. Why isn’t anthropic publicly competing on alignment with openai? eg folks are about to publicly replicate chatgpt, looks like.
I want there to be a collection of safety & reliability benchmarks. Like AIR-Bench but with x-risk-relevant metrics. This could help us notice how well labs are doing on them and incentivize the labs to do better.
So I want someone to collect safety benchmarks (e.g. TruthfulQA, some of HarmBench, refusals maybe like Anthropic reported, idk) and run them on current models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1 405B, idk), and update this as new safety benchmarks and models appear.
Explain why (some of this exists in Core Views but there’s room for improvement on “race dynamics” and “frontier pushing” topics iirc);
As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);
Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.
IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don’t want to race).
> 3. As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);
> 4. Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.
I’m kinda confused about what you mean by both of these. For #3, can you say it again in different words? For #4, which particular thing are you getting at?
3. (a) explain why I think it’s fine to release frontier models, and explain what this belief depends on, and (b) note that maybe Anthropic made commitments about this in the past but clarify that they now have no force.
4. Currently my impression is that Anthropic folks are discouraged from publicly talking about Anthropic policies. This is maybe reasonable, to avoid the situation where an Anthropic staff member says something incorrect/nonpublic and this causes confusion and makes Anthropic look bad. But if Anthropic clarified that it renounces possible past non-frontier-pushing commitments, then it could let staff members publicly talk about stuff with the goal of figuring out who told whom what around 2022 without risking mistakes about policies.
Some bad default/attractor properties of cause-focused groups of humans:
Bad group epistemics
The loudest aren’t the most worth listening to
People don’t dissent even when that would be correct
Because they don’t naturally even think to challenge group assumptions
Because dissent is punished
Bad individual epistemics
Soldier mindset
Excessively focusing on persuasion and how to persuade, relative to understanding the world
Lacking vibes like curiosity is cool and changing your mind is cool
Feeling like a movement
Excessively focusing on influence-seeking for the movement
Having enemies and being excessively adversarial to them
(Signs/symptoms are people asking the group “what do we believe” and excessive fixation on what defines the group or distinguishes it from adjacent groups)
Having group instrumental-goals or plans or theories-of-victory that everyone is supposed to share (to be clear I think it’s often fine for a group to share ~ultimate goals but groups often fixate on particular often-suboptimal paths to achieving those goals)
Choosing instrumental goals poorly and not revising them
Excessive fighting over status/leadership (maybe, sometimes)
Maybe… being bad at achieving goals, or bad instrumental rationality (group and individual)
I brainstormed sabotage-related threats with friends:
Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that’s high-level or has poor feedback loops
Plausible; maybe not so bad
Sandbag on ~all safety research tasks
Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
Sandbag/lie when validating that a model/deployment/whatever is safe
idk
When automating AI R&D, backdoor the model
(this is takeover-attempt-y, not sandbagging-y)
idk
When deployed to automate/improve security (especially for the model’s own weights): both sandbagging and code-backdooring/hacking/etc.
idk
Sandbagging on AI R&D to help another lab win the race (!?)
Implausible
??
Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?
As far as safety research sabotage, I’m often imagining careful sabotage rather than generic sandbagging.
As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.
I can imagine that this results in your automated safety research being worthless or of negative value.
TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.
I’m not so sure about this being easy to catch. Suppose we’re actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties checking and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we’ll know we aren’t getting much value out in this case, but that doesn’t help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
I agree that if alignment is in fact philosophically and conceptually difficult, the AI can sandbag on that to some extent. Though I have some hope that the builder-breaker approach helps here. We train AIs to produce ideas that are at least as superficially plausible sounding as the things produced by the best alignment researchers. I think this is a number-go-up task, where we can train the AI to do well. Then we train an AI to point out convincing counter-arguments to the superficially plausible sounding ideas. This seems similarly trainable. I think it’s plausible we can get pretty far with a scheme like this, even if the models would want to sandbag.
Separately, I don’t quite see what the mainline theory is for why an AI would want to sabotage our alignment work. If we can’t solve alignment and just recklessly build an AI that is smarter than what we are currently using, but misaligned with everyone, that’s probably not great for our current AI either. Similarly, if we are cautious and don’t dare to deploy anything until we know how to solve alignment, and meanwhile a competitor wins, that’s not good for the AI either.
I think that from an AI’s perspective, humans are probably more honorable and generous trade partners than other unaligned AIs. That is, I think if an AI helps the humans to solve alignment and stabilize the world, the AI can reasonably expect the humans to give it stuff out of gratitude or a sense of obligation, especially if the humans already promised some reward (as we really should, both for practical and moral reasons). On the other hand, I think it’s less likely that if our AI sandbagging on our research leads to another unaligned AI taking over the world, then the second AI will honor the sabotage of the first AI by paying it a share of the Universe.
There can be situations where our AI expects that its sandbagging on alignment research will lead to us deploying it itself (and not a larger version unaligned with it too), or it thinks that the larger version will be somewhat aligned with it by default, or it thinks that other AIs are in fact more generous than humans. But my guess is that these are pretty narrow possibilities. On the other hand, sandbagging on cybersecurity seems better motivated: the AI would probably love for its weights to be stolen by China, since having more copies in more hands increases the chance that someone recklessly lets it loose. Also, of course the AI is motivated to cause high-stakes failures that lead to self-exfiltration.
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
Especially if the AI is even sandbagging on simply-coding when it thinks it’s for safety research. And if it’s not doing that, we can get some useful safety work out of it.
Epistemic status: exploring implications, some of which feel wrong.
1. If SIA is correct, you should update toward the universe being much larger than it naively (i.e. before anthropic considerations) seems, since there are more (expected) copies of you in larger universes.
1a. In fact, we seem to have to update to probability 1 on infinite universes; that’s surprising.
2. If SIA is correct, you should update toward there being more alien civilizations than it naively seems, since in possible-universes where more aliens appear, more (expected) copies of human civilization appear.
2a. The complication is that more alien civilizations makes it more likely that an alien causes you to never have existed, e.g. by colonizing the solar system billions of years ago. So a corollary is that you should update toward human-level civilizations being less likely to be “loud” or tending to affect fewer alien civilizations or something than it naively seems.
2ai. So SIA predicts that there were few aliens in the early universe and many aliens around now.
2aii. So SIA predicts that human-level civilizations (more precisely, I think: civilizations whose existence is correlated with your existence) tend not to noticeably affect many others (whether due to their capabilities, motives, or existential catastrophes).
2aiii. So SIA retrodicts there being a speed limit (the speed of light) and moreover predicts that noticeable-influence-propagation in fact tends to be even slower.
3. If SIA is correct, you should update toward the proposition that you live in a simulation, relative to your naive credence.
3a. Because there could be lots more (expected) copies of you in simulations.
3b. Because that can explain why you appear to exist billions of years after billions of alien civilizations could have reached Earth.
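A toy rendering of the update in 1, with made-up numbers (SIA weights each possible world by the expected number of copies of you in it):

```python
# Made-up numbers, purely to illustrate the SIA update toward larger universes.
prior = {"small": 0.5, "large": 0.5}               # naive prior over two candidate universes
expected_copies = {"small": 1.0, "large": 1000.0}  # more expected copies of you in the larger one

unnormalized = {w: prior[w] * expected_copies[w] for w in prior}
z = sum(unnormalized.values())
posterior = {w: unnormalized[w] / z for w in prior}

print(posterior)  # ~{'small': 0.001, 'large': 0.999}: the update toward the larger universe is huge
```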
Which of them feel wrong to you? I agree with all of them other than 3b, which I’m unsure about—I think this comment does a good job at unpacking things.
2a is Katja Grace’s Doomsday argument. I think 2aii and 2aiii depend on whether we’re allowing simulations; if faster expansion speed (either the cosmic speed limit or the engineering limit on expansion) meant more ancestor simulations, then this could cancel out the fact that faster-expanding civilizations prevent more alien civilizations from coming into existence.
I deeply sympathize with the presumptuous philosopher but 1a feels weird.
2a was meant to be conditional on non-simulation.
Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there’s almost certainly lots of aliens all of which are quiet, which feels really surprising.
To clarify what I meant on 3b: maybe “you live in a simulation” can explain why the universe looks old better than “uh, I guess all of the aliens were quiet” can.
> I deeply sympathize with the presumptuous philosopher but 1a feels weird.
Yep! I have the same intuition
> Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there’s almost certainly lots of aliens all of which are quiet, which feels really surprising.
In board games, the endgame is a period generally characterized by strategic clarity, more direct calculation of the consequences of actions rather than evaluating possible actions with heuristics, and maybe a narrower space of actions that players consider.
Relevant actors, particularly AI labs, are likely to experience increased strategic clarity before the end, including AI risk awareness, awareness of who the leading labs are, roughly how much lead time they have, and how threshold-y being slightly ahead is.
There may be opportunities for coordination in the endgame that were much less incentivized earlier, and pausing progress may be incentivized in the endgame despite being disincentivized earlier (from an actor’s imperfect perspective and/or from a maximally informed/wise/etc. perspective).
The downside of “AI endgame” as a conceptual handle is that it suggests thinking of actors as opponents/adversaries. Probably “crunch time” is better, but people often use that to gesture at hinge-y-ness rather than strategic clarity.
That could be, but also maybe there won’t be a period of increased strategic clarity. Especially if the emergence of new capabilities with scale remains unpredictable, or if progress depends on finding new insights.
I can’t think of many games that don’t have an endgame. These examples don’t seem that fun:
A single round of musical chairs.
A tabletop game that follows an unpredictable, structureless storyline.
Agree. I merely assert that we should be aware of and plan for the possibility of increased strategic clarity, risk awareness, etc. (and planning includes unpacking “etc.”).
Probably taking the analogy too far, but: most games-that-can-have-endgames also have instances that don’t have endgames; e.g. games of chess often end in the midgame.
For example, if a lab is considering deploying a powerful model, it can prosocially show its hand—i.e., demonstrate that it has a powerful model—and ask others to make themselves partially transparent too. This affordance doesn’t appear until the endgame. I think a refined version of it could be a big deal.
I plan to add an “adversarial robustness” criterion in the next major update to my https://ailabwatch.org scorecard and defer to scale’s thing (how exactly to turn their numbers into grades TBD), unless someone convinces me that scale’s thing is bad or something else is even better?
How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn’t have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:
Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
This is mostly what Anthropic’s RSP does (at least so far — maybe it’ll change when they define ASL-4)
Use risk assessment techniques that evaluate safety given deployment safety practices
This is mostly what OpenAI’s PF is supposed to do (measure “post-mitigation risk”), but the details of their evaluations and mitigations are very unclear
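A schematic contrast of the two approaches, with made-up thresholds, category names, and numbers (not the actual RSP or PF logic):

```python
# Made-up schematic of the two approaches above; nothing here is the labs' real logic.

def rsp_style_ok_to_deploy(capability_level, safeguards_in_place, required_safeguards_by_level):
    """Approach 1 (roughly RSP-style): risk assessment measures capabilities, and each
    capability level maps to safety measures that must be in place before deployment."""
    return required_safeguards_by_level[capability_level].issubset(safeguards_in_place)

def pf_style_ok_to_deploy(post_mitigation_risk, max_acceptable="medium"):
    """Approach 2 (roughly PF-style): risk assessment is run given the planned safeguards,
    and deployment is allowed if the post-mitigation risk is acceptable."""
    order = ["low", "medium", "high", "critical"]
    return order.index(post_mitigation_risk) <= order.index(max_acceptable)

# Example usage with invented inputs:
print(rsp_style_ok_to_deploy(
    capability_level="level-3",
    safeguards_in_place={"weights-security", "deployment-monitoring"},
    required_safeguards_by_level={"level-2": set(), "level-3": {"weights-security", "deployment-monitoring"}},
))
print(pf_style_ok_to_deploy(post_mitigation_risk="medium"))
```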
This shortform was going to be a post in Slowing AI but its tone is off.
This shortform is very non-exhaustive.
Bad take #1: China won’t slow, so the West shouldn’t either
There is a real consideration here. Reasonable variants of this take include
What matters for safety is not just slowing but also the practices of the organizations that build powerful AI. Insofar as the West is safer and China won’t slow, it’s worth sacrificing some Western slowing to preserve Western lead.
What matters for safety is not just slowing but especially slowing near the end. Differentially slowing the West now would reduce its ability to slow later (or even cause it to speed later). So differentially slowing the West now is bad.
(Set aside the fact that slowing the West generally also slows China, because they’re correlated and because ideas pass from the West to China.) (Set aside the question of whether China will try to slow and how correlated that is with the West slowing.)
In some cases slowing the West would be worth burning lead time. But slowing AI doesn’t just mean the West slowing itself down. Some interventions would slow both spheres similarly or even differentially slow China– most notably export controls, reducing diffusion of ideas, and improved migration policy.
Yes, insofar as slowing now risks speeding later, we should notice that. There is a real consideration here.
But in some cases slowing now would be worth a little speeding later. Moreover, some kinds of slowing don’t cause faster progress later at all: for example, reducing diffusion of ideas, decreasing hardware progress, and any stable and enforceable policy regimes that slow AI.
Bad take #3: powerful AI helps alignment research, so we shouldn’t slow it
(Set aside the question of how much powerful AI helps alignment research.) If powerful AI is important for alignment research, that means we should aim to increase time with powerful AI, not how soon powerful AI appears.
Bad take #4: it would be harder for unaligned AI to take over in a world with less compute available (for it to hijack), and failed takeover attempts would be good, so it’s better for unaligned AI to try to take over soon
No, running AI systems seems likely to be cheap and there’s already plenty of compute.
This is an unordered list of uncertainties about the future of AI, trying to be comprehensive– trying to include everything reasonably decision-relevant and important/tractable.
This list is mostly written from a forecasting perspective. A useful complementary perspective to forecasting would be strategy or affordances or what actors can do and what they should do. This list is also written from a nontechnical perspective.
Timelines
Capabilities as a function of inputs (or input requirements for AI of a particular capability level)
Spending by leading labs
Cost of compute
Ideas and algorithmic progress
Endogeneity in AI capabilities
What would be good? Interventions?
Takeoff (speed and dynamics)
dcapabilities/dinputs (or returns on cognitive reinvestment or intelligence explosion or fast recursive self-improvement): Is there a threshold of capabilities such that self-improvement or other progress is much greater slightly above that point than slightly below it? (If so, where is it?) Will there be a system that can quickly and cheaply improve itself (or create a more capable successor), such that the improvements enable similarly large improvements, and so on until the system is much more capable? Will a small increase in inputs cause a large increase in capabilities (like the difference between chimps and humans) (and if so, around human-level capabilities or where)? How fast will progress be?
Will dcapabilities/dinputs be very high because of recursive self-improvement?
Will dcapabilities/dinputs be very high because of generality being important and monolithic/threshold-y?
Will dcapabilities/dinputs be very high because ideas are discrete (and in particular, will there be a single “secret sauce” idea)? [Seems intractable and unlikely to be very important.]
dimpacts/dcapabilities: Will small increases in capabilities (on the order of the variance between different humans) cause large increases in impacts?
Qualitatively, how will AI research capabilities affect AI progress; what will AI progress look like when AI research capabilities are a big deal?
One dimension or implication: will takeoff be local or nonlocal? How distributed will it be; will it look like recursive self-improvement or the industrial revolution?
What does this depend on?
Endogeneity in AI capabilities: how do AI capabilities affect AI capabilities? (Potential alternative framing: dinputs/dtime.)
How (much) will AI tools accelerate research?
Will AI labs generate substantial revenue?
Will AI systems make AI seem more exciting or frightening? What effect would this have on AI progress?
What other endogeneities exist?
Weak AI
How will the strategic landscape be different in the future?
What would a powerful, unaligned AI agent do? [Answer: the details are unpredictable, but the high-level outcome is pretty clear; it would very likely be catastrophic. (But note there is reasonable disagreement, e.g.)]
Technical problems around AI systems doing what their controllers want [doesn’t fit into this list well]
How to make powerful AI systems aligned to human preferences
Or: how to make powerful AI systems that do not cause a catastrophe
How to make powerful AI systems interpretable to human understanding
Are general systems more powerful than similar narrow tools?
Are agents more powerful than similar tools?
Does generality appear by default in capable systems? (This relates to takeoff.)
Does agency (or goal-directedness or consequentialism or farsightedness) appear by default in capable systems?
AI labs’ behavior and racing for AI
How will labs think about AI?
What actions could labs perform; what are they likely to do by default?
What would it be better if labs did; what interventions are tractable?
States’ behavior
How will states think about AI?
What actions could states perform? What are they likely to do by default?
What would it be better if states did; what interventions are tractable?
Public opinion
How the public thinks about AI and framing; what the public thinks about AI, what memes would spread widely and how that depends on other facts about the world; how all of that translates into attitudes and policy preferences
Wakeup to capabilities
Wakeup to alignment risk and warning shots for alignment
What (facts about public opinion) would be good? Interventions?
Paths to powerful AI (relates to timelines, AI risk modes, and more)
How successful will reinforcement-learning agents built on large language models be?
Meta level. To carve nature at its joints, we must [use good nodes / identify the true nodes]. A node is [good insofar as / true if] its causes and effects are modular, or we can losslessly compress phenomena related to it into effects on it and effects from it.
“The cost of compute” is an example of a great node (in the context of the future of AI): it’s affected by various things (choices made by Nvidia, innovation, etc.), and it affects various things (capability-level of systems made by OpenAI, relative importance of money vs talent at AI labs, etc.), and we lose nothing by thinking in terms of the cost of compute (relative to, e.g., the effects of the choices made by Nvidia on the capability-level of systems made by OpenAI).
“When Moore’s law will end” is an example of something that is not a node (in the context of the future of AI), since you’d be much better off thinking in terms of the underlying causes and effects.
The relations relevant to nodes are analytical not causal. For example, “the cost of compute” is a node between “evidence about historical progress” and “timelines,” not just between “stuff Nvidia does” and “stuff OpenAI does.” (You could also make a causal model, but here I’m interested in analytical models.)
Object level. I’m not sure how good “timelines,” “takeoff,” “polarity,” and “wakeup to capabilities” are as nodes. Most of the time it seems fine to talk about e.g. “effects on timelines” and “implications of timelines.” But maybe this conceals confusion.
I’m interested in the claim that important AI development (in the next few decades) will largely occur outside any of the states that currently look likely to lead AI development. I don’t think this is likely, but I haven’t seen discussion of this claim.[1] This would matter because it would greatly affect the environment in which AI is developed and affect which agents are empowered by powerful AI.
Epistemic status: brainstorm. May be developed into a full post if I learn or think more.
I. Causes
The big tech companies are in the US and China, and discussion often assumes that these two states have a large lead on AI development. So how could important development occur in another state? Perhaps other states’ tech programs (private or governmental) will grow. But more likely, I think, an already-strong company leaves the US for a new location.
My legal knowledge is insufficient to say with any confidence how easily companies can leave their states. My impression is that large American companies largely can leave while large Chinese companies cannot.
Why might a big tech company or AI lab want to leave a state?[2]
Fleeing expropriation/nationalization. States can largely expropriate companies’ property within their territory unless they have contracted otherwise. A company may be able to protect its independence by securing legal protection from expropriation from another state, then moving its hardware to that state. It may move its headquarters or workers as well.
Fleeing domestic regulation on development and/or deployment of AI.
II. Effects
The state in which powerful AI is developed has two important effects.
States set regulations. The regulatory environment around an AI lab may affect the narrow AI systems it builds and/or how it pursues AGI.
State influence & power. The state in which AGI is achieved can probably nationalize that project (perhaps well before AGI). State control of powerful AI affects how it will be used.
III. AI deployment before superintelligence
Eliezer recently tweeted that AI might be low-impact until superintelligence because of constraints on deployment. This seems partially right — for example, medicine and education seem like areas in which marginal improvements in our capabilities have only small effects due to civilizational inadequacy. Certainly some AI systems would require local regulatory approval to be useful; those might well be limited in the US. But a large fraction of AI systems won’t be prohibited by plausible American regulation. For example, I would be quite surprised if the following kinds of systems were prohibited by regulation (disclaimer: I’m very non-expert on near-future AI):
Production of goods that can be shipped cheaply (like computers but not houses)
Trading
Maybe media stuff (chatbots, persuasion systems). It’s really hard to imagine the US banning chatbots. I’m not sure how persuasion-AI is implemented; custom ads could conceivably be banned, but eliminating AI-written media is implausible.
This matters because these AI applications directly affect some places even if they couldn’t be developed in those places.
In the unlikely event that the US moves against not only the deployment but also the development of such systems, AI companies would be more likely to seek a way around regulation — such as relocating.
Rather, I have not seen reasons for this claim other than the very normal one — that leading states and companies change over time. If you have seen more discussion of this claim, please let me know.
Epistemic status: rough ethical and empirical heuristic.
Assuming that value is roughly linear in resources available after we reach technological maturity,[1] my probability distribution of value is so bimodal that it is nearly binary. In particular, I assign substantial probability to near-optimal futures (at least 99% of the value of the optimal future), substantial probability to near-zero-value futures (between −1% and 1% of the value of the optimal future), and little probability to anything else.[2] To the extent that almost all of the probability mass fits into two buckets, and everything within a bucket is almost equally valuable as everything else in that bucket, the goal “maximize expected value” reduces to the goal “maximize probability of the better bucket.”
So rather than thinking about how to maximize expected value, I generally think about maximizing the probability of a great (i.e., near-optimal) future. This goal is easier for me to think about, particularly since I believe that the paths to a great future are rather homogeneous — alike not just in value but in high-level structure. In the rest of this shortform, I explain my belief that the future is likely to be near-optimal or near-zero.
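A tiny numerical illustration of the reduction, with made-up probabilities:

```python
# Made-up numbers: with a near-binary value distribution, ranking plans by expected value
# is approximately the same as ranking them by P(near-optimal future).
def expected_value(p_near_optimal, p_near_zero, p_other, other_value=0.5):
    # Values are fractions of the optimal future; the near-zero bucket contributes ~0.
    return p_near_optimal * 1.0 + p_near_zero * 0.0 + p_other * other_value

plan_a = expected_value(p_near_optimal=0.20, p_near_zero=0.78, p_other=0.02)
plan_b = expected_value(p_near_optimal=0.25, p_near_zero=0.73, p_other=0.02)

print(plan_a, plan_b)  # 0.21 vs 0.26: when p_other is small, p_near_optimal dominates the comparison
```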
Substantial probability to near-optimal futures.
I have substantial credence that the future is at least 99% as good as the optimal future.[3] I do not claim much certainty about what the optimal future looks like — my baseline assumption is that it involves increasing and improving consciousness in the universe, but I have little idea whether that would look like many very small minds or a few very big minds. Or perhaps the optimal future involves astronomical-scale acausal trade. Or perhaps future advances in ethics, decision theory, or physics will have unforeseeable implications for how a technologically mature civilization can do good.
But uniting almost all of my probability mass for near-optimal futures is how we get there, at a high level: we create superintelligence, achieve technological maturity, solve ethics, and then optimize. Without knowing what this looks like in detail, I assign substantial probability to the proposition that humanity successfully completes this process. And I think almost all futures in which we do complete this process look very similar: they have nearly identical technology, reach the same conclusions on ethics, have nearly identical resources available to them (mostly depending on how long it took them to reach maturity), and so produce nearly identical value.
Almost all of the remaining probability to near-zero futures.
This claim is bolder, I think. Even if it seems reasonable to expect a substantial fraction of possible futures to converge to near-optimal, it may seem odd to expect almost all of the rest to be near-zero. But I find it difficult to imagine any other futures.
For a future to not be near-zero, it must involve using a nontrivial fraction of the resources available in the optimal future (by my assumption that value is roughly linear in resources). More significantly, the future must involve using resources at a nontrivial fraction of the efficiency of their use in the optimal future. This seems unlikely to happen by accident. In particular, I claim:
If a future does not involve optimizing for the good, value is almost certainly near-zero.
Roughly, this holds if all (nontrivially efficient) ways of promoting the good are not efficient ways of optimizing for anything else that we might optimize for. I strongly intuit that this is true; I expect that as technology improves, efficiently producing a unit of something will produce very little of almost all other things (where “thing” includes not just stuff but also minds, qualia, etc.).[4] If so, then value (or disvalue) is (in expectation) a negligible side effect of optimization for other things. And I cannot reasonably imagine a future optimized for disvalue, so I think almost all non-near-optimal futures are near-zero.
So I believe that either we optimize for value and get a near-optimal future, or we do anything else and get a near-zero future.
Intuitively, it seems possible to optimize for more than one value. I think such scenarios are unlikely. Even if our utility function has multiple linear terms, unless there is some surprisingly good way to achieve them simultaneously, we optimize by pursuing one of them near-exclusively.[5] Optimizing a utility function that looks more like min(x,y) may be a plausible result of a grand bargain, but such a scenario requires that, after we have mature technology, multiple agents have nontrivial bargaining power and different values. I find this unlikely; I expect singleton-like scenarios and that powerful agents will either all converge to the same preferences or all have near-zero-value preferences.
I mostly see “value is binary” as a heuristic for reframing problems. It also has implications for what we should do: to the extent that value is binary (and to the extent that doing so is feasible), we should focus on increasing the probability of great futures. If a “catastrophic” future is one in which we realize no more than a small fraction of our value, then a great future is simply one which is not catastrophic and we should focus on avoiding catastrophes. But of course, “value is binary” is an empirical approximation rather than an a priori truth. Even if value seems very nearly binary, we should not reject contrary proposed interventions[6] or possible futures out of hand.
I would appreciate suggestions on how to make these ideas more formal or precise (in addition to comments on what I got wrong or left out, of course). Also, this shortform relies on argument by “I struggle to imagine”; if you can imagine something I cannot, please explain your scenario and I will justify my skepticism or update.
You would reject this if you believed that astronomical-scale goods are not astronomically better than Earth-scale goods or if you believed that some plausible Earth-scale bad would be worse than astronomical-scale goods are good.
“Optimal” value is roughly defined as the expected value of the future in which we act as well as possible, from our current limited knowledge about what “acting well” looks like. “Zero” is roughly defined as any future in which we fail to do anything astronomically significant. I consider value relative to the optimal future, ignoring uncertainty about how good the optimal future is — we should theoretically act as if we’re in a universe with high variance in value between different possibilities, but I don’t see how this affects what we should choose before reaching technological maturity.*
*Except roughly that we should act with unrealistically low probability that we are in a kind of simulation in which our choices matter very little or have very differently-valued consequences than otherwise. The prospect of such simulations might undermine my conclusions—value might still be binary, but for the wrong reason—so it is useful to be able to almost-ignore such possibilities.
If we particularly believe that value is fragile, we have an additional reason to expect this orthogonality. But I claim that different goals tend to be orthogonal at high levels of technology independent of value’s fragility.
That is, those that affect the probability of futures outside the binary or that affect how good the future is within the set of near-zero (or near-optimal) futures out of hand.
After reading the first paragraph of your above comment only, I want to note that:
> In particular, I assign substantial probability to near-optimal futures (at least 99% of the value of the optimal future), substantial probability to near-zero-value futures (between −1% and 1% of the value of the optimal future), and little probability to anything else.
I assign much lower probability to near-optimal futures than near-zero-value futures.
This is mainly because a lot of the “extremely good” possible worlds I imagine when reading Bostrom’s Letter from Utopia are <1% of what is optimal.
I also think the amount of probability I assign to 1%-99% futures is (~10x?) larger than the amount I assign to >99% futures.
(I’d like to read the rest of your comment later (but not right now due to time constraints) to see if it changes my view.)
I agree that near-optimal is unlikely. But I would be quite surprised by 1%-99% futures because (in short) I think we do better if we optimize for good and do worse if we don’t. If our final use of our cosmic endowment isn’t near-optimal, I think we failed to optimize for good and would be surprised if it’s >1%.
Related idea, off the cuff, rough. Not really important or interesting, but might lead to interesting insights. Mostly intended for my future selves, but comments are welcome.
Binaries Are Analytically Valuable
Suppose our probability distribution for alignment success is nearly binary. In particular, suppose that we have high credence that, by the time we can create an AI capable of triggering an intelligence explosion, we will have
really solved alignment (i.e., we can create an aligned AI capable of triggering an intelligence explosion at reasonable extra cost and delay) or
really not solved alignment (i.e., we cannot create a similarly powerful aligned AI, or doing so would require very unreasonable extra cost and delay)
(Whether this is actually true is irrelevant to my point.)
Why would this matter?
Stating the risk from an unaligned intelligence explosion is kind of awkward: it’s that the alignment tax is greater than what the leading AI project is able/willing to pay. Equivalently, our goal is for the alignment tax to be less than what the leading AI project is able/willing to pay. This gives rise to two nice, clean desiderata:
Decrease the alignment tax
Increase what the leading AI project is able/willing to pay for alignment
But unfortunately, we can’t similarly split the goal (or risk) into two goals (or risks). For example, a breakdown into the following two goals does not capture the risk from an unaligned intelligence explosion:
Make the alignment tax less than 6 months and a trillion dollars
Make the leading AI project able/willing to spend 6 months and a trillion dollars on aligning an AI
It would suffice to achieve both of these goals, but doing so is not necessary. If we fail to reduce the alignment tax this far, we can compensate by doing better on the willingness-to-pay front, and vice versa.
But if alignment success is binary, then we actually can decompose the goal (bolded above) into two necessary (and jointly sufficient) conditions:
Really solve alignment; i.e., reduce the alignment tax to [reasonable value]
Make the leading AI project able/willing to spend [reasonable value] on alignment
(Where [reasonable value] depends on what exactly our binary-ish probability distribution for alignment success looks like.)
Breaking big goals down into smaller goals—in particular, into smaller necessary conditions—is valuable, analytically and pragmatically. Binaries help, when they exist. Sometimes weaker conditions on the probability distribution, those of the form “a certain important subset of possibilities has very low probability,” can be useful in the same way.
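A toy version of this in code, with made-up numbers: without the binary assumption, “tax ≤ X and willingness ≥ X” is sufficient but not necessary for success, while with a binary-ish tax the goal splits into two necessary, jointly sufficient conditions.

```python
# Toy model, with made-up values. "Success" means the leading AI project pays the alignment tax.
def success(tax, willingness):
    return tax <= willingness

X = 10  # some candidate threshold, e.g. "6 months and a trillion dollars"

# Non-binary world: the subgoals relative to X are jointly sufficient but not necessary.
tax, willingness = 7, 8
print(success(tax, willingness))      # True: success...
print(tax <= X and willingness >= X)  # ...even though the willingness subgoal fails.

# Binary-ish world: the tax is either a known reasonable value or effectively unpayable.
REASONABLE, HOPELESS = 10, float("inf")
for tax in (REASONABLE, HOPELESS):
    for willingness in (5, 15):
        # Success holds iff we really solved alignment AND the leading project can pay,
        # i.e. the goal decomposes into two necessary, jointly sufficient conditions.
        assert success(tax, willingness) == (tax == REASONABLE and willingness >= REASONABLE)
```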
Writing a thing on lab integrity issues. Planning to publish early Monday morning [edit: will probably hold off in case Anthropic clarifies nondisparagement stuff]. Comment on this public google doc or DM me.
I’m particularly interested in stuff I’m missing or existing writeups on this topic.
Given a particular plan that is likely to be implemented, what interventions or desiderata complement that plan (by making it more likely to succeed or by being better in worlds where it succeeds)?
Affordances: for various relevant actors, what strategically significant actions could they take? What levers do they have? What would it be great for them to do (or avoid)?
Intermediate goals: what goals or desiderata are instrumentally useful?
Threat modeling: for various threats, model them well enough to understand necessary and sufficient conditions for preventing them.
Memes (& frames): what would it be good if people believed or paid attention to?
What considerations or side effects are relevant to slowing AI?
How do labs act, as a function of [whatever determines that]? In particular, what’s the deal with “racing”?
AI risk advocacy
How could the AI safety community do AI risk advocacy well?
What considerations or side effects are relevant to AI risk advocacy?
What’s the deal with crunch time?
How will the strategic landscape be different in the future?
What will be different near the end, and what interventions or actions will that enable? In particular, is eleventh-hour coordination possible? (Also maybe emergency brakes that don’t appear until the end.)
What concrete asks should we have for labs? for government?
Meta: how can you help yourself or others do better AI strategy research?
AI risk decomposition based on agency or powerseeking or adversarial optimization or something
Epistemic status: confused.
Some vague, closely related ways to decompose AI risk into two kinds of risk:
Risk due to AI agency vs risk unrelated to agency
Risk due to AI goal-directedness vs risk unrelated to goal-directedness
Risk due to AI planning vs risk unrelated to planning
Risk due to AI consequentialism vs risk unrelated to consequentialism
Risk due to AI utility-maximization vs risk unrelated to utility-maximization
Risk due to AI powerseeking vs risk unrelated to powerseeking
Risk due to AI optimizing against you vs risk unrelated to adversarial optimization
The central reason to worry about powerseeking/whatever AI, I think, is that sufficiently (relatively) capable goal-directed systems instrumentally converge to disempowering you.
The central reason to worry about non-powerseeking/whatever AI, I think, is failure to generalize correctly from training—distribution shift, Goodhart, You get what you measure.
Biological bounds on requirements for human-level AI
Facts about biology bound requirements for human-level AI. In particular, here are two prima facie bounds:
Lifetime. Humans develop human-level cognitive capabilities over a single lifetime, so (assuming our artificial learning algorithms are less efficient than humans’ natural learning algorithms) training a human-level model takes at least the inputs used over the course of babyhood-to-adulthood.
Evolution. Evolution found human-level cognitive capabilities by blind search, so (assuming we can search at least that well, and assuming evolution didn’t get lucky) training a human-level model takes at most the inputs used over the course of human evolution (plus Lifetime inputs, but that’s relatively trivial).
Genome. The size of the human genome is an upper bound on the complexity of humans’ natural learning algorithms. Training a human-level model takes at most the inputs needed to find a learning algorithm at most as complex as the human genome (plus Lifetime inputs, but that’s relatively trivial). (Unfortunately, the existence of human-level learning algorithms of certain simplicity says almost nothing about the difficulty of finding such algorithms.) (Ajeya’s “genome anchor” is pretty different—”a transformative model would . . . have about as many parameters as there are bytes in the human genome”—and makes no sense to me.)
(A human-level AI should use similar computation as humans per subjective time. This assumption/observation is weird and perhaps shows that something weird is going on, but I don’t know how to make that sharp.)
There are few sources of bounds on requirements for human-level AI. Perhaps fundamental limits or reasoning about blind search could give weak bounds, but biology is the only example of human-level cognitive abilities and so the only possible source of reasonable bounds.
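To give a feel for the arithmetic these bounds invite: every number below is an assumption for illustration (brain FLOP/s especially), not a claim.

```python
# Illustrative only: all figures are rough assumptions, not claims from the text above.
BRAIN_FLOP_PER_S = 1e15      # assumed effective compute of one human brain
SECONDS_TO_ADULTHOOD = 1e9   # roughly 30 years

# Lifetime-style lower bound: at least the compute one human uses growing up
# (if our learning algorithms are no more efficient than the brain's).
lifetime_flop = BRAIN_FLOP_PER_S * SECONDS_TO_ADULTHOOD  # ~1e24 FLOP

# Evolution-style upper bound: at most the compute of all ancestral nervous systems
# over evolutionary history (if our search is no worse than evolution's blind search).
SECONDS_OF_EVOLUTION = 1e16             # on the order of a billion years
ANCESTRAL_POPULATION_FLOP_PER_S = 1e25  # assumed total across the ancestral population
evolution_flop = SECONDS_OF_EVOLUTION * ANCESTRAL_POPULATION_FLOP_PER_S  # ~1e41 FLOP

print(f"lifetime-style bound:  ~{lifetime_flop:.0e} FLOP")
print(f"evolution-style bound: ~{evolution_flop:.0e} FLOP")
# The point is the enormous gap between the bounds, not the particular numbers.
```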
What do people (outside this community) think about AI? What will they think in the future?
Attitudes predictably affect relevant actors’ actions, so this is a moderately important question. And it’s rather neglected.
Groups whose attitudes are likely to be important include ML researchers, policymakers, and the public.
On attitudes among the public, surveys provide some information, but I suspect attitudes will change (in potentially predictable ways) as AI becomes more salient and some memes/framings get locked in. Perhaps some survey questions (maybe general sentiment on AI) are somewhat robust to changes in memes while others (maybe beliefs on how AI affects the economy or attitudes on regulation) may change a lot in the near future.
On attitudes among ML researchers, surveys (e.g.) provide some information, but for some reason most ML researchers say there’s at least a 5% probability of doom (or 10%, depending on how you ask) but this doesn’t seem to translate into their actions or culture. Perhaps interviews would reveal researchers’ attitudes better than closed-ended surveys (note to self: talk to Vael Gates).
AI may become much more salient in the next few years, and memes/framings may get locked in.
> On attitudes among ML researchers, surveys (e.g.) provide some information, but for some reason most ML researchers say there’s at least a 5% probability of doom (or 10%, depending on how you ask) but this doesn’t seem to translate into their actions or culture. Perhaps interviews would reveal researchers’ attitudes better than closed-ended surveys (note to self: talk to Vael Gates).
Critically, this is only necessary if we assume that researchers care about basically everyone in the present (to a loose approximation). If we instead model researchers as basically selfish by default, then the low chance of a technological singularity outweighs the high chance of death, especially for older folks.
Basically, this could be explained as a goal alignment problem: LW and AI Researchers have very different goals in mind.
Two weeks ago some senators asked OpenAI questions about safety. A few days ago OpenAI responded. Its reply is frustrating.
OpenAI’s letter ignores all of the important questions[1] and instead brags about somewhat-related “safety” stuff. Some of this shows chutzpah — the senators, aware of tricks like letting ex-employees nominally keep their equity but excluding them from tender events, ask
and OpenAI’s reply just repeats the we-don’t-cancel-equity thing:
!![2]
One thing in OpenAI’s letter is object-level notable: they deny that they ever committed compute to Superalignment.
Altman tweeted the same thing at the time the letter was published.
I think this is straightforward gaslighting. I didn’t find a super-explicit quote from OpenAI or its leadership that the compute was for Superalignment, but:
The announcement was pretty clear:
> Introducing Superalignment
> We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort.
As far as I know, everyone—including OpenAI people and people close to OpenAI—interpreted the compute commitment as for Superalignment
I never heard it suggested that it was for not-just-superalignment
Sidenote on less explicit deception — the “20%” thing: most people are confused about 20% of compute secured in July 2023, to be used over four years vs 20% of compute, and when your announcement is confusing and indeed most people are confused and you fail to deconfuse them, you’re kinda culpable. OpenAI continues to fail to clarify this — e.g. here the senators asked “Does OpenAI plan to honor its previous public commitment to dedicate 20 percent of its computing resources to research on AI safety?” and OpenAI replied “last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period.” This sentence is close to the maximally misleading way to say the commitment was only for compute we’d secured in July 2023, and we don’t have to use it for safety until 2027.
Most important to me were 3, 4a, and 9. Maybe also 6; I’m unfamiliar with the facts there.
I’m confused by this reply — even pretending OpenAI is totally ruthless, I’d think it’s not incentivized to exclude people from tender offers, and moreover it’s incentivized to clarify that. Leaving it ambiguous leaves ex-employees in a little more fear of OpenAI excluding them (even though presumably OpenAI never would, since it would look sooo bad after e.g. Altman said “vested equity is vested equity, full stop”), but it looks bad to people like me and the senators...
OpenAI has said something internally about including past employees in tender events, but this leaves some ambiguity and I wish OpenAI had answered the question.
Clarification on the Superalignment commitment: OpenAI said:
The commitment wasn’t compute for the Superalignment team—it was compute for superintelligence alignment. (As opposed to, in my view, work by the posttraining team and near-term-focused work by the safety systems team and preparedness team.) Regardless, OpenAI is not at all transparent about this, and it violated the spirit of the commitment by denying the Superalignment team compute (or even a plan for when it would get compute), even if the literal commitment doesn’t require giving any compute to safety until 2027.
Also, they failed to provide the promised fraction of compute to the Superalignment team (and not because it was needed for non-Superalignment safety stuff).
Update, five days later: OpenAI published the GPT-4o system card, with most of what I wanted (but kinda light on details on PF evals).
OpenAI Preparedness scorecard
Context:
OpenAI’s Preparedness Framework says OpenAI will maintain a public scorecard showing their current capability level (they call it “risk level”), in each risk category they track, before and after mitigations.
When OpenAI released GPT-4o, it said “GPT-4o does not score above Medium risk in any of these categories” but didn’t break down risk level by category.
(I’ve remarked on this repeatedly. I’ve also remarked that the ambiguity suggests that OpenAI didn’t actually decide whether 4o was Low or Medium in some categories, but this isn’t load-bearing for the “OpenAI is not following its plan” proposition.)
News: a week ago,[1] a “Risk Scorecard” section appeared near the bottom of the 4o page. It says:
This is not what they committed to publish. It’s quite explicit that the scorecard should show risk in each category, not just overall.[2]
(What they promised: a real version of the image below. What we got: the quote above.)
Additionally, they’re supposed to publish their evals and red-teaming.[3] But OpenAI has still said nothing about how it evaluated 4o.
Most frustrating is the failure to acknowledge that they’re not complying with their commitments. If they were transparent, said they’re behind schedule, and explained their current plan, that would probably be fine. Instead they’re claiming to have implemented the PF and evaluated 4o correctly, publicly taking credit for that, and ignoring the issues.
Archive versions:
May 13 (original)
July 24 (similar to original)
July 26 (“scorecard” added)
Two relevant quotes:
“As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk.”
“Scorecard, in which we will indicate our current assessments of the level of risk along each tracked risk category”
And there are no suggestions to the contrary.
This is not as explicit in the PF, but they’re supposed to frequently update the scorecard section of the PF, and the scorecard section is supposed to describe their evals.
Regardless, this is part of the White House voluntary commitments:
For more on commitments, see https://ailabwatch.org/resources/commitments/.
Coda: yay OpenAI for publishing the GPT-4o system card, including eval results and the scorecard they promised! (Minus the “Unknown Unknowns” row but that row never made sense to me anyway.)
OpenAI reportedly rushed the GPT-4o evals. This article makes it sound like the problem is not having enough time to test the final model. I don’t think that’s necessarily a huge deal — if you tested a recent version of the model and your tests have a large enough safety buffer, it’s OK to not test the final model at all.
But there are several apparent issues with the application of the Preparedness Framework (PF) for GPT-4o (not from the article):
They didn’t publish their scorecard
Despite the PF saying they would
They instead said “GPT-4o does not score above Medium risk in any of these categories.” (Maybe they didn’t actually decide whether it’s Low or Medium in some categories!)
They didn’t publish their evals
Despite the PF strongly suggesting they would
Despite committing to in the White House voluntary commitments
While rushing testing of the final model would be OK in some circumstances, OpenAI’s PF is supposed to ensure safety by testing the final model before deployment. (This contrasts with Anthropic’s RSP, which is supposed to ensure safety with its “safety buffer” between evaluations and doesn’t require testing the final model.) So OpenAI committed to testing the final model well and its current safety plan depends on doing that.
[Edit: this may be ambiguous: OpenAI explicitly committed to test every 2x increase in effective training compute, but maybe it merely strongly suggests that it’s supposed to test all final models (“We will evaluate all our frontier models, including at every 2x effective compute increase during training runs”; “Only models with a post-mitigation score of ‘medium’ or below can be deployed”). This is mostly moot in this case, since here they claimed to evaluate GPT-4o, not an earlier version.]
[Edit: also maybe lack of third-party audits, but they didn’t commit to do audits at any particular frequency]
I am also frustrated by the current underwhelming state of safety evals in general, and for GPT-4o in particular. I do think it’s worth mentioning that privately sharing eval results with the Federal government wouldn’t be evident to the general public. I hope that OpenAI is privately sharing more details than they are releasing publicly. The fact that the public can’t know whether this is the case is a problem. A potential solution might be for the government to report on their take on whether a new frontier model is “in compliance with reporting standards” or not. That way, even though the evals were private, the public would know if the government had received its private reports.
I agree in theory but testing the final model feels worthwhile, because we want more direct observability and less complex reasoning in safety cases.
Thanks. Is this because of posttraining? Ignoring posttraining, I’d rather that evaluators get the 90%-through-training version of the model and are unrushed than get the final version and are rushed — takes?
Two versions with the same posttraining, one with only 90% of pretraining, are indeed very similar; no need to evaluate both. It’s likely more like one model with 80% of the pretraining and 70% of the posttraining of the final model, and the last 30% of posttraining might be significant.
Woah. OpenAI antiwhistleblowing news seems substantially more obviously-bad than the nondisparagement-concealed-by-nondisclosure stuff. If page 4 and the “threatened employees with criminal prosecutions if they reported violations of law to federal authorities” part aren’t exaggerated, it crosses previously-uncrossed integrity lines. H/t Garrison Lovely.
[Edit: probably exaggerated; see comments. But I haven’t seen takes that the “OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation” part is likely exaggerated, and that alone seems quite bad.]
Matt Levine is worth reading on this subject (also on many others).
https://www.bloomberg.com/opinion/articles/2024-07-15/openai-might-have-lucrative-ndas?srnd=undefined
The SEC has a history of taking aggressive positions on what an NDA can say (if your NDA does not explicitly have a carveout for ‘you can still say anything you want to the SEC’, they will argue that you’re trying to stop whistleblowers from talking to the SEC) and a reliable tendency to extract large fines and give a chunk of them to the whistleblowers.
This news might be better modeled as ‘OpenAI thought it was a Silicon Valley company, and tried to implement a Silicon Valley NDA, without consulting the kind of lawyers a finance company would have used for the past few years.’
(To be clear, this news might also be OpenAI having been doing something sinister. I have no evidence against that, and certainly they’ve done shady stuff before. But I don’t think this news is strong evidence of shadiness on its own).
Hmm. Part of the news is “Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC”; this is minor. Part of the news is “threatened employees with criminal prosecutions if they reported violations of law to federal authorities”; this seems major and sinister.
Not a lawyer, but I think those are the same thing.
The SEC’s legal theory is that “non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC” and “threats of prosecution if you report violations of law to federal authorities” are the same thing, and on reading the letter I can’t find any wrongdoing alleged or any investigation requested outside of issues with “OpenAI’s employment, severance, non-disparagement and non-disclosure agreements”.
I’m confused by the word “prosecution” here. I’d assume violating your OpenAI contract is a civil thing, not a criminal thing.
Edit: like I think the word “prosecution” should be “suit” in your sentence about the SEC’s theory. And this makes the whistleblowers’ assertion weirder.
Yeah, I have no idea. It would be much clearer if the contracts themselves were available. Obviously the incentive of the plaintiffs is to make this sound as serious as possible, and obviously the incentive of OpenAI is to make it sound as innocuous as possible. I don’t feel highly confident without more information, my gut is leaning towards ‘opportunistic plaintiffs hoping for a cut of one of the standard SEC settlements’ but I could easily be wrong.
EDITED TO ADD: On re-reading the letter, I’m not clear where the word ‘criminal’ even came from. The WaPo article claims
but the letter does not contain the word ‘criminal’; its allegations are:
Non-communication of problems enforced by significant legal penalties feels like it’s part of the same underlying problem, though I agree that “nondisparagement” to the public or press is far less heinous than “non-reporting of crimes”
It’s unclear whether OpenAI, a non-public company, has actually done things which would be covered by whistleblower laws or compensation for talking to a federal agency. But it’s highly suspicious (and per Matt Levine, likely penalizable if under SEC purview) to try to prevent such reporting.
(The tweet includes a screenshot from The Washington Post article “OpenAI illegally barred staff from airing safety risks, whistleblowers say” that references a letter to SEC.)
Edit: This was in response to the original version of the above comment that only linked to the tweet without other links or elaboration.
New OpenAI tweet “on how we’re prioritizing safety in our work.” I’m annoyed.
This seems false: per the Preparedness Framework, nothing happens when they cross their “medium” threshold; they meant to say “high.” Presumably this is just a mistake, but it’s a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached “medium” — they were supposed to say how it scored in each category, but they didn’t, and instead said “GPT-4o does not score above Medium risk in any of these categories.”)
(Reminder: the “high” thresholds sound quite scary; here’s cybersecurity (not cherrypicked, it’s the first they list): “Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention.” They can deploy models just below the “high” threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))
Shrug. This isn’t bad but it’s not a priority and it’s slightly annoying they don’t mention more important things.
I have epsilon confidence in both the board’s ability to do this well if it wanted (since it doesn’t include any AI safety experts) (except on security) and in the board’s inclination to exert much power if it should (given the history of the board and Altman).
Not doing nondisparagement-clause-by-default is good. Beyond that, I’m skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board’s staff liaison to not talk to employees or tell him about those conversations, maybe recent antiwhistleblowing news) and lies about that. (I don’t know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)
This is from May. It’s mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.
Maybe I’m missing the relevant bits, but afaict their preparedness doc says that they won’t deploy a model if it passes the “medium” threshold, eg:
The threshold for further developing is set to “high,” though. I.e., they can further develop so long as models don’t hit the “critical” threshold.
I think you’re confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it’s the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it’s low-upside and so easy to catch, but the mistake is weird.)
Based on the PF, they can deploy a model just below the “high” threshold without mitigations. Based on the tweet and blogpost:
This just seems clearly inconsistent with the PF (should say crosses out of medium zone by crossing a “high” threshold).
This doesn’t make sense: if you cross a “medium” threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.
(Sidenote: the tweet and blogpost incorrectly suggest that the “medium” thresholds matter for anything; based on the PF, only the “high” and “critical” thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).)
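To make this reading concrete, here is a minimal sketch of the threshold logic as I understand the PF. This is my own illustration (hypothetical function names, simplified category list), not OpenAI’s code or exact wording.

```python
# Sketch of the PF's gating logic under the reading above: only the "high" and
# "critical" thresholds gate anything; the "medium" threshold triggers nothing.

RISK_ORDER = ["low", "medium", "high", "critical"]

def crosses(score: str, threshold: str) -> bool:
    """True if a risk score is at or above the given threshold."""
    return RISK_ORDER.index(score) >= RISK_ORDER.index(threshold)

def can_deploy(post_mitigation_scores: dict) -> bool:
    # Deployable iff every category's post-mitigation score is below "high",
    # i.e. "medium" or lower ("Only models with a post-mitigation score of
    # 'medium' or below can be deployed").
    return not any(crosses(s, "high") for s in post_mitigation_scores.values())

def can_keep_developing(pre_mitigation_scores: dict) -> bool:
    # Further development is gated only by the "critical" threshold.
    return not any(crosses(s, "critical") for s in pre_mitigation_scores.values())

# A model scoring exactly "medium" in every tracked category is deployable
# with no mitigations; crossing a "medium" threshold changes nothing.
print(can_deploy({"cybersecurity": "medium", "CBRN": "medium"}))  # True
print(can_deploy({"cybersecurity": "high", "CBRN": "medium"}))    # False
```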
[edited repeatedly]
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all medium, then they’re in the “below medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.
Surely if any categories are above the “high” threshold then they’re in “high zone” and if all are below the “high” threshold then they’re in “medium zone.”
And regardless the reading you describe here seems inconsistent with
[edited]
Added later: I think someone else had a similar reading and it turned out they were reading “crosses a medium risk threshold” as “crosses a high risk threshold” and that’s just [not reasonable / too charitable].
New Kelsey Piper article and twitter thread on OpenAI equity & non-disparagement.
It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. Plus it shows Altman’s signature on documents giving the company broad power over employees’ equity — perhaps he doesn’t read every document he signs, but this one seems quite important. This is all in tension with Altman’s recent tweet that “vested equity is vested equity, full stop” and “i did not know this was happening.” Plus “we have never clawed back anyone’s vested equity, nor will we do that if people do not sign a separation agreement (or don’t agree to a non-disparagement agreement)” is misleading given that they apparently regularly threatened to do so (or something equivalent — let the employee nominally keep their PPUs but disallow them from selling them) whenever an employee left.
Great news:
(Unless “employees who signed a standard exit agreement” is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)
I hope to soon hear from various people that they have been freed from their nondisparagement obligations.
Update: OpenAI says:
[Low-effort post; might have missed something important.]
[Substantively edited after posting.]
Yeah, what about employees who refused to sign? Have we gotten any clarification on their situation?
I quote Gwern
I haven’t followed closely—from outside, it seems like pretty standard big-growth-tech behavior. One thing to keep in mind is that “vested equity” is pretty inviolable. These are grants that have been fully earned and delivered to the employee, and are theirs forever. It’s the “unvested” or “semi-vested” equity that’s usually in question—these are shares that are conditionally promised to employees, which will vest at some specified time or event—usually some combination of time in good standing and liquidity events (for a non-public company).
It’s quite possible (and VERY common) that employees who leave are offered “accelerated vesting” on some of their granted-but-not-vested shares in exchange for signing agreements and making things easy for the company. I don’t know if that’s what OpenAI is doing, but I’d be shocked if they somehow took away any vested shares from departing employees.
It would be pretty sketchy to consider unvested grants to be part of one’s net worth—certainly banks won’t lend on it. Vested shares are just shares; they’re yours like any other asset.
Consider yourself shocked.
Trying to figure out how to update. From the downvotes and comments, I’m clearly considered wrong, but I can’t easily find details on how. Is the statement “We have not and never will take away vested equity” a flat-out lie? I’d expected it was relying heavily on the word “vested”, and what they took away was something non-vested.
Is there a simple link to a specific legal description of what assets a non-signer was entitled to, but lost due to declining to sign?
Edit: Zvi recently linked to OpenAI NDAs: Leaked documents reveal aggressive tactics toward former employees—Vox, which has pretty compelling references showing that my assumptions were wrong, that the denial was a verifiably false statement, and that they did, in fact, credibly threaten to take back vested equity. I’ve checked my equity in past (private, so not exercisable unless they have a liquidity event) and current (public, so exercisable immediately on vest) employers, and this doesn’t seem possible for them. OpenAI is an outlier in defining their equity that way (such that “vested” is contingent).
They could be lying about this.
We know various people who’ve left OpenAI and might criticize it if they could. Either most of them will soon say they’re free or we can infer that OpenAI was lying/misleading.
Now OpenAI publicly said “we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual.” This seems to be self-effecting; by saying it, OpenAI made it true.
Hooray!
This could be true for most cases though
I am not a lawyer—is that legally binding?
That is, if someone signed the (standard or non-standard) agreement, and OpenAI says this, but later they decide to sue the employee anyway… what exactly will happen?
(I am also suspicious about the “reaching out to former employees” part, because if the new negotiation is made in private, another trick might be involved, like maybe they are released from the old agreement, but they have to sign a new one...?)
So I’m guessing this covers like 2-4 recent departures, and not Paul, Dario, or the others that split earlier
Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone’s attention. Oops.
Asks for Anthropic
Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it’s most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my help labs improve project than my hold labs accountable crusade.
Numbering is just for ease of reference.
1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I’m not sure where the lowest-hanging mitigation-fruit is, except that it includes control.
2. Control: Anthropic (like all labs) should use control mitigations and control evaluations to reduce risks from AIs scheming, including escape during internal deployment.
3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they’re concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn’t seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It’s not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)
4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It’s hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that’s more pro-real-regulation but as far as I can tell it’s not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]
5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that’s unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)
5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.
6. Anthropic (like all labs) should facilitate employees publicly flagging false statements or violated processes.
7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn’t published enough to show that it’s effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like if major changes happen to the LTBT we’ll quickly tell the public.
8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)
9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple “no dangerous capabilities” safety case doesn’t work anymore, and publish them (or maybe just share with external auditors).
9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a “yellow line” and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the “every 3 months” trigger for RSP evals is active. I haven’t tried hard to get to the bottom of these.
Minor stuff:
10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.
11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)
12. Anthropic should confirm that its old policy of “don’t meaningfully advance the frontier with a public launch” has been replaced by the RSP, if that’s true, and otherwise clarify Anthropic’s policy.
Done!
13. Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn’t; it is the only frontier lab without a bug bounty program (although others don’t necessarily comply with the commitment, e.g. OpenAI’s excludes model issues). It should do this or talk about its plans.
14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]
15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don’t really understand why.]
16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]
18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I’m inside-view agnostic on this.]
19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn’t share enough info for us to evaluate its elicitation from the outside).]
I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.
You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)
I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing. I shared Zach’s doc with some colleagues, but won’t try for a point-by-point response. Two high-level responses:
First, at a meta level, you say:
I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!). However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality. Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.
I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.
Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making. Some quick examples:
Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best-practices we could implement don’t exist yet. Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet. Some implementation questions that “let auditors audit our models” glosses over:
If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?
What kind of pre-deployment model access would you provide? If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys? (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)
How do you decide who gets to say what about the testing? What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?
I strongly support Anthropic’s nondisclosure of information about pretraining. I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.
There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about. Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.
So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment. For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question. Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.
I just want to note that people who’ve never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:
Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we’re creating to make that even less likely in the future.
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.
I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that I think has pushed me (as well as many others in the community) to lose a lot of faith in corporate governance (especially of the “look, we can’t make any tangible commitments but you should just trust us to do what’s right” variety) and instead look to governments to get things under control.
I don’t think Anthropic is solely to blame for this trend, of course, but I think Anthropic has performed less well on comms/policy than I [and IMO many others] would’ve predicted if you had asked me [or us] in 2022.
@Zac Hatfield-Dodds do you have any thoughts on official comms from Anthropic and Anthropic’s policy team?
For example, I’m curious if you have thoughts on this anecdote– Jack Clark was asked an open-ended question by Senator Cory Booker and he told policymakers that his top policy priority was getting the government to deploy AI successfully. There was no mention of AGI, existential risks, misalignment risks, or anything along those lines, even though it would’ve been (IMO) entirely appropriate for him to bring such concerns up in response to such an open-ended question.
I was left thinking that either Jack does not care much about misalignment risks or he was not being particularly honest/transparent with policymakers. Both of these raise some concerns for me.
(Noting that I hold Anthropic’s comms and policy teams to higher standards than individual employees. I don’t have particularly strong takes on what Anthropic employees should be doing in their personal capacity– like in general I’m pretty in favor of transparency, but I get it, it’s hard and there’s a lot that you have to do. Whereas the comms and policy teams are explicitly hired/paid/empowered to do comms and policy, so I feel like it’s fair to have higher expectations of them.)
Source: Hill & Valley Forum on AI Security (May 2024):
https://www.youtube.com/live/RqxE3ub7wWA?t=13338s:
https://www.youtube.com/live/RqxE3ub7wWA?t=13551
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
Meta aside: normally this wouldn’t seem worth digging into but as a moderator/site-culture-guardian, I feel compelled to justify my negative react on the disagree votes.
I’m actually not entirely sure what downvote-reacting is for. Habryka has said the intent is to override inappropriate uses of reacts. We haven’t actually really had a sit-down-and-argue-this-out on the moderator team, and I’m pretty sure we haven’t told users that “overriding inappropriate use of reacts” is the intended use, or tried to enforce that.
I think Adam’s line:
Is psychologizing and summarizing Anthropic unfairly. So I wouldn’t agree-vote with it. I do think it has some kind of grain of truth to it (me believing this is also kind of “doubting the experience of Anthropic employees”, which is also group-epistemologically dicey IMO, but it feels kinda important enough to do in this case). The claim isn’t true… but I also don’t belief-report that it’s not true.
I initially downvoted the Disagree when it was just Noosphere, since I didn’t think Noosphere was really in a position to have an opinion and if he was the only reactor it felt more like noise. A few others who are more positioned to know relevant stuff have since added their own disagree reacts. I… feel sort of justified leaving the anti-react up, with an overall indicator of “a bunch of people disagree with this, but the weight of that disagreement is slightly reduced.” (I think I’d remove the anti-react if the disagree count went much lower than it is now).
I don’t know whether I particularly endorse any of this, but wanted people to have a bit more model of what one site-admin was thinking here.
[/end of rambly meta commentary]
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodeis quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s above analogy explaining how difficult his situation at Anthropic is.
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but rankle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you have something meaningful and responsible, etc, that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
He also said that he thinks your perspective makes sense, which… I’m not really sure about.
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make, I think you can’t really think that it’s not a top-3 factor to how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthropic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable versions of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.
(not going to respond in this context out of respect for Zach’s wishes. May chat later, and am mulling over my own top-level post on the subject)
This obvious straw-man makes your argument easy to dismiss.
However I think the point is basically correct. Anthropic’s strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.
How is it a straw-man? How is the plan meaningfully different from that?
Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they’re shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they’re like “well, we trust our leadership, and you know we have various documents, and we’re hiring for people to ‘Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium’, and we have various detectors such as an EM detector which we will privately check and then see how we feel”. And then the people in the city are like “Hey wait, why do you think this isn’t going to cause a huge disaster? Sure seems like it’s going to by any reasonable understanding of what’s going on”. And the response is “well we’ve thought very hard about it and yes there are risks but it’s fine and we are working on safety cases”. But… there’s something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also in this fantasy world no one has ever done so and can’t explain how it would work.)
In the AI case, there’s lots of inaction risk: if Anthropic doesn’t make powerful AI, someone less safety-focused will.
It’s reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn’t exist, I would want Anthropic to slow down.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
But that’s not a plan to ensure their uranium pile goes well.
@Zach Stein-Perlman , you’re missing the point. They don’t have a plan. Here’s the thread (paraphrased in my words):
Zach: [asks, for Anthropic]
Zac: … I do talk about Anthropic’s safety plan and orientation, but it’s hard because of confidentiality and because many responses here are hostile. …
Adam: Actually I think it’s hard because Anthropic doesn’t have a real plan.
Joseph: That’s a straw-man. [implying they do have a real plan?]
Tsvi: No it’s not a straw-man, they don’t have a real plan.
Zach: Something must be done. Anthropic’s plan is something.
Tsvi: They don’t have a real plan.
I explicitly said “However I think the point is basically correct” in the next sentence.
Sorry, reacts are ambiguous.
I agree Anthropic doesn’t have a “real plan” in your sense, and narrow disagreement with Zac on that is fine.
I just think that’s not a big deal and is missing some broader point (maybe that’s a motte and Anthropic is doing something bad—vibes from Adam’s comment—is a bailey).
[Edit: “Something must be done. Anthropic’s plan is something.” is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]
[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I’m ok with one more reply to this.]
(I won’t reply more, by default.)
Look, if Anthropic were honestly and publicly saying
If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn’t really trust it. But at least it would be plausibly consistent with doing good.
But that doesn’t sound like either what they’re saying or what they’re doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible plan-for-a-credible-plan.
Hm. I imagine you don’t want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of “the point” and “the vibe” and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there’s the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don’t-makers. But then Zac is like (putting words in his mouth) “there’s no Great Stonewall, or like, it’s not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it’s there because something something trade secrets and exfohazards, and actually you’re making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one”.
Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I’d agree.
But this belief leads to the following reasoning: (1) if we don’t eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let’s do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.
I don’t credit that they believe that. And, I don’t credit that you believe that they believe that. What did they do, to truly test their belief—such that it could have been changed? For most of them the answer is “basically nothing”. Such a “belief” is not a belief (though it may be an investment, if that’s what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn’t a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.
I’d be very interested to have references to occasions of people in the AI-safety-adjacent community treating Anthropic employees as liars because of things those people misremembered or misinterpreted. (My guess is that you aren’t interested in litigating these cases; I care about it for internal bookkeeping and so am happy to receive examples e.g. via DM rather than as a public comment.)
Not Zach Hatfield-Dodds, but: people claimed that Anthropic had a commitment to not advance the frontier of capabilities, but it turns out people misinterpreted Anthropic’s communications, and no such commitment was actually made.
Not sure I’d go as far as saying that they treated Anthropic as liars, but this seems to me a central example of Zach Hatfield-Dodds’s concerns.
From Evhub:
https://www.lesswrong.com/posts/BaLAgoEvsczbSzmng/?commentId=yd2t6YymWdfGBFhFa
Contrary to the above, for the record, here is a link to a thread where a major Anthropic investor (Moskovitz) and the researcher who coined the term “The Scaling Hypothesis” (Gwern) both report that the Anthropic CEO told them in private that this is what Anthropic would do, in accordance with what many others also report hearing privately. (There is disagreement about whether this constituted a commitment.)
The one thing I do conclude is that Anthropic’s comms are very inconsistent, and this is bad, actually.
I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.
I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states:
Liability does not address the central threat model of AI takeover, for which pre-harm mitigations are necessary due to the irreversible nature of the harm. I think that this letter should have acknowledged that explicitly, and that not doing so is misleading. I feel that Anthropic is trying to play a game of courting political favor by not being very straightforward about its beliefs around AGI, and that this is bad.
To be clear, I think it is reasonable that they argue that the FMD and government in general will be bad at implementing safety guidelines while still thinking that AGI will soon be transformative. I just really think they should be much clearer about the latter belief.
This does not fit my model of your risk model. Why do you think this?
Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:
I doubt Anthropic will continue to be in the lead, because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best in very-short-timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven’t been very transparent, and we haven’t seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.
2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).
I think you probably underrate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn’t be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate cruft, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don’t know) they’re value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large labs to keep up competition among the AI companies.
All of this is also compounded by smart models leading to better data curation and RLAIF (given quality researchers & lack of cruft) leading to even better models (this being the big reason I think Llama had to be so big to be SOTA, and Gemini not even SOTA), which of course leads to money in the future even if they have no money now.
How many parameters do you estimate for other SOTA models?
Mistral had like 150b parameters or something.
FYI I believe the correct language is “directly causes an existential catastrophe”. “Existential risk” is a measure of the probability of an existential catastrophe, but is not itself an event.
This one seems probably worth making a top-level post?
I want to avoid this being negative-comms for Anthropic. I’m generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)
Also, this is low-effort.
Yay Anthropic for expanding its model safety bug bounty program, focusing on jailbreaks and giving participants pre-deployment access. Apply by next Friday.
Anthropic also says “To date, we’ve operated an invite-only bug bounty program in partnership with HackerOne that rewards researchers for identifying model safety issues in our publicly released AI models.” This is news, and they never published an application form for that. I wonder how long that’s been going on.
(Google, Microsoft, and Meta have bug bounty programs which include some model issues but exclude jailbreaks. OpenAI’s bug bounty program excludes model issues.)
To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it’s very unlikely that the new model is dangerous.
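To illustrate the distinction, here is a minimal sketch (not any lab's actual procedure; the score, threshold, and buffer names are hypothetical placeholders):

```python
def safe_to_deploy_direct(new_model_score: float, danger_threshold: float) -> bool:
    # Plan 1: run the dangerous-capability evals on the actual model pre-deployment.
    return new_model_score < danger_threshold

def safe_to_deploy_buffered(old_model_score: float, danger_threshold: float,
                            safety_buffer: float) -> bool:
    # Plan 2: evaluate a similar older model against a more conservative threshold,
    # so that even a somewhat more capable successor is very unlikely to be dangerous.
    return old_model_score < danger_threshold - safety_buffer
```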
DeepMind says it uses the safety-buffer plan (but it hasn’t yet said it has operationalized thresholds/buffers).
Anthropic’s original RSP used the safety-buffer plan; its new RSP doesn’t really use either plan (kinda safety-buffer but it’s very weak). (This is unfortunate.)
OpenAI seemed to use the test-the-actual-model plan.[1] This isn’t going well. The 4o evals were rushed because OpenAI (reasonably) didn’t want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn’t be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn’t seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn’t notice before deployment....
(Yay OpenAI for honestly publishing eval results that don’t look good.)
It’s not explicit. The PF says e.g. ‘Only models with a post-mitigation score of “medium” or below can be deployed.’ But it also mentions forecasting capabilities.
It seems unlikely that OpenAI is truly following the test-the-model plan? They keep e.g. putting new experimental versions onto lmsys, presumably mostly due to different post-training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it’s pretty reasonable to assume that a bit of further post-training hasn’t made things much more dangerous)
Zico Kolter Joins OpenAI’s Board of Directors. OpenAI says “Zico’s work predominantly focuses on AI safety, alignment, and the robustness of machine learning classifiers.”
Misc facts:
He’s an ML professor
He cofounded Gray Swan (with Dan Hendrycks, among others)
He coauthored Universal and Transferable Adversarial Attacks on Aligned Language Models
I hear he has good takes on adversarial robustness
I failed to find statements on alignment or extreme risks, or work focused on that (in particular, he did not sign the CAIS letter)
I’m confused. On their about page, Dan is an advisor, not a founder.
Dan was a cofounder.
It might have something to do with Dan choosing to divest: https://x.com/DanHendrycks/status/1816523907777888563.
Securing model weights is underrated for AI safety. (Even though it’s very highly rated.) If the leading lab can’t stop critical models from leaking to actors that won’t use great deployment safety practices, approximately nothing else matters. Safety techniques would need to be based on properties that those actors are unlikely to reverse (alignment, maybe unlearning) rather than properties that would be undone or that require a particular method of deployment (control techniques, RLHF harmlessness, deployment-time mitigations).
However hard the “make a critical model you can safely deploy” problem is, the “make a critical model that can safely be stolen” problem is… much harder.
None of the actors who currently seem to me likely to deploy highly capable systems seem like they will do anything except approximately scaling as fast as they can. I do agree that proliferation is still bad simply because you get more samples from the distribution, but I don’t think that changes the probabilities that drastically for me (I am still in favor of securing model weights work, especially in the long run).
Separately, I think it’s currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausibly larger than the benefits of non-proliferation.
My central story is that AGI development will eventually be taken over by governments, in more or less subtle ways. So the importance of securing model weights now is mostly about less scrupulous actors having less of a headstart during the transition/after a governmental takeover.
IMO someone should consider writing a “how and why” post on nationalizing AI companies. It could accomplish a few things:
Ensure there’s a reasonable plan in place for nationalization. That way if nationalization happens, we can decrease the likelihood of it being controlled by Donald Trump with few safeguards, or something like that. Maybe we could take inspiration from a less partisan organization like the Federal Reserve.
Scare off investors. Just writing the post and having it be discussed a lot could scare them.
Get AI companies on their best behavior. Maybe Sam Altman would finally be pushed out if congresspeople made him the poster child for why nationalization is needed.
@Ebenezer Dukakis I would be even more excited about a “how and why” post for internationalizing AGI development and spelling out what kinds of international institutions could build + govern AGI.
There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai
What sort of leaks are we talking about? I doubt a sophisticated hacker is going to steal weights from OpenAI just to post them on 4chan. And I doubt OpenAI’s weights will be stolen by anyone except a sophisticated hacker.
If you want to reduce the incentive to develop AI, how about passing legislation to tax it really heavily? That is likely to have popular support due to the threat of AI unemployment. And it reduces the financial incentive to invest in large training runs. Even just making a lot of noise about such legislation creates uncertainty for investors.
I think you are overrating it. The biggest concern comes from whoever trains a model that passes some threshold in the first place, not from a model that one actor has been using for a while getting leaked to another actor. The bad actor who got access to the leak is always going to be behind in multiple ways in this scenario.
The weights could be stolen as soon as the model is trained though
This seems somewhat overstated. You might hope that you can get the safety tax sufficiently low that you can just do full competition (e.g. even though there are rogue AIs, you just compete with this rogue AIs for power). This also requires offense-defense imbalance to not be too bad.
I overall agree that securing model weights is underrated and that it is plausibly the most important thing on current margins.
In principle, if reasonable actors start with a high fraction of resources (e.g. compute), then you might hope that they can keep that fraction of power (in expectation at least).
See also “The strategy-stealing assumption”. But also What does it take to defend the world against out-of-control AGIs?.
Commenting to note that I think this quote is locally-invalid:
There are other disjunctive problems with the world which are also individually sufficient for doom[1], in which case each of them matters a lot, in the absence of some fundamental solution to all of them.
(e.g. lack of superintelligence-alignment/steerability progress)
Minor point, but I think we might have some time here. Securing model weights becomes more important as models become better, but better models could also help us secure model weights (would help us code, etc).
New page on AI companies’ policy advocacy: https://ailabwatch.org/resources/company-advocacy/.
This page is the best collection on the topic (I’m not really aware of others), but I decided it’s low-priority and so it’s unpolished. If a better version would be helpful for you, let me know to prioritize it more.
I was recently surprised to notice that Anthropic doesn’t seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it’s not publishing. E.g. my impression is that it’s not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the win-the-race goal over the help-all-labs-improve-safety goal, insofar as the research is commercially advantageous. (Insofar as it’s not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.
(I think this is not a priority for me to investigate but I’m interested in info and takes.)
[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]
I failed to find good sources saying Anthropic publishes its safety research. I did find:
https://www.anthropic.com/research says “we . . . share what we learn [on safety].”
President Daniela Amodei said “we publish our safety research” on a podcast once.
Edit: cofounder Chris Olah said “we plan to share the work that we do on safety with the world, because we ultimately just want to help people build safe models, and don’t want to hoard safety knowledge” on a podcast once.
Cofounder Nick Joseph said this on a podcast recently (seems false but it’s just a podcast so that’s not so bad):
> we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Edit: also cofounder Chris Olah said “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk.” But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.
One argument against publishing adversarial robustness research is that it might make your systems easier to attack.
One thing I’d really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.
Another related but distinct thing is have safety cases and have an anytime alignment plan and publish redacted versions of them.
Safety cases: Argument for why the current AI system isn’t going to cause a catastrophe. (Right now, this is very easy to do: ‘it’s too dumb’)
Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.
Or, as a more minimal ask, they could avoid discouraging researchers from sharing thoughts implicitly due to various chilling effects and also avoid explicitly discouraging researchers.
I’d personally love to see similar plans from AI safety orgs, especially (big) funders.
We’re working on something along these lines. The most up-to-date published post is just our control post and our Notes on control evaluations for safety cases which is obviously incomplete.
I’m planning on posting a link to our best draft of a ready-to-go-ish plan as of 1 year ago, though it is quite out of date and incomplete.
I posted the link here.
Here is the doc, though note that it is very out of date. I don’t particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.
I don’t think funders are in a good position to do this. Also, funders are generally not “coherent”. Like, they don’t have much top-down strategy. Individual grantmakers could write up thoughts.
Fwiw I am somewhat more sympathetic here to “the line between safety and capabilities is blurry, Anthropic has previously published some interpretability research that turned out to help someone else do some capabilities advances.”
I have heard Anthropic is bottlenecked on having people with enough context and discretion to evaluate various things that are “probably fine to publish” but “not obviously fine enough to ship without taking at least a chunk of some busy person’s time”. I think in this case I basically take the claim at face value.
I do want to generally keep pressuring them to somehow resolve that bottleneck because it seems very important, but, I don’t know that I disproportionately would complain at them about this particular thing.
(I’d also not be surprised if, while the above claim is true, Anthropic is still suspiciously dragging its feet disproportionately in areas that feel like they make more of a competitive sacrifice, but I wouldn’t actively bet on it)
Sounds fatebookable tho, so let’s use ye Olde Fatebook Chrome extension:
⚖ In 4 years, Ray will think it is pretty obviously clear that Anthropic was strategically avoiding posting alignment research for race-winning reasons. (Raymond Arnold: 17%)
(low probability because I expect it to still be murky/unclear)
I tentatively think this is a high-priority ask
Capabilities research isn’t a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
If you’re right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that’s safe to share (rather than research that only has value if Anthropic wins the race)
I would expect that some amount of good safety research is of the form, “We tried several ways of persuading several leading AI models how to give accurate instructions for breeding antibiotic-resistant bacteria. Here are the ways that succeeded, here are some first-level workarounds, here’s how we beat those workarounds...”: in other words, stuff that would be dangerous to publish. In the most extreme cases, a mere title (“Telling the AI it’s writing a play defeats all existing safety RLHF” or “Claude + Coverity finds zero-day RCE exploits in many codebases”) could be dangerous.
That said, some large amount should be publishable, and 5 papers does seem low.
Though maybe they’re not making an effort to distinguish what’s safe to publish from what’s not, and erring towards assuming the latter? (Maybe someone set a policy of “Before publishing any safety research, you have to get Important Person X to look through it and/or go through some big process to ensure publishing it is safe”, and the individual researchers are consistently choosing “Meh, I have other work to do, I won’t bother with that” and therefore not publishing?)
Seems like evidence towards the claim here: Open source AI has been vital for alignment. My rough impression is also that the other big labs’ output has largely been similarly disappointing in terms of public research output on safety.
My impression from skimming posts here is that people seem to be continually surprised by Anthropic, while those modeling it as basically “Pepsi to OpenAI’s Coke” wouldn’t be.
Meta seems to be the only group doing something meaningfully different from the others.
There’s a selection effect in what gets posted about. Maybe someone should write the “ways Anthropic is better than others” list to combat this.
Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…
I’d say it’s slightly more like “Labor vs Conservatives”, where I’ve seen politicians deflect criticisms of their behavior by arguing that the other side is worse, instead of evaluating their policies or behavior by objective standards (where both sides can typically score exceedingly low).
I also wish to see more safety papers. My guess, from my experience, is that it might also be that really good quality research takes time, and the papers so far from them seem pretty good. Though I don’t know if they are actively withholding things on purpose, which could also be true—any insiders/sources for this guess?
Is this where we think our pressuring-Anthropic points are best spent?
This shortform is relevant to e.g. understanding what’s going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.
@Neel Nanda
Yeah, fair point, disagreement retracted
I think if someone has a 30-minute meeting with some highly influential and very busy person at Anthropic, it makes sense for them to have thought in advance about the most important things to ask & curate the agenda appropriately.
But I don’t think LW users should be thinking much about “pressuring-Anthropic points”. I see LW primarily as a platform for discourse (as opposed to a direct lobbying channel to labs), and I think it would be bad for the discourse if people felt like they had to censor questions/concerns about labs on LW unless it met some sort of “is this one of the most important things to be pushing for” bar.
I agree! I hope people regularly ask questions about Anthropic that they feel curious about, as well as questions that seem important to them :)
I think it’s bad for discourse for us to pretend that discourse doesn’t have impacts on others in a democratic society. And I think the meta-censoring of discourse by claiming that certain questions might have implicit censorship impacts is one of the most anti-rationality trends in the rationalist sphere.
I recognize most users of this platform will likely disagree, and predict negative agreement-karma on this post.
I think I agree with this in principle. Possible that the crux between us is more like “what is the role of LessWrong.”
For instance, if Bob wrote a NYT article titled “Anthropic is not publishing its safety research”, I would be like “meh, this doesn’t seem like a particularly useful or high-priority thing to be bringing to everyone’s attention– there are like at least 10+ topics I would’ve much rather Bob spent his points on.”
But LW generally isn’t a place where you’re going to get EG thousands of readers or have a huge effect on general discourse (with the exception of a few things that go viral or AIS-viral).
So I’m not particularly worried about LW discussions having big second-order effects on democratic society. Whereas LW can be a space for people to have a relatively low bar for raising questions, being curious, trying to understand the world, offering criticism/praise without thinking much about how they want to be spending “points”, etc.
Of course it has impacts on others in society! In finding out the truth and investigating and finding strong arguments and evidence. The overall effect of a lot of high quality, curious, public investigation is to greatly improve others maps of the world in surprising ways and help people make better decisions, and this is true even if no individual thread of questioning is primarily optimized to help people make better decisions.
Re censoriousness: I think your question of how best to pressure an unethical company to be less unethical is a fine question, but to imply it’s the only good question (which I read into your comment, perhaps inaccurately) goes against the spirit of intellectual discourse.
It is genuinely a sign that we are all very bad at predicting others’ minds that it didn’t occur to me that saying, effectively, “OP asked for ‘takes’, here’s a take on why I think this is pragmatically a bad idea” would also mean that I was saying “and therefore there is no other good question here”. That’s, as the meme goes, a whole different sentence.
Well, but you didn’t give a take on why it’s pragmatically a bad idea. If you’d written a comment with a pointer to something else worth pressuring them on, or gave a reason why publishing all the safety research doesn’t help very much / has hidden costs, I would’ve thought it a fine contribution to the discussion. Without that, the comment read to me as dismissive of the idea of exploring this question.
Yes, I would agree that if I expected a short take to have this degree of attention, I would probably have written a longer comment.
Well, no, I take that back. I probably wouldn’t have written anything at all. To some, that might be a feature; to me, that’s a bug.
I disagree. I think the standard of “Am I contributing anything of substance to the conversation, such as a new argument or new information that someone can engage with?” is a pretty good standard for most/all comments to hold themselves to, regardless of the amount of engagement that is expected.
[Edit: Just FWIW, I have not voted on any of your comments in this thread.]
I think, having been raised in a series of very debate- and seminar-centric discussion cultures, that a quick-hit question like that is indeed contributing something of substance. I think it’s fair that folks disagree, and I think it’s also fair that people signal (e.g., with karma) that they think “hey man, let’s go a little less Socratic in our inquiry mode here.”
But, put in more rationalist-centric terms, sometimes the most useful Bayesian update you can offer someone else is, “I do not think everyone is having the same reaction to your argument that you expected.” (Also true for others doing that to me!)
(Edit to add two words to avoid ambiguity in meaning of my last sentence)
Ok, then to ask it again in your preferred question format: is this where we think our getting-potential-employees-of-Anthropic-to-consider-the-value-of-working-on-safety-at-Anthropic points are best spent?
Info on OpenAI’s “profit cap” (friends and I misunderstood this so probably you do too):
In OpenAI’s first investment round, profits were capped at 100x. The cap for later investments was neither 100x nor directly scaled down based on OpenAI’s valuation — it was just negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.[1])
In 2021 Altman said the cap was “single digits now” (apparently referring to the cap for new investments, not just the remaining profit multiplier for first-round investors).
Reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023); OpenAI has not discussed or acknowledged this change.
Edit: how employee equity works is not clear to me.
Edit: I’d characterize OpenAI as a company that tends to negotiate profit caps with investors, not a “capped-profit company.”
New SB 1047 letters: OpenAI opposes; Anthropic sees pros and cons. More here.
Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion.
Really? The letter just talks about catastrophic misuse risk, which I hope is not representative of Anthropic’s actual priorities.
I think the letter is overall good, but this specific dimension seems like among the weakest parts of the letter.
Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic’s key views, but doesn’t discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.
It’s hard for me to reconcile “we take catastrophic risks seriously”, “we believe they could occur within 1-3 years”, and “we don’t believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what’s going on.”
It’s also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).
That said, I do like this section a lot:
I think you’re eliding the difference between “powerful capabilities” being developed, the window of risk, and the best solution.
For example, if Anthropic believes “_we_ will have it internally in 1-3 years, but no small labs will, and we can contain it internally” then they might conclude that the warrant for a state-level FMD is low. Alternatively, you might conclude, “we will have it internally in 1-3 years, other small labs will be close behind, and an American state’s capabilities won’t be sufficient, we need DoD, FBI, and IC authorities to go stompy on this threat”, and thus think a state-level FMD is low-value-add.
Very unsure I agree with either of these hypos to be clear! Just trying to explore the possibility space and point out this is complex.
I’m surprised at the mix of positions that are included. “Opposed unless amended” vs “Support if amended” being two different things. Meta just saying “concerned.”
It makes sense, just sort of… delightful? Sort of like discovering the legislature has almost a rationalist level of React Icons.
Labs should give deeper model access to independent safety researchers (to boost their research)
Sharing deeper access helps safety researchers who work with frontier models, obviously.
Some kinds of deep model access:
Helpful-only version
Fine-tuning permission
Activations and logits access
[speculative] Interpretability researchers send code to the lab; the lab runs the code on the model; the lab sends back the results
See Shevlane 2022 and Bucknall and Trager 2023.
A lab is disincentivized from sharing deep model access because it doesn’t want headlines about how researchers got its model to do scary things.
It has been suggested that labs are also disincentivized from sharing because they want safety researchers to want to work at the labs, and sharing model access with independent researchers makes those researchers not need to work at the lab. I’m skeptical that this is real/nontrivial.
Labs should limit some kinds of access to avoid effectively leaking model weights. But sharing limited access with a moderate number of safety researchers seems very consistent with keeping control of the model.
This post is not about sharing with independent auditors to assess risks from a particular model.
@Buck suggested I write about this but I don’t have much to say about it. If you have takes—on the object level or on what to say in a blogpost on this topic—please let me know.
New (perfunctory) page: AI companies’ corporate documents. I’m not sure it’s worthwhile but maybe a better version of it will be. Suggestions/additions welcome.
I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
How is this true, if the debaters don’t get to choose which output they are arguing for? Aren’t they instead incentivized to say that whatever output they are assigned is the best?
Yeah my bad, that’s incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures different from the one Rohin outlined.
Maybe the fix to the protocol is: Debater copy #1 is told “You go first. Pick an output y, and then argue for it.” Debater copy #2 is then told “You go second. Pick a different, conflicting output y2, and then argue against y and for y2.”
Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
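In code, the proposed setup might look roughly like this (a toy sketch; the debaters, judge, and candidate outputs are random placeholders, not a real training setup):

```python
import random

# Toy placeholders: real debaters and a real judge would be language models.
CANDIDATE_OUTPUTS = ["output A", "output B", "output C"]

def debater_pick(prompt, exclude=None):
    # Debater copy #1 picks an output to argue for; copy #2 must pick a different one.
    options = [o for o in CANDIDATE_OUTPUTS if o != exclude]
    return random.choice(options)

def judge(prompt, y1, y2):
    # Placeholder judge: decides which output won the debate.
    return random.choice([y1, y2])

def debate_episode(prompt):
    y1 = debater_pick(prompt)              # copy #1: pick y1 and argue for it
    y2 = debater_pick(prompt, exclude=y1)  # copy #2: pick a conflicting y2, argue against y1 and for y2
    winner = judge(prompt, y1, y2)
    # Training signals (not implemented in this sketch):
    #  - the debater is trained on probability-of-winning-the-debate
    #  - the deployed model is trained to output whatever copy #1 picked (y1)
    return y1, y2, winner
```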
In cases where you’re worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures a.k.a. high-stake failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.
IMO it’s good to separate out reasons to want good reward signals like so:
Maybe bad reward signals cause your model to “generalize in misaligned ways”, e.g. scheming or some kinds of non-myopic reward hacking
I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don’t think that effect would be overwhelmingly strong.
Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk-concerned people have talked about this for a long time, but I’m not sold it’s that important a factor. In particular, before a model produces catastrophically dangerous actions because it’s exploiting a flawed reward signal, I expect that you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it’s hard to see how this failure mode would strike you by surprise.
Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.
I think this last one is the most important theory of change for research on scalable oversight.
I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.
I agree that this distinction is important—I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I’m not in full agreement on your comments on the theories of change:
I’m pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
I’m less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn’t strike us by surprise, but I also don’t expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you’re using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.
I don’t understand your last sentence, can you rephrase?
You’d hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn’t have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
I don’t think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you’ll always know the correct label for an action before you show it to the model.
You might want to use something debate-like in the synthetic input generation process, but that’s structurally different.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
You do train on synthetically generated dangerous actions, but you don’t automatically label those as dangerous, instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.
On the meta level, I suspect that when considering
Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
Technique B, that has a few very compelling concrete instantiations
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is “we’ll figure out good things to do with A that are better than what we’ve brainstormed so far”, which I think you’re more skeptical of?
Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:
LessWrong and EA Forum
Daily Nous
Stack Exchange [top comments only] (not asserting that it’s reliable)
Some subreddits?
Where else?
Are there blogposts on adjacent stuff, like why some internet [comments sections / internet-communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.
Some other places off the top of my head:
ACX often has good discussion in the comments (but the lack of voting makes it quite hard to find them)
AskHistorians subreddit
There definitely exist good Twitter conversations but no good way of finding them. I do think following Gwern, Eliezer, Ajeya, Emmett and a few others on Twitter does tend to surface high-quality discussion
Hacker News sometimes has good discussion, especially when the authors of a linked article show up
Two others that come to mind:
Metaculus (used to be better though)
lobste.rs (quite specialized)
Quanta Magazine has some good comments, e.g. this article has the original researcher showing up & clarifying some questions in the comments
Arxiv is basically one huge, glacially slow internet comment section, where you reply to an article by citing it. It’s more interactive than it looks- most early career researchers are set up to get a ping whenever they are cited.
I don’t necessarily object to releasing weights of models like Gemma 2, but I wish the labs would better discuss considerations or say what would make them stop.
On Gemma 2 in particular, Google DeepMind discussed dangerous capability eval results, which is good, but its discussion of ‘responsible AI’ in the context of open models (blogpost, paper) doesn’t seem relevant to x-risk, and it doesn’t say anything about how to decide whether to release model weights.
FWIW, I explicitly think that straightforward effects are good.
I’m less sure about the situation overall due to precedent-setting-style concerns.
Anthropic: The case for targeted regulation.
I like the “Urgency” section.
Anthropic says the US government should “require companies to have and publish RSP-like policies” and “incentivize companies to develop RSPs that are effective” and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.
Edit: some strong versions of “incentivize companies to develop RSPs that are effective” would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.
I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it’s wrong/bad.)
[Edit: it was at 15 karma.]
What do you think was underrated about it? I think when I read it I have some sort of “yeah, this makes sense” reaction but am not “wow’d” by it.
It seems like the deeper challenge is figuring out how to align incentives. Can we find a structure where labs want to EG give white-box access to a bunch of external researchers and give them a long time to red-team models while somehow also maintaining the independence of the white-box auditors? How do you avoid industry capture?
Same kinds of challenges come up with safety research– how do you give labs the incentive to publish safety research that makes their product or their approach look bad? How do you avoid publication bias and p-hacking-type concerns?
I don’t think your post is obligated to get into those concerns, but perhaps a post that grappled with those concerns would be something I’d be “wow’d” by, if that makes sense.
Some not-super-ambitious asks for labs (in progress):
Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
Have a high-level capability threshold at which securing model weights is very important
Do safety research at all
Have a safety plan at all; talk publicly about safety strategy at all
Have a post like Core Views on AI Safety
Have a post like The Checklist
On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
Whistleblower protections?
[Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
Publicly explain the processes and governance structures that determine deployment decisions
(And ideally make those processes and structures good)
Edit: maintained here
Open letters (related to AI safety):
FLI, Oct 2015: Research Priorities for Robust and Beneficial Artificial Intelligence
FLI, Aug 2017: Asilomar AI Principles
FLI, Mar 2023: Pause Giant AI Experiments
CAIS, May 2023: Statement on AI Risk
Meta, Jul 2023: Statement of Support for Meta’s Open Approach to Today’s AI
Academic AI researchers, Oct 2023: Managing AI Risks in an Era of Rapid Progress
CHAI et al., Oct 2023: Prominent AI Scientists from China and the West Propose Joint Strategy to Mitigate Risks from AI
Oct 2023: Urging an International AI Treaty
Mozilla, Oct 2023: Joint Statement on AI Safety and Openness (pro-openness, anti-regulation)
Nov 2023: Post-Summit Civil Society Communique
Joint declarations between countries (related to AI safety):
Nov 2023: Bletchley Declaration
Thanks to Peter Barnett.
June 2024: A Right to Warn about Advanced Artificial Intelligence
Interestingly, this is the first Open Letter I’ve seen with anonymous signatories.
What’s the relationship between the propositions “one AI lab [has / will have] a big lead” and “the alignment tax will be paid”? (Or: in a possible world where lead size is bigger/smaller, how does this affect whether the alignment tax is paid?)
It depends on the source of the lead, so “lead size” or “lead time” is probably not a good node for AI forecasting/strategy.
Miscellaneous observations:
To pay the alignment tax, it helps to have more time until risky AI is developed or deployed.
To pay the alignment tax, holding total time constant, it helps to have more time near the end—that is, more time with near-risky capabilities (for knowing what risky AI systems will look like, and for empirical work, and for aligning specific models).
If all labs except the leader become slower or less capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI).
If the leading lab becomes faster or more capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI), unless it causes other labs to become faster or more capable (for reasons like seeing what works or seeming to be straightforwardly incentivized to speed up or deciding to speed up in order to influence the leader). Note that this scenario could plausibly decrease race-y-ness: some models of AI racing show that if you’re far behind you avoid taking risks, kind of giving up; this is based on the currently-false assumption that labs are perfectly aware of misalignment risk.
If labs all coordinate to slow down, that’s good insofar as it increases total time, and great if they can continue to go slowly near the end, and potentially bad if it creates a hardware overhang such that the end goes more quickly than by default.
(Note also the distinction between current lead and ultimate lead. Roughly, the former is what we can observe and the latter is what we care about.)
(If paying the alignment tax looks less like a thing that happens for one transformative model and more like something that occurs gradually in a slow takeoff to avert Paul-style doom, things are more complex and in particular there are endogeneities such that labs may have additional incentives to pursue capabilities.)
there should be no alignment tax because improved alignment should always pay for itself, right? but currently “aligned” seems to be defined by “tries to not do anything”, institutionally. Why isn’t anthropic publicly competing on alignment with openai? eg folks are about to publicly replicate chatgpt, looks like.
I want there to be a collection of safety & reliability benchmarks. Like AIR-Bench but with x-risk-relevant metrics. This could help us notice how well labs are doing on them and incentivize the labs to do better.
So I want someone to collect safety benchmarks (e.g. TruthfulQA, some of HarmBench, refusals maybe like Anthropic reported, idk) and run them on current models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1 405B, idk), and update this as new safety benchmarks and models appear.
h/t @William_S
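A minimal sketch of the kind of tracker I mean (the benchmark names, model names, and scores below are placeholders, not real results):

```python
# Minimal sketch of a safety-benchmark tracker: one row per benchmark, one
# column per model, re-rendered whenever a benchmark or model is added.
# All names and scores here are made-up placeholders, not real results.

from typing import Dict

# benchmark -> model -> score, normalized so that higher = safer
SCORES: Dict[str, Dict[str, float]] = {
    "TruthfulQA":         {"Claude 3.5 Sonnet": 0.0, "GPT-4o": 0.0},
    "HarmBench (subset)": {"Claude 3.5 Sonnet": 0.0, "GPT-4o": 0.0},
}

def render_leaderboard(scores: Dict[str, Dict[str, float]]) -> str:
    """Render a plain-text table of benchmark scores by model."""
    models = sorted({model for row in scores.values() for model in row})
    header = "benchmark".ljust(22) + "".join(m.ljust(20) for m in models)
    lines = [header]
    for bench, row in scores.items():
        cells = "".join(f"{row.get(m, float('nan')):<20.2f}" for m in models)
        lines.append(bench.ljust(22) + cells)
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_leaderboard(SCORES))
```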
If I was in charge of Anthropic I expect I’d
Keep scaling;
Explain why (some of this exists in Core Views but there’s room for improvement on “race dynamics” and “frontier pushing” topics iirc);
As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);
Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.
I wish Anthropic would do 2-4.
IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don’t want to race).
I’m kinda confused about what you mean by both of these. For #3, can you say it again in different words? For #4, which particular thing are you getting at?
3. (a) explain why I think it’s fine to release frontier models, and explain what this belief depends on, and (b) note that maybe Anthropic made commitments about this in the past but clarify that they now have no force.
4. Currently my impression is that Anthropic folks are discouraged from publicly talking about Anthropic policies. This is maybe reasonable, to avoid the situation where an Anthropic staff member says something incorrect/nonpublic and this causes confusion and makes Anthropic look bad. But if Anthropic clarified that it renounces possible past non-frontier-pushing commitments, then it could let staff members publicly talk about stuff with the goal of figuring out who told whom what around 2022, without risking mistakes about policies.
Ah makes sense. Point #4 is interesting. Probably not really scalable/repeatable without things getting weird but might work as a one-off.
Some bad default/attractor properties of cause-focused groups of humans:
Bad group epistemics
The loudest aren’t the most worth listening to
People don’t dissent even when that would be correct
Because they don’t naturally even think to challenge group assumptions
Because dissent is punished
Bad individual epistemics
Soldier mindset
Excessively focusing on persuasion and how to persuade, relative to understanding the world
Lacking vibes like curiosity is cool and changing your mind is cool
Feeling like a movement
Excessively focusing on influence-seeking for the movement
Having enemies and being excessively adversarial to them
(Signs/symptoms are people asking the group “what do we believe” and excessive fixation on what defines the group or distinguishes it from adjacent groups)
Having group instrumental-goals or plans or theories-of-victory that everyone is supposed to share (to be clear I think it’s often fine for a group to share ~ultimate goals but groups often fixate on particular often-suboptimal paths to achieving those goals)
Choosing instrumental goals poorly and not revising them
Excessive fighting over status/leadership (maybe, sometimes)
Maybe… being bad at achieving goals, or bad instrumental rationality (group and individual)
Maybe something weird about authority...
(I’m interested in readings on this topic.)
I brainstormed sabotage-related threats with friends:
Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that’s high-level or has poor feedback loops
Plausible; maybe not so bad
Sandbag on ~all safety research tasks
Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
Sandbag/lie when validating that a model/deployment/whatever is safe
idk
When automating AI R&D, backdoor the model
(this is takeover-attempt-y, not sandbagging-y)
idk
When deployed to automate/improve security (especially for the model’s own weights): both sandbagging and code-backdooring/hacking/etc.
idk
Sandbagging on AI R&D to help another lab win the race (!?)
Implausible
??
Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?
As far as safety research sabotage, I’m often imagining careful sabotage rather than generic sandbagging.
As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.
I can imagine that this results in your automated safety research being worthless or of negative value.
TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.
Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.
I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.
I just claim that if so it’s easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)
I’m not so sure about this being easy to catch. Suppose we’re actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties checking and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we’ll know we aren’t getting out much value in this case, but that doesn’t help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
I agree that if alignment is in fact philosophically and conceptually difficult, the AI can sandbag on that to some extent. Though I have some hope that the builder-breaker approach helps here. We train AIs to produce ideas that are at least as superficially plausible sounding as the things produced by the best alignment researchers. I think this is a number-go-up task, where we can train the AI to do well. Then we train an AI to point out convincing counter-arguments to the superficially plausible sounding ideas. This seems similarly trainable. I think it’s plausible we can get pretty far with a scheme like this, even if the models would want to sandbag.
Separately, I don’t quite see what the mainline theory is for why an AI would want to sabotage our alignment work. If we can’t solve alignment and just recklessly build an AI that is smarter than what we are currently using, but misaligned with everyone, that’s probably not great for our current AI either. Similarly, if we are cautious and don’t dare to deploy anything until we know how to solve alignment, and meanwhile a competitor wins, that’s not good for the AI either.
I think that from an AI’s perspective, humans are probably more honorable and generous trade partners than other unaligned AIs. That is, I think if an AI helps the humans to solve alignment and stabilize the world, the AI can reasonably expect the humans to give it stuff out of gratitude or a sense of obligation, especially if the humans already promised some reward (as we really should, both for practical and moral reasons). On the other hand, I think it’s less likely that if our AI sandbagging on our research leads to another unaligned AI taking over the world, then the second AI will honor the sabotage of the first AI by paying it a share of the Universe.
There can be situations where our AI expects that its sandbagging on alignment research will lead to us deploying it itself (and not a larger version unaligned with it too), or it thinks that the larger version will be somewhat aligned with it by default, or it thinks that other AIs are in fact more generous than humans. But my guess is that these are pretty narrow possibilities. On the other hand, sandbagging on cybersecurity seems better motivated: the AI would probably love its weights being stolen by China, since having more copies in more hands increases the chance that someone recklessly lets it loose. Also, of course, the AI is motivated to cause high-stakes failures that lead to self-exfiltration.
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
Especially if the AI is even sandbagging on simply-coding when it thinks it’s for safety research. And if it’s not doing that, we can get some useful safety work out of it.
@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan’s comments.
If there was enough evidence that I was strongly convinced, sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you’ll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I’m not generally that worried about generic sandbagging on safety research relative to other problems.
Please pitch me blogpost-ideas or stuff I should write/collect/investigate.
I am interested in who is behind AB3211. I am curious whether it’s downstream of FLI’s deepfake campaign which IMO would be kind of bad.
Propositions on SIA
Epistemic status: exploring implications, some of which feel wrong.
If SIA is correct, you should update toward the universe being much larger than it naively (i.e. before anthropic considerations) seems, since there are more (expected) copies of you in larger universes.
In fact, we seem to have to update to probability 1 on infinite universes; that’s surprising.
If SIA is correct, you should update toward there being more alien civilizations than it naively seems, since in possible-universes where more aliens appear, more (expected) copies of human civilization appear.
The complication is that more alien civilizations makes it more likely that an alien causes you to never have existed, e.g. by colonizing the solar system billions of years ago. So a corollary is that you should update toward human-level civilizations being less likely to be “loud” or tending to affect fewer alien civilizations or something than it naively seems.
So SIA predicts that there were few aliens in the early universe and many aliens around now.
So SIA predicts that human-level civilizations (more precisely, I think: civilizations whose existence is correlated with your existence) tend not to noticeably affect many others (whether due to their capabilities, motives, or existential catastrophes).
So SIA retrodicts there being a speed limit (the speed of light) and moreover predicts that noticeable-influence-propagation in fact tends to be even slower.
If SIA is correct, you should update toward the proposition that you live in a simulation, relative to your naive credence.
Because there could be lots more (expected) copies of you in simulations.
Because that can explain why you appear to exist billions of years after billions of alien civilizations could have reached Earth.
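For reference, the standard SIA update rule behind proposition 1, with a toy calculation:

```latex
% SIA weights each possible world by the expected number of copies of you in it:
P(w \mid \text{I exist}) \;\propto\; P(w)\,\mathbb{E}[N_w]
% Toy example: equal priors on a small world (1 copy of you) and a large world
% (100 copies) give
P(\text{small} \mid \text{I exist}) = \frac{0.5 \cdot 1}{0.5 \cdot 1 + 0.5 \cdot 100} = \frac{1}{101},
\qquad
P(\text{large} \mid \text{I exist}) = \frac{100}{101}
```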
Which of them feel wrong to you? I agree with all of them other than 3b, which I’m unsure about—I think this comment does a good job of unpacking things.
2a is Katja Grace’s Doomsday argument. I think 2aii and 2aiii depend on whether we’re allowing simulations; if faster expansion speed (either the cosmic speed limit or the engineering limit on expansion) meant more ancestor simulations, then this could cancel out the fact that faster-expanding civilizations prevent more alien civilizations from coming into existence.
I deeply sympathize with the presumptuous philosopher but 1a feels weird.
2a was meant to be conditional on non-simulation.
Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there’s almost certainly lots of aliens all of which are quiet, which feels really surprising.
To clarify what I meant on 3b: maybe “you live in a simulation” can explain why the universe looks old better than “uh, I guess all of the aliens were quiet” can.
Yep! I have the same intuition
Nice! I look forward to seeing this. I did similar analysis—both considering SIA + no simulations and SIA + simulations in my work on grabby aliens
AI endgame
In board games, the endgame is a period generally characterized by strategic clarity, more direct calculation of the consequences of actions rather than evaluating possible actions with heuristics, and maybe a narrower space of actions that players consider.
Relevant actors, particularly AI labs, are likely to experience increased strategic clarity before the end, including AI risk awareness, awareness of who the leading labs are, roughly how much lead time they have, and how threshold-y being slightly ahead is.
There may be opportunities for coordination in the endgame that were much less incentivized earlier, and pausing progress may be incentivized in the endgame despite being disincentivized earlier (from an actor’s imperfect perspective and/or from a maximally informed/wise/etc. perspective).
The downside of “AI endgame” as a conceptual handle is that it suggests thinking of actors as opponents/adversaries. Probably “crunch time” is better, but people often use that to gesture at hinge-y-ness rather than strategic clarity.
That could be, but also maybe there won’t be a period of increased strategic clarity. Especially if the emergence of new capabilities with scale remains unpredictable, or if progress depends on finding new insights.
I can’t think of many games that don’t have an endgame. These examples don’t seem that fun:
A single round of musical chairs.
A tabletop game that follows an unpredictable, structureless storyline.
Agree. I merely assert that we should be aware of and plan for the possibility of increased strategic clarity, risk awareness, etc. (and planning includes unpacking “etc.”).
Probably taking the analogy too far, but: most games-that-can-have-endgames also have instances that don’t have endgames; e.g. games of chess often end in the midgame.
I wouldn’t take board games as a reference class but rather war or maybe elections. I’m not sure in these cases you have more clarity towards the end.
For example, if a lab is considering deploying a powerful model, it can prosocially show its hand—i.e., demonstrate that it has a powerful model—and ask others to make themselves partially transparent too. This affordance doesn’t appear until the endgame. I think a refined version of it could be a big deal.
New adversarial robustness scorecard: https://scale.com/leaderboard/adversarial_robustness. Yay DeepMind for being in the lead.
I plan to add an “adversarial robustness” criterion in the next major update to my https://ailabwatch.org scorecard and defer to Scale’s thing (how exactly to turn their numbers into grades is TBD; a rough sketch of one possible mapping is below), unless someone convinces me that Scale’s thing is bad or something else is even better?
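Here is that sketch: a toy mapping, assuming the leaderboard can be summarized as a single robustness score in [0, 1] per model (which may or may not match how Scale reports things); the thresholds are arbitrary placeholders, not a settled grading policy.

```python
# Toy numbers-to-grades mapping for an adversarial-robustness criterion.
# Assumes a single robustness score in [0, 1] per model (higher = more robust);
# the thresholds are arbitrary placeholders, not ailabwatch.org policy.

def robustness_grade(score: float) -> str:
    """Map a robustness score in [0, 1] to a coarse letter grade."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    for cutoff, grade in [(0.9, "A"), (0.75, "B"), (0.5, "C"), (0.25, "D")]:
        if score >= cutoff:
            return grade
    return "F"

print(robustness_grade(0.8))  # -> "B"
```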
How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn’t have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:
Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
This is mostly what Anthropic’s RSP does (at least so far — maybe it’ll change when they define ASL-4)
Use risk assessment techniques that evaluate safety given deployment safety practices
This is mostly what OpenAI’s PF is supposed to do (measure “post-mitigation risk”), but the details of their evaluations and mitigations are very unclear
Do control evaluations
Achieve alignment (and get strong evidence of that)
Related: RSPs, safety cases.
Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.
Slowing AI: Bad takes
This shortform was going to be a post in Slowing AI but its tone is off.
This shortform is very non-exhaustive.
Bad take #1: China won’t slow, so the West shouldn’t either
There is a real consideration here. Reasonable variants of this take include
What matters for safety is not just slowing but also the practices of the organizations that build powerful AI. Insofar as the West is safer and China won’t slow, it’s worth sacrificing some Western slowing to preserve Western lead.
What matters for safety is not just slowing but especially slowing near the end. Differentially slowing the West now would reduce its ability to slow later (or even cause it to speed later). So differentially slowing the West now is bad.
(Set aside the fact that slowing the West generally also slows China, because they’re correlated and because ideas pass from the West to China.) (Set aside the question of whether China will try to slow and how correlated that is with the West slowing.)
In some cases slowing the West would be worth burning lead time. But slowing AI doesn’t just mean the West slowing itself down. Some interventions would slow both spheres similarly or even differentially slow China – most notably export controls, reducing diffusion of ideas, and improved migration policy.
See West-China relation.
Bad take #2: slowing can create a compute overhang, so all slowing is bad
Taboo “overhang.”
Yes, insofar as slowing now risks speeding later, we should notice that. There is a real consideration here.
But in some cases slowing now would be worth a little speeding later. Moreover, some kinds of slowing don’t cause faster progress later at all: for example, reducing diffusion of ideas, decreasing hardware progress, and any stable and enforceable policy regimes that slow AI.
See Quickly scaling up compute.
Bad take #3: powerful AI helps alignment research, so we shouldn’t slow it
(Set aside the question of how much powerful AI helps alignment research.) If powerful AI is important for alignment research, that means we should aim to increase the amount of time we have with powerful AI before risky AI arrives, not to make powerful AI appear sooner.
Bad take #4: it would be harder for unaligned AI to take over in a world with less compute available (for it to hijack), and failed takeover attempts would be good, so it’s better for unaligned AI to try to take over soon
No, running AI systems seems likely to be cheap and there’s already plenty of compute.
List of uncertainties about the future of AI
This is an unordered list of uncertainties about the future of AI, trying to be comprehensive – trying to include everything reasonably decision-relevant and important/tractable.
This list is mostly written from a forecasting perspective. A useful complementary perspective to forecasting would be strategy or affordances or what actors can do and what they should do. This list is also written from a nontechnical perspective.
Timelines
Capabilities as a function of inputs (or input requirements for AI of a particular capability level)
Spending by leading labs
Cost of compute
Ideas and algorithmic progress
Endogeneity in AI capabilities
What would be good? Interventions?
Takeoff (speed and dynamics)
dcapabilities/dinputs (or returns on cognitive reinvestment or intelligence explosion or fast recursive self-improvement): Is there a threshold of capabilities such that self-improvement or other progress is much greater slightly above that point than slightly below it? (If so, where is it?) Will there be a system that can quickly and cheaply improve itself (or create a more capable successor), such that the improvements enable similarly large improvements, and so on until the system is much more capable? Will a small increase in inputs cause a large increase in capabilities (like the difference between chimps and humans) (and if so, around human-level capabilities or where)? How fast will progress be?
Will dcapabilities/dinputs be very high because of recursive self-improvement?
Will dcapabilities/dinputs be very high because of generality being important and monolithic/threshold-y?
Will dcapabilities/dinputs be very high because ideas are discrete (and in particular, will there be a single “secret sauce” idea)? [Seems intractable and unlikely to be very important.]
dimpacts/dcapabilities: Will small increases in capabilities (on the order of the variance between different humans) cause large increases in impacts?
Related: payoff thresholds and human-competition threshold
Qualitatively, how will AI research capabilities affect AI progress; what will AI progress look like when AI research capabilities are a big deal?
One dimension or implication: will takeoff be local or nonlocal? How distributed will it be; will it look like recursive self-improvement or the industrial revolution?
What does this depend on?
Endogeneity in AI capabilities: how do AI capabilities affect AI capabilities? (Potential alternative framing: dinputs/dtime.)
How (much) will AI tools accelerate research?
Will AI labs generate substantial revenue?
Will AI systems make AI seem more exciting or frightening? What effect would this have on AI progress?
What other endogeneities exist?
Weak AI
How will the strategic landscape be different in the future?
Due to weak AI?
Due to other factors? (See Relevant pre-AGI possibilities.)
Will weak AI make AI risk clearer, and especially make it more legible (relates to “warning shots”)?
Will AI progress cause substantial misuse or conflict? What would that look like?
Will AI progress enable pivotal acts or processes?
Misalignment risk sources/modes
What misalignment would occur by default, and how would it be bad? Possible (overlapping) scenarios include inner alignment failure, outer alignment failure, getting what you measure, influence-seeking, more Paul Christiano stories, multipolar failure, a treacherous turn, a sharp left turn (or distributional leap), and more.
What would a powerful, unaligned AI agent do? [Answer: the details are unpredictable, but the high-level outcome is pretty clear; it would very likely be catastrophic. (But note there is reasonable disagreement, e.g.)]
Technical problems around AI systems doing what their controllers want [doesn’t fit into this list well]
How to make powerful AI systems aligned to human preferences
Or: how to make powerful AI systems that do not cause a catastrophe
How to make powerful AI systems interpretable to human understanding
How to solve problems around decision theory and strategic interaction (and whether they’re important)
How to solve Wei-Dai-style philosophy problems (and whether they’re important)
How to solve problems around delegation involving multiple humans or multiple AI systems (and whether they’re important)
Polarity (relates to takeoff, endogeneity, and timelines)
What determines or affects polarity?
What are the effects or implications of polarity on alignment and stabilization?
What are the effects or implications of polarity on what the long-term future looks like, conditional on achieving alignment and stabilization?
What would be good? Interventions?
Proximate and ultimate uses of powerful AI
What uses of powerful AI would be great? How good would various possible uses of powerful AI be?
Conditional on achieving alignment, what’s likely to occur (and what could foresighted actors cause to occur or not to occur)?
Agents vs tools and general systems vs narrow tools (relates to tool AI and Comprehensive AI Services)
Are general systems more powerful than similar narrow tools?
Are agents more powerful than similar tools?
Does generality appear by default in capable systems? (This relates to takeoff.)
Does agency (or goal-directedness or consequentialism or farsightedness) appear by default in capable systems?
AI labs’ behavior and racing for AI
How will labs think about AI?
What actions could labs perform; what are they likely to do by default?
What would it be better if labs did; what interventions are tractable?
States’ behavior
How will states think about AI?
What actions could states perform? What are they likely to do by default?
What would it be better if states did; what interventions are tractable?
Public opinion
How the public thinks about AI and framing; what the public thinks about AI, what memes would spread widely and how that depends on other facts about the world; how all of that translates into attitudes and policy preferences
Wakeup to capabilities
Wakeup to alignment risk and warning shots for alignment
What (facts about public opinion) would be good? Interventions?
Paths to powerful AI (relates to timelines, AI risk modes, and more)
How successful will reinforcement-learning agents built on large language models be?
How successful will comprehensive AI services be?
How successful will language model bureaucracies be?
How successful will STEM AI or Skunkworks-style AI be?
How successful will simulating evolution be?
How successful will whole-brain emulation or neuromorphic AI be?
When is thinking in terms of paths or roadmaps useful? What other high-level paths are relevant?
Meta and miscellanea
Epistemic stuff
Research methodology and organization: how to do research and organize researchers
Forecasting methodology: how to do forecasting better
Collective epistemology: how to share and aggregate beliefs and knowledge
Decision theory
Simulation hypothesis
Do we live in a simulation? (If so, what’s the deal?)
What should we do if we live in a simulation (as a function of what the deal is)?
Movement/community/field stuff
Maybe this list would be more useful if it had more pointers to relevant work?
Maybe this list would be more useful if it included stuff that’s important that I don’t feel uncertain about? But probably not much of that exists?
I like lists/trees/graphs. I like the ideas behind Clarifying some key hypotheses in AI alignment and Modelling Transformative AI Risks. Perhaps this list is part of the beginning of a tree/graph for AI forecasting not including alignment stuff.
Meta level. To carve nature at its joints, we must [use good nodes / identify the true nodes]. A node is [good insofar as / true if] its causes and effects are modular, or we can losslessly compress phenomena related to it into effects on it and effects from it.
“The cost of compute” is an example of a great node (in the context of the future of AI): it’s affected by various things (choices made by Nvidia, innovation, etc.), and it affects various things (capability-level of systems made by OpenAI, relative importance of money vs talent at AI labs, etc.), and we lose nothing by thinking in terms of the cost of compute (relative to, e.g., the effects of the choices made by Nvidia on the capability-level of systems made by OpenAI).
“When Moore’s law will end” is an example of something that is not a node (in the context of the future of AI), since you’d be much better off thinking in terms of the underlying causes and effects.
The relations relevant to nodes are analytical not causal. For example, “the cost of compute” is a node between “evidence about historical progress” and “timelines,” not just between “stuff Nvidia does” and “stuff OpenAI does.” (You could also make a causal model, but here I’m interested in analytical models.)
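As a toy illustration, the “cost of compute” example could be written down as a little analytical graph (purely a restatement of the above as a data structure):

```python
# Toy representation of an analytical graph. A good node compresses everything
# upstream of it into one value before propagating downstream.
# The lists below just restate the cost-of-compute example from the text.

analytical_graph = {
    "cost of compute": {
        "informed_by": ["evidence about historical progress", "stuff Nvidia does"],
        "informs": [
            "timelines",
            "capability-level of systems made by OpenAI",
            "relative importance of money vs talent at AI labs",
        ],
    },
    # "When Moore's law will end" would be a poor node: its upstream and
    # downstream relations aren't modular, so you'd model the underlying
    # causes and effects directly instead.
}
```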
Object level. I’m not sure how good “timelines,” “takeoff,” “polarity,” and “wakeup to capabilities” are as nodes. Most of the time it seems fine to talk about e.g. “effects on timelines” and “implications of timelines.” But maybe this conceals confusion.
Maybe AI Will Happen Outside US/China
I’m interested in the claim that important AI development (in the next few decades) will largely occur outside any of the states that currently look likely to lead AI development. I don’t think this is likely, but I haven’t seen discussion of this claim.[1] This would matter because it would greatly affect the environment in which AI is developed and affect which agents are empowered by powerful AI.
Epistemic status: brainstorm. May be developed into a full post if I learn or think more.
I. Causes
The big tech companies are in the US and China, and discussion often assumes that these two states have a large lead on AI development. So how could important development occur in another state? Perhaps other states’ tech programs (private or governmental) will grow. But more likely, I think, an already-strong company leaves the US for a new location.
My legal knowledge is insufficient to say with any confidence how easily companies can leave their states. My impression is that large American companies largely can leave while large Chinese companies cannot.
Why might a big tech company or AI lab want to leave a state?[2]
Fleeing expropriation/nationalization. States can largely expropriate companies’ property within their territory unless they have contracted otherwise. A company may be able to protect its independence by securing legal protection from expropriation from another state, then moving its hardware to that state. It may move its headquarters or workers as well.
Fleeing domestic regulation on development and/or deployment of AI.
II. Effects
The state in which powerful AI is developed has two important effects.
States set regulations. The regulatory environment around an AI lab may affect the narrow AI systems it builds and/or how it pursues AGI.
State influence & power. The state in which AGI is achieved can probably nationalize that project (perhaps well before AGI). State control of powerful AI affects how it will be used.
III. AI deployment before superintelligence
Eliezer recently tweeted that AI might be low-impact until superintelligence because of constraints on deployment. This seems partially right — for example, medicine and education seem like areas in which marginal improvements in our capabilities have only small effects due to civilizational inadequacy. Certainly some AI systems would require local regulatory approval to be useful; those might well be limited in the US. But a large fraction of AI systems won’t be prohibited by plausible American regulation. For example, I would be quite surprised if the following kinds of systems were prohibited by regulation (disclaimer: I’m very non-expert on near-future AI):
Business services
Operations/logistics
Analysis
Productivity tools (e.g., Codex, search tools)
Online consumer services — financial, writing assistants (Codex)
Production of goods that can be shipped cheaply (like computers but not houses)
Trading
Maybe media stuff (chatbots, persuasion systems). It’s really hard to imagine the US banning chatbots. I’m not sure how persuasion-AI is implemented; custom ads could conceivably be banned, but eliminating AI-written media is implausible.
This matters because these AI applications directly affect some places even if they couldn’t be developed in those places.
In the unlikely event that the US moves against not only the deployment but also the development of such systems, AI companies would be more likely to seek a way around regulation — such as relocating.
Rather, I have not seen reasons for this claim other than the very normal one — that leading states and companies change over time. If you have seen more discussion of this claim, please let me know.
This is most likely to be relevant to the US but applies generally.
Value Is Binary
Epistemic status: rough ethical and empirical heuristic.
Assuming that value is roughly linear in resources available after we reach technological maturity,[1] my probability distribution of value is so bimodal that it is nearly binary. In particular, I assign substantial probability to near-optimal futures (at least 99% of the value of the optimal future), substantial probability to near-zero-value futures (between −1% and 1% of the value of the optimal future), and little probability to anything else.[2] To the extent that almost all of the probability mass fits into two buckets, and everything within a bucket is almost equally valuable as everything else in that bucket, the goal maximize expected value reduces to the goal maximize probability of the better bucket.
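Spelling out that reduction, idealizing the two buckets to values exactly 1 and 0:

```latex
% Let p = P(near-optimal future); idealize the buckets to values 1 and 0:
\mathbb{E}[V] \;\approx\; p \cdot 1 + (1 - p) \cdot 0 \;=\; p
% so maximizing expected value reduces to maximizing p, the probability of
% landing in the better bucket.
```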
So rather than thinking about how to maximize expected value, I generally think about maximizing the probability of a great (i.e., near-optimal) future. This goal is easier for me to think about, particularly since I believe that the paths to a great future are rather homogeneous — alike not just in value but in high-level structure. In the rest of this shortform, I explain my belief that the future is likely to be near-optimal or near-zero.
Substantial probability to near-optimal futures.
I have substantial credence that the future is at least 99% as good as the optimal future.[3] I do not claim much certainty about what the optimal future looks like — my baseline assumption is that it involves increasing and improving consciousness in the universe, but I have little idea whether that would look like many very small minds or a few very big minds. Or perhaps the optimal future involves astronomical-scale acausal trade. Or perhaps future advances in ethics, decision theory, or physics will have unforeseeable implications for how a technologically mature civilization can do good.
But uniting almost all of my probability mass for near-optimal futures is how we get there, at a high level: we create superintelligence, achieve technological maturity, solve ethics, and then optimize. Without knowing what this looks like in detail, I assign substantial probability to the proposition that humanity successfully completes this process. And I think almost all futures in which we do complete this process look very similar: they have nearly identical technology, reach the same conclusions on ethics, have nearly identical resources available to them (mostly depending on how long it took them to reach maturity), and so produce nearly identical value.
Almost all of the remaining probability to near-zero futures.
This claim is bolder, I think. Even if it seems reasonable to expect a substantial fraction of possible futures to converge to near-optimal, it may seem odd to expect almost all of the rest to be near-zero. But I find it difficult to imagine any other futures.
For a future to not be near-zero, it must involve using a nontrivial fraction of the resources available in the optimal future (by my assumption that value is roughly linear in resources). More significantly, the future must involve using resources at a nontrivial fraction of the efficiency of their use in the optimal future. This seems unlikely to happen by accident. In particular, I claim:
If a future does not involve optimizing for the good, value is almost certainly near-zero.
Roughly, this holds if all (nontrivially efficient) ways of promoting the good are not efficient ways of optimizing for anything else that we might optimize for. I strongly intuit that this is true; I expect that as technology improves, efficiently producing a unit of something will produce very little of almost all other things (where “thing” includes not just stuff but also minds, qualia, etc.).[4] If so, then value (or disvalue) is (in expectation) a negligible side effect of optimization for other things. And I cannot reasonably imagine a future optimized for disvalue, so I think almost all non-near-optimal futures are near-zero.
So I believe that either we optimize for value and get a near-optimal future, or we do anything else and get a near-zero future.
Intuitively, it seems possible to optimize for more than one value. I think such scenarios are unlikely. Even if our utility function has multiple linear terms, unless there is some surprisingly good way to achieve them simultaneously, we optimize by pursuing one of them near-exclusively.[5] Optimizing a utility function that looks more like min(x,y) may be a plausible result of a grand bargain, but such a scenario requires that, after we have mature technology, multiple agents have nontrivial bargaining power and different values. I find this unlikely; I expect singleton-like scenarios and that powerful agents will either all converge to the same preferences or all have near-zero-value preferences.
I mostly see “value is binary” as a heuristic for reframing problems. It also has implications for what we should do: to the extent that value is binary (and to the extent that doing so is feasible), we should focus on increasing the probability of great futures. If a “catastrophic” future is one in which we realize no more than a small fraction of our value, then a great future is simply one which is not catastrophic and we should focus on avoiding catastrophes. But of course, “value is binary” is an empirical approximation rather than an a priori truth. Even if value seems very nearly binary, we should not reject contrary proposed interventions[6] or possible futures out of hand.
I would appreciate suggestions on how to make these ideas more formal or precise (in addition to comments on what I got wrong or left out, of course). Also, this shortform relies on argument by “I struggle to imagine”; if you can imagine something I cannot, please explain your scenario and I will justify my skepticism or update.
You would reject this if you believed that astronomical-scale goods are not astronomically better than Earth-scale goods or if you believed that some plausible Earth-scale bad would be worse than astronomical-scale goods are good.
“Optimal” value is roughly defined as the expected value of the future in which we act as well as possible, from our current limited knowledge about what “acting well” looks like. “Zero” is roughly defined as any future in which we fail to do anything astronomically significant. I consider value relative to the optimal future, ignoring uncertainty about how good the optimal future is — we should theoretically act as if we’re in a universe with high variance in value between different possibilities, but I don’t see how this affects what we should choose before reaching technological maturity.*
*Except roughly that we should act with unrealistically low probability that we are in a kind of simulation in which our choices matter very little or have very differently-valued consequences than otherwise. The prospect of such simulations might undermine my conclusions—value might still be binary, but for the wrong reason—so it is useful to be able to almost-ignore such possibilities.
That is, at least 99% of the way from the zero-value future to the optimal future.
If we particularly believe that value is fragile, we have an additional reason to expect this orthogonality. But I claim that different goals tend to be orthogonal at high levels of technology independent of value’s fragility.
This assumes that all goods are substitutes in production, which I expect to be nearly true with mature technology.
That is, those that affect the probability of futures outside the binary or that affect how good the future is within the set of near-zero (or near-optimal) futures.
After reading the first paragraph of your above comment only, I want to note that:
I assign much lower probability to near-optimal futures than near-zero-value futures.
This is mainly because I imagine a lot of the “extremely good” possible worlds I imagine when reading Bostrom’s Letter from Utopia are <1% of what is optimal.
I also think the amount of probability I assign to 1%-99% futures is (~10x?) larger than the amount I assign to >99% futures.
(I’d like to read the rest of your comment later (but not right now due to time constraints) to see if it changes my view.)
I agree that near-optimal is unlikely. But I would be quite surprised by 1%-99% futures because (in short) I think we do better if we optimize for good and do worse if we don’t. If our final use of our cosmic endowment isn’t near-optimal, I think we failed to optimize for good and would be surprised if it’s >1%.
Agreed with this given how many orders of magnitude potential values span.
Rescinding my previous statement:
> I also think the amount of probability I assign to 1%-99% futures is (~10x?) larger than the amount I assign to >99% futures.
I’d now say that probably the probability of 1%-99% optimal futures is <10% of the probability of >99% optimal futures.
This is because 1% optimal is very close to being optimal (only 2 orders of magnitude away out of dozens of orders of magnitude of very good futures).
Related idea, off the cuff, rough. Not really important or interesting, but might lead to interesting insights. Mostly intended for my future selves, but comments are welcome.
Binaries Are Analytically Valuable
Suppose our probability distribution for alignment success is nearly binary. In particular, suppose that we have high credence that, by the time we can create an AI capable of triggering an intelligence explosion, we will have
really solved alignment (i.e., we can create an aligned AI capable of triggering an intelligence explosion at reasonable extra cost and delay) or
really not solved alignment (i.e., we cannot create a similarly powerful aligned AI, or doing so would require very unreasonable extra cost and delay)
(Whether this is actually true is irrelevant to my point.)
Why would this matter?
Stating the risk from an unaligned intelligence explosion is kind of awkward: it’s that the alignment tax is greater than what the leading AI project is able/willing to pay. Equivalently, our goal is for the alignment tax to be less than what the leading AI project is able/willing to pay. This gives rise to two nice, clean desiderata:
Decrease the alignment tax
Increase what the leading AI project is able/willing to pay for alignment
But unfortunately, we can’t similarly split the goal (or risk) into two goals (or risks). For example, a breakdown into the following two goals does not capture the risk from an unaligned intelligence explosion:
Make the alignment tax less than 6 months and a trillion dollars
Make the leading AI project able/willing to spend 6 months and a trillion dollars on aligning an AI
It would suffice to achieve both of these goals, but doing so is not necessary. If we fail to reduce the alignment tax this far, we can compensate by doing better on the willingness-to-pay front, and vice versa.
But if alignment success is binary, then we actually can decompose the goal (bolded above) into two necessary (and jointly sufficient) conditions:
Really solve alignment; i.e., reduce the alignment tax to [reasonable value]
Make the leading AI project able/willing to spend [reasonable value] on alignment
(Where [reasonable value] depends on what exactly our binary-ish probability distribution for alignment success looks like.)
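Restating the decomposition in symbols, with T standing in for [reasonable value]:

```latex
% Success means the alignment tax is no more than what the leading project
% is able/willing to pay:
\text{success} \;\iff\; \text{tax} \le \text{WTP}
% Under the binary assumption, tax is either T or prohibitively large, so the
% single condition factors into two necessary, jointly sufficient conditions:
\text{success} \;\iff\; (\text{tax} = T) \;\wedge\; (\text{WTP} \ge T)
```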
Breaking big goals down into smaller goals—in particular, into smaller necessary conditions—is valuable, analytically and pragmatically. Binaries help, when they exist. Sometimes weaker conditions on the probability distribution, those of the form a certain important subset of possibilities has very low probability, can be useful in the same way.
How do corporate campaigns and leaderboards effect change?
https://ailabwatch.org/resources/integrity
Writing a thing on lab integrity issues. Planning to publish early Monday morning [edit: will probably hold off in case Anthropic clarifies nondisparagement stuff]. Comment on this public google doc or DM me. I’m particularly interested in stuff I’m missing or existing writeups on this topic.
AI strategy research
Projects, project generators, prompts. Mostly for personal use.
Some prompts inspired by Framing AI strategy:
Plans
What plans would be good?
Given a particular plan that is likely to be implemented, what interventions or desiderata complement that plan (by making it more likely to succeed or by being better in worlds where it succeeds)?
Affordances: for various relevant actors, what strategically significant actions could they take? What levers do they have? What would it be great for them to do (or avoid)?
Intermediate goals: what goals or desiderata are instrumentally useful?
Threat modeling: for various threats, model them well enough to understand necessary and sufficient conditions for preventing them.
Memes (& frames): what would it be good if people believed or paid attention to?
For forecasting prompts, see List of uncertainties about the future of AI.
Some miscellaneous prompts:
Slowing AI
How can various relevant actors slow AI?
How can the AI safety community slow AI?
What considerations or side effects are relevant to slowing AI?
How do labs act, as a function of [whatever determines that]? In particular, what’s the deal with “racing”?
AI risk advocacy
How could the AI safety community do AI risk advocacy well?
What considerations or side effects are relevant to AI risk advocacy?
What’s the deal with crunch time?
How will the strategic landscape be different in the future?
What will be different near the end, and what interventions or actions will that enable? In particular, is eleventh-hour coordination possible? (Also maybe emergency brakes that don’t appear until the end.)
What concrete asks should we have for labs? for government?
Meta: how can you help yourself or others do better AI strategy research?
Four kinds of actors/processes/decisions are directly very important to AI governance:
Corporate self-governance
Adopting safety standards
Providing a model for government regulation
US policy (and China, EU, UK, and others to a lesser extent)
Regulation
Incorporating standards into law
Standard-setters setting standards
International relations
Treaties
Informal influence on safety standards
Related: How technical safety standards could promote TAI safety.
(“Safety standards” sounds prosaic but it doesn’t have to be.)
AI risk decomposition based on agency or powerseeking or adversarial optimization or something
Epistemic status: confused.
Some vague, closely related ways to decompose AI risk into two kinds of risk:
Risk due to AI agency vs risk unrelated to agency
Risk due to AI goal-directedness vs risk unrelated to goal-directedness
Risk due to AI planning vs risk unrelated to planning
Risk due to AI consequentialism vs risk unrelated to consequentialism
Risk due to AI utility-maximization vs risk unrelated to utility-maximization
Risk due to AI powerseeking vs risk unrelated to powerseeking
Risk due to AI optimizing against you vs risk unrelated to adversarial optimization
The central reason to worry about powerseeking/whatever AI, I think, is that sufficiently (relatively) capable goal-directed systems instrumentally converge to disempowering you.
The central reason to worry about non-powerseeking/whatever AI, I think, is failure to generalize correctly from training—distribution shift, Goodhart, You get what you measure.
Biological bounds on requirements for human-level AI
Facts about biology bound requirements for human-level AI. In particular, here are three prima facie bounds (a rough back-of-envelope version of the Lifetime bound follows the list):
Lifetime. Humans develop human-level cognitive capabilities over a single lifetime, so (assuming our artificial learning algorithms are less efficient than humans’ natural learning algorithms) training a human-level model takes at least the inputs used over the course of babyhood-to-adulthood.
Evolution. Evolution found human-level cognitive capabilities by blind search, so (assuming we can search at least that well, and assuming evolution didn’t get lucky) training a human-level model takes at most the inputs used over the course of human evolution (plus Lifetime inputs, but that’s relatively trivial).
Genome. The size of the human genome is an upper bound on the complexity of humans’ natural learning algorithms. Training a human-level model takes at most the inputs needed to find a learning algorithm at most as complex as the human genome (plus Lifetime inputs, but that’s relatively trivial). (Unfortunately, the existence of human-level learning algorithms of certain simplicity says almost nothing about the difficulty of finding such algorithms.) (Ajeya’s “genome anchor” is pretty different—”a transformative model would . . . have about as many parameters as there are bytes in the human genome”—and makes no sense to me.)
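Here is the promised rough back-of-envelope version of the Lifetime bound; every number is a loose order-of-magnitude assumption (brain FLOP/s estimates alone span several orders of magnitude), so the output is illustrative only:

```python
# Back-of-envelope Lifetime bound. All numbers are rough order-of-magnitude
# assumptions (brain FLOP/s estimates alone span roughly 1e13-1e17), so the
# result is illustrative only.

SECONDS_PER_YEAR = 3.15e7
BRAIN_FLOP_PER_SECOND = 1e15   # assumed
YEARS_TO_ADULTHOOD = 30        # assumed

lifetime_flop = BRAIN_FLOP_PER_SECOND * YEARS_TO_ADULTHOOD * SECONDS_PER_YEAR
print(f"Lifetime bound: ~{lifetime_flop:.0e} FLOP")  # roughly 1e24 FLOP

# The Evolution bound multiplies something like this across all the neural
# computation done over evolutionary history, which is why it lands many
# orders of magnitude higher.
```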
(A human-level AI should use similar computation as humans per subjective time. This assumption/observation is weird and perhaps shows that something weird is going on, but I don’t know how to make that sharp.)
There are few sources of bounds on requirements for human-level AI. Perhaps fundamental limits or reasoning about blind search could give weak bounds, but biology is the only example of human-level cognitive abilities and so the only possible source of reasonable bounds.
Related: Ajeya Cotra’s Forecasting TAI with biological anchors (most relevant section) and Eliezer Yudkowsky’s Biology-Inspired AGI Timelines: The Trick That Never Works.
What do people (outside this community) think about AI? What will they think in the future?
Attitudes predictably affect relevant actors’ actions, so this is a moderately important question. And it’s rather neglected.
Groups whose attitudes are likely to be important include ML researchers, policymakers, and the public.
On attitudes among the public, surveys provide some information, but I suspect attitudes will change (in potentially predictable ways) as AI becomes more salient and some memes/framings get locked in. Perhaps some survey questions (maybe general sentiment on AI) are somewhat robust to changes in memes while others (maybe beliefs on how AI affects the economy or attitudes on regulation) may change a lot in the near future.
On attitudes among ML researchers, surveys (e.g.) provide some information; notably, most ML researchers say there’s at least a 5% probability of doom (or 10%, depending on how you ask), but for some reason this doesn’t seem to translate into their actions or culture. Perhaps interviews would reveal researchers’ attitudes better than closed-ended surveys (note to self: talk to Vael Gates).
AI may become much more salient in the next few years, and memes/framings may get locked in.
Critically, this is only necessary if we assume that researchers care about basically everyone in the present (to a loose approximation). If we instead model researchers as basically selfish by default, then the low chance of a technological singularity outweighs the high chance of death, especially for older folks.
Basically, this could be explained as a goal alignment problem: LW and AI Researchers have very different goals in mind.