I’m launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.
It’s a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.
(It’s much better on desktop than mobile — don’t read it on mobile.)
It’s in beta—leave feedback here or comment or DM me—but I basically endorse the content and you’re welcome to share and discuss it publicly.
It’s unincorporated, unfunded, not affiliated with any orgs/people, and is just me.
Some clarifications and disclaimers.
How you can help:
Give feedback on how this project is helpful or how it could be different to be much more helpful
Tell me what’s wrong/missing; point me to sources on what labs should do or what they are doing
Suggest better evaluation criteria
Share this
Help me find an institutional home for the project
Offer expertise on a relevant topic
Offer to collaborate
Volunteer your webdev skills
(Pitch me on new projects or offer me a job)
(Want to help and aren’t sure how to? Get in touch!)
I think this project is the best existing resource for several kinds of questions, but I think it could be a lot better. I’m hoping to receive advice (and ideally collaboration) on taking it in a more specific direction. Also interested in finding an institutional home. Regardless, I plan to keep it up to date. Again, I’m interested in help but not sure what help I need.
I could expand the project (more categories, more criteria per category, more labs); I currently expect that it’s more important to improve presentation stuff but I don’t know how to do that; feedback will determine what I prioritize. It will also determine whether I continue spending most of my time on this or mostly drop it.
I just made a twitter account. I might use it to comment on stuff labs do.
Thanks to many friends for advice and encouragement. Thanks to Michael Keenan for doing most of the webdev. These people don’t necessarily endorse this project.
Various comments:
I wouldn’t call this “AI lab watch.” “Lab” has the connotation that these are small projects instead of multibillion dollar corporate behemoths.
“deployment” initially sounds like “are they using output filters which harm UX in deployment”, but instead this seems to be penalizing organizations if they open source. This seems odd since open sourcing is not clearly bad right now. The description also makes claims like “Meta release all of their weights”—they don’t release many image/video models because of deepfakes, so they are doing some cost-benefit analysis. Zuck: “So we want to see what other people are observing, what we’re observing, what we can mitigate, and then we’ll make our assessment on whether we can make it open source.” If this is mainly a penalty against open sourcing the label should be clearer.
“Commit to do pre-deployment risk assessment” They’ve all committed to this in the WH voluntary commitments and I think the labs are doing things on this front.
“Do risk assessment” These companies have signed on to WH voluntary commitments so are all checking for these things, and the EO says to check for these hazards too. This is why it’s surprising to see Microsoft have 1% given that they’re all checking for these hazards.
Looking at the scoring criteria, this seems highly fixated on rogue AIs, but I understand I’m saying that to the original forum of these concerns. Risk assessment’s scoring doesn’t really seem to prioritize bio x-risk as much as scheming AIs. This is strange because if we’re focused on rogue AIs I’d put half the priority on risk mitigation while the model is training. Many rogue AI people may think that in half of the cases where the AI kills everyone, it does so while the model is “training” (because it will escape during that time).
The first sentence of this site says the focus is on “extreme risks” but it seems the focus is mainly on rogue AIs. This should be upfront that this is from the perspective that loss of control is the main extreme risk, rather than positioning itself as a comprehensive safety tracker. If I were tracking rogue AI risks, I’d probably drill down to what they plan to do with automated AI R&D/intelligence explosions.
“Training” This seems to give way more weight to rogue AI stuff. Red teaming is actually assessable, but instead you’re giving twice the points for whether they have someone “work on scalable oversight.” This seems like an EA vibes check rather than actually measuring something. This also seems like triple counting since it’s highly associated with the “scalable alignment” section and the “alignment program” section. This doesn’t even require that they use the technique for the big models they train and deploy. Independently, capabilities work related to building superintelligences can easily be framed as scalable oversight, so this doesn’t set good incentives. Separately, at the end this also gives lots of points for voluntary (read: easily breakable) commitments. These should not be trusted and I think the amount of lip-service points is odd.
“Security” As I said on EAF the security scores are suspicious to me and even look backward. The major tech companies have much more experience protecting assets (e.g., clouds need to be highly secure) than startups like Anthropic and OpenAI. It takes years building up robust information security and the older companies have a sizable advantage.
“internal governance” scores seem odd. Older, larger institutions such as Microsoft and Google have many constraints and processes and don’t have leaders who can unilaterally make decisions as easily, compared to startups. Their CEOs are also more fireable (OpenAI), and their board members aren’t all selected by the founder (Anthropic). This seems highly keyed to whether they are a PBC or non-profit. In practice PBC just makes it harder to sue, but Zuck has such control of his company that getting successfully sued for not upholding his fiduciary duty to shareholders seems unlikely. It seems 20% of the points are for not using non-disparagement agreements?? 30% is for whistleblower policies; CA has many whistleblower protections if I recall correctly. No points for a chief risk officer or internal audit committee?
“Alignment program” “Other labs near the frontier publish basically no alignment research” Meta publishes dozens of papers they call “alignment”; these actually don’t feel that dissimilar to Constitutional AI-like papers (https://twitter.com/jaseweston/status/1748158323369611577 https://twitter.com/jaseweston/status/1770626660338913666 https://arxiv.org/pdf/2305.11206 ). These papers aren’t posted to LW but they definitely exist. To be clear I think this is general capabilities but this community seems to think differently. Alignment cannot be “did it come from EA authors” and it probably should not be “does it use alignment in its title.” You’ll need to be clear how this distinction is drawn.
Meta has people working on safety and CBRN+cyber + adversarial robustness etc. I think they’re doing a good job (here are two papers from the last month: https://arxiv.org/pdf/2404.13161v1 https://arxiv.org/pdf/2404.16873).
As is, I think this is a little too quirky and not ecumenical enough for it to generate social pressure.
There should be points for how the organizations act with respect to legislation. On SB 1047, the bill that CAIS co-sponsored, we’ve noticed that some AI companies are much more antagonistic than others. I think this is probably a larger differentiator for an organization’s goodness or badness.
(Won’t read replies since I have a lot to do today.)
This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.
(I have some factual disagreements. I may edit them into this comment later.)
(If Dan’s comment makes you suspect this project is full of issues/mistakes, react 💬 and I’ll consider writing a detailed soldier-ish reply.)
I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug.
The deployment criteria start with:
This criterion seems to allow a lab to meet it by having good risk assessment criteria, but the rest of the criteria contain specific countermeasures that:
Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC).
Don’t pass cost-benefit for current models, which pose low risk. (And it seems the criterion is “do you have them implemented right now?”)
If the lab had an excellent risk assessment policy and released weights when the cost/benefit seemed good, that should be fine according to the “deployment” criteria IMO.
Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I get the sense that these criteria don’t quite handle the edge cases necessary to accommodate reasonable choices orgs might make.
(This is partially my fault as I didn’t notice this when providing feedback on this project.)
(IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn’t leak architecture secrets.)
(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I’m afraid.)
Related: maybe a lab should get full points for a risky release if the lab says it’s releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It’s conceivable that a perfectly responsible lab would do such a thing.
Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)
Thanks. I agree you’re pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.
I didn’t put much effort into clarifying this kind of thing because it’s currently moot—I don’t think it would change any lab’s score—but I agree.[1] I think e.g. a criterion “use KYC” should technically be replaced with “use KYC OR say/demonstrate that you’re prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn’t too high].”
Yeah. The criteria can be like “implement them or demonstrate that you could implement them and have a good plan to do so,” but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don’t work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)
I’m interested in suggestions :shrug:
And I think my site says some things that contradict this principle, like ‘these criteria require keeping weights private.’ Oops.
Hmm, yeah it does seem thorny if you can get the points by just saying you’ll do something.
Like I absolutely think this shouldn’t count for security. I think you should have to demonstrate actual security of model weights and I can’t think of any demonstration of “we have the capacity to do security” which I would find fully convincing. (Though setting up some inference server at some point which is secure to highly resourced pen testers would be reasonably compelling for demonstrating part of the security portfolio.)
@Dan H are you able to say more about which companies were most/least antagonistic?
Disagree on “lab”. I think it’s the standard and most natural term now. As evidence, see your own usage a few sentences later:
If there’s a good writeup on labs’ policy advocacy I’ll link to and maybe defer to it.
This seems like a good point. Here’s a quick babble of alts (folks could react with a thumbs-up on ones that they think are good).
AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch
I currently think “AI Corporation Watch” is more accurate. “Labs” feels like a research team, but I think these orgs are far far far more influenced by market forces than is suggested by “lab”, and “corporation” communicates that. I also think the goal here is not to point to all companies that do anything with AI (e.g. midjourney) but to focus on the few massive orgs that are having the most influence on the path and standards of the industry, and to my eye “corporation” has that association more than “company”. Definitely not sure though.
These are either tendentious (“Juggernaut”) or unnecessarily specific to the present moment (“Mega-Corp”).
How about simply “AI Watch”?
Yep, lots of people independently complain about “lab.” Some of those people want me to use scary words in other places too, like replacing “diffusion” with “proliferation.” I wouldn’t do that, and don’t replace “lab” with “mega-corp” or “juggernaut,” because it seems [incorrect / misleading / low-integrity].
I’m sympathetic to the complaint that “lab” is misleading. (And I do use “company” rather than “lab” occasionally, e.g. in the header.) My friends usually talk about “the labs,” not “the companies,” but to most audiences “company” is more accurate.
I currently think “company” is about as good as “lab.” I may change the term throughout the site at some point.
I do think being one syllable is pretty valuable. Although “AI Org Watch” might be fine (it kinda rolls off the tongue worse).
“AI Watch.”
Could consider “frontier AI watch”, “frontier AI company watch”, or “AGI watch.”
Most people in the world (including policymakers) have a much broader conception of AI. AI means machine learning, AI is the thing that 1000s of companies are using and 1000s of academics are developing, etc etc.
[I’ve talked to Zach about this project]
I think this is cool, thanks for building it! In particular, it’s great to have a single place where all these facts have been collected.
I can imagine this growing into the default reference that people use when talking about whether labs are behaving responsibly.
One reason that I’m particularly excited for this: AI-x-risk-concerned people are often accused of supporting Anthropic over other labs for reasons that are related to social affiliation rather than substantive differences. I think these accusations have some merit—if you ask AI-x-risk-concerned people for exactly how Anthropic differs from e.g. OpenAI, they often turn out to have a pretty shallow understanding of the difference. This resource makes it easier for these people to have a firmer understanding of concrete differences.
I hope also that this project makes it easier for AI-x-risk-concerned people to better allocate their social pressure on labs.
I hope that this resource is used as a measure of relative responsibleness, and this doesn’t get mixed up with absolute responsibleness. My understanding is that the resource essentially says, “here’s some things that would be good– let’s see how the labs compare on each dimension.” The resource is not saying “if a lab gets a score above X% on each metric, then we are quite confident that the lab will not cause an existential catastrophe.”
Moreover, my understanding is that the resource is not taking a position on whether or not it is “responsible”– in some absolute sense– for a lab to be scaling toward AGI in our current world. I see the resource as saying “conditional on a lab scaling toward AGI, are they doing so in a way that is relatively more/less responsible compared to the others that are scaling toward AGI.”
This might be a pedantic point, but I think it’s an important one to emphasize– a lab can score in 1st place and still present a risk to humanity that reasonable people would still deem unacceptable & irresponsible (or put differently, a lab can score in 1st place and still produce a catastrophe).
Suggest having a row for “Transparency”, to cover things like whether the company encourages or discourages whistleblowing, does it report bad news about alignment/safety (such as negative research results) or only good news (new ideas and positive results), does it provide enough info to the public to judge the adequacy of its safety culture and governance, etc.
Thanks.
Any takes on what info a company could publish to demonstrate “the adequacy of its safety culture and governance”? (Or recommended reading?)
Ideally criteria are objectively evaluable / minimize illegible judgment calls.
Unfortunately I don’t have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I’m not personally aware of such writings. To brainstorm some ideas:
Conduct and publish anonymous surveys of employee attitudes about safety.
Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
Officially encourage (e.g. via financial rewards) internal and external whistleblowers. Establish and publish policies about this.
Publicly make safety commitments and regularly report on their status, such as how much compute and other resources have been allocated/used by which safety teams.
Make/publish a commitment to publicly report negative safety news, which can be used as basis for whistleblowing if needed (i.e. if some manager decides to hide such news instead).
Publish important governance documents. (Seemed too basic to mention, but apparently not.)
As far as transparency and commitments, you could ask for:
A track record of making and holding non-trivial commitments. (Of any kind.)
some sort of proposal like this one
Two noncentral pages I like on the site:
Other scorecards & evaluation, collecting other safety-ish scorecard-ish resources.
Commitments, collecting AI companies’ commitments relevant to AI safety and extreme risks.
So Alignment program is to be updated to 0 for OpenAI now that Superalignment team is no more? ( https://docs.google.com/document/d/1uPd2S00MqfgXmKHRkVELz5PdFRVzfjDujtu8XLyREgM/edit?usp=sharing )
Thanks for doing this!
One suggestion: it would be very useful if people could interactively experiment with modifications, eg if they thought scalable alignment should be weighted more heavily, or if they thought Meta should receive 0% for training. An MVP version of this would just be a Google spreadsheet that people could copy and modify.
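For anyone who wants to try this kind of experiment outside a spreadsheet, here is a minimal sketch. It assumes the overall score is a weighted average of per-category percentages; the category weights and the lab’s scores below are invented placeholders, not the site’s actual data, and `overall_score` is a hypothetical helper name.

```python
def overall_score(category_scores, weights):
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(category_scores[c] * w for c, w in weights.items()) / total_weight


# Hypothetical weights and scores, for illustration only.
weights = {"deployment": 24, "training": 14, "scalable_alignment": 10}
lab = {"deployment": 50, "training": 40, "scalable_alignment": 0}

baseline = overall_score(lab, weights)

# Experiment 1: weight scalable alignment twice as heavily.
heavier_alignment = overall_score(lab, dict(weights, scalable_alignment=20))

# Experiment 2: override one category score (e.g. give 0% for training).
zero_training = overall_score(dict(lab, training=0), weights)

print(baseline, heavier_alignment, zero_training)
```

Copying the dictionaries with `dict(original, key=value)` leaves the baseline untouched, so several modifications can be compared side by side.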
Thanks for the feedback. I’ll add “let people download all the data” to my todo list but likely won’t get to it. I’ll make a simple google sheet now.
Google sheet.
Some overall scores are one point higher. Probably because my site rounds down. Probably my site should round to the nearest integer...
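A quick sketch of how that one-point gap can arise, assuming one side rounds down and the other rounds to the nearest integer. The raw score is an invented example, and the function names are hypothetical.

```python
import math


def round_down(score):
    """Truncate a non-negative score to the integer below it."""
    return math.floor(score)


def round_nearest(score):
    """Round to the nearest integer, with halves rounding up.

    Python's built-in round() sends halves to the nearest even integer,
    so floor(score + 0.5) is used here instead.
    """
    return math.floor(score + 0.5)


raw = 36.7  # hypothetical overall score
print(round_down(raw), round_nearest(raw))
```

Any raw score whose fractional part is 0.5 or more shows up one point higher under round-to-nearest than under round-down.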
Fantastic, thanks!