I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people working on understanding language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.
Relevant paper discussing the risk of risk assessments being wrong due to theory/model/calculation error: Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes
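If I recall the paper’s framing correctly (treat the notation here as my own sketch): let A be the event that the safety argument, including its model, theory, and calculations, is sound. Then

$$P(\text{disaster}) = P(\text{disaster}\mid A)\,P(A) + P(\text{disaster}\mid \neg A)\,P(\neg A)$$

so once the computed $P(\text{disaster}\mid A)$ is pushed below $P(\neg A)$, the probability that the argument itself is flawed, the second term dominates and the headline number mostly reflects methodology risk rather than the calculation.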
Based on the current vibes, I think this suggests that methodological errors alone will lead to a significant chance of significant error in any safety case for AI.
IMO it’s unlikely that we’re ever going to have a safety case as reliable as the nuclear physics calculations showing that the Trinity test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by the risk of getting the calculations wrong). If what we have is less reliable, will we ever be in a position where the safety case alone gives a low enough probability of disaster to justify launching an AI system beyond the frontier where disastrous capabilities have been demonstrated?
Thus, in practice, decisions will probably not be made on a safety case alone, but also on some positive case for the benefits of deployment (e.g. estimated reduction in x-risk, advancing the “good guys” in the race, the CEO having positive vibes that enough risk mitigation has been done, etc.). It’s not clear what role governments should have in assessing this; maybe we can only get assessment of the safety case. But it’s useful to note that safety cases won’t be the only thing informing these decisions. This situation is pretty disturbing, and I wish we had a better way, but it still seems useful to push the positive-benefit case more towards “careful argument about reduced x-risk” and away from “CEO vibes about whether enough mitigation has been done”.
Imo I don’t know if we have evidence that Anthropic deliberately cultivated or significantly benefited from the appearance of a commitment. However, if an investor or employee felt they had made substantial commitments based on this impression and then later felt betrayed, that would be more serious. (The story here is, I think, importantly different from other cases where there were substantial benefits from the appearance of a commitment, followed by its violation.)
Everyone is afraid of the AI race, and hopes that one of the labs will actually end up doing what they think is the most responsible thing to do. Hope and fear is one hell of a drug cocktail, makes you jump to the conclusions you want based on the flimsiest evidence. But the hangover is a bastard.
Really, the race started when OpenAI released GPT-4; it’s been going on for a while, and this is just another event that makes it clear.
It would be an interesting philosophical experiment to have models trained on model spec v1 try to improve the model spec for v2: would this get better, or go off the rails?
You get more discrete transitions when one s-curve process takes the lead from another s-curve process, e.g. deep learning taking over from other AI methods.
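As a toy illustration (all numbers made up, just to show the shape of the claim): each individual method follows a smooth s-curve, but the frontier, the max over methods, shows a visible kink when a newer curve overtakes an older one.

```python
import numpy as np

def logistic(t, midpoint, rate, ceiling):
    """A smooth s-curve: slow start, rapid growth, then saturation."""
    return ceiling / (1 + np.exp(-rate * (t - midpoint)))

t = np.linspace(1990, 2030, 401)
old_method = logistic(t, midpoint=2000, rate=0.3, ceiling=1.0)  # e.g. earlier AI methods
new_method = logistic(t, midpoint=2018, rate=0.6, ceiling=5.0)  # e.g. deep learning

# The capability frontier is whichever method is best at each point in time;
# each curve is smooth, but the frontier has a kink where the lead changes.
frontier = np.maximum(old_method, new_method)

crossover = t[np.argmax(new_method > old_method)]
print(f"The newer s-curve takes the lead around {crossover:.0f}")
```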
Probably shouldn’t limit oneself to thinking in terms of only 3 game phases or fitting into one specific game; in general there can be n phases, where different phases have different characteristics.
If anyone wants to work on this, there’s a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench
I think that’s how people should generally react in the absence of harder commitments and accountability measures.
I think the right way to think about verbal or written commitments is that they increase the cost of taking a certain course of action. A legal contract means the price can be a civil lawsuit ending in financial damages. For a non-legal commitment, the price of breaking it is that the person you made the commitment to gets angry at you, and you gain a reputation for being the sort of person who breaks commitments. It’s always an option for someone to break a commitment and pay the price; even laws carrying criminal penalties can be broken if someone is willing to run the risk or pay the price.
In this framework, it’s reasonable to be somewhat angry at someone or some corporation who breaks a soft commitment to you, in order to increase the perceived cost of breaking soft commitments to you and people like you.
Individuals on average may tend more towards keeping important commitments because of the reputational and relationship costs, but corporations, as groups of people, may tend to think only in terms of financial and legal costs, and so be more willing to break soft commitments (especially if one person in the organization makes the commitment but other people break it). So when relating to corporations, you should be more skeptical of non-legally-binding commitments (and even for legally binding ones, pay attention to the real price of breaking them).
Yeah, I think it’s good if labs are willing to make more “cheap talk” statements of vague intentions, so you can learn how they think. Everyone should understand that these aren’t real commitments, and not get annoyed if these don’t end up meaning anything. This is probably the best way to view “statements by random lab employees”.
Imo it would also be good to have more “changeable commitments” in between: statements like “we’ll follow policy X until we change the policy, and when we do, we commit to clearly informing everyone about the change”, which is maybe closer to the current status of most RSPs.
I’d have more confidence in Anthropic’s governance if the board or LTBT had some full-time independent members who weren’t employees. IMO labs should consider paying board members a full-time salary but no equity, through some kind of mechanism where the money is set aside and paid out over some period in the future even if the lab dissolves, so there’s no incentive to avoid actions that would cost the lab. Board salaries could maybe be pegged to some level of technical employee salary, so that technical experts could take on board roles. Boards full of busy people really can’t do their job of checking whether the organization is fulfilling its stated mission, and IMO this is one of the most important jobs in the world right now. Also, full-time board members would have fewer conflicts of interest outside the lab (since they wouldn’t hold some other full-time job that might conflict).
Like, in chess, the early game starts from a state where many pieces can’t move; in the middle game many pieces are in play, moving around and trading; then in the end game only a few pieces are left, and you know what the goal is and roughly how things will play out.
In AI there were only a handful of players; then ChatGPT/GPT-4 came out and now everyone is rushing to get in (which I’d mark as the start of the mid-game), but over time many players will probably become irrelevant or fold as the table stakes (training costs) get too high.
In my head the end-game is when the AIs themselves start becoming real players.
Also you would need clarity on how to measure the commitment.
It’s quite possible that Anthropic has some internal definition of “not meaningfully advancing the capabilities frontier” that is compatible with this release. But imo they shouldn’t get any credit unless they explain it.
Would be nice, but I was thinking of metrics that require “we’ve done the hard work of understanding our models and making them more reliable”; better neuron explanation seems more like just another smartness test.
IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don’t want to race).
https://x.com/alexalbert__/status/1803837844798189580
Not sure about the accuracy of this graph, but the general picture seems to match what companies claim, and the vibe is racing.
I do think there are distinct questions about “is there a race” vs. “will this race action lead to bad consequences” vs. “is this race action morally condemnable”. I’m hoping that this race action is not too consequentially bad; maybe it’s consequentially good, or maybe it still has negative Shapley value even if its expected value is okay. There is some sense in which it is morally icky.
Would be nice if it was based on “actual robot army was actually being built and you have multiple confirmatory sources and you’ve tried diplomacy and sabotage and they’ve both failed” instead of “my napkin math says they could totally build a robot army bro trust me bro” or “they totally have WMDs bro” or “we gotta blow up some Japanese civilians so that we don’t have to kill more Japanese civilians when we invade Japan bro” or “dude I’m seeing some missiles on our radar, gotta launch ours now bro”.