I’m somewhat confused about when these evaluations are performed (i.e., how much safety training the model has undergone). OpenAI’s paper says: “Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024,” so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I’m curious.
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it’s good if evaluators are just considered normal employees and aren’t separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of a rival “AI safety” tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
This situation seems backwards to me. Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.
Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.
You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right call to begin with. There are clearly some safety benefits to having access to frontier models, but the question is: are those benefits worth the cost? Given that this is (imo) by far the most important strategic consideration for Anthropic, I’m hoping for far more elaboration here. Why does Anthropic believe it’s important to work on advancing capabilities at all? Why is it worth the potentially world-ending costs?
This section also doesn’t explain why Anthropic needs to advance the frontier. For instance, it isn’t clear to me that anything from “Chapter 1” requires this—does remaining slightly behind the frontier prohibit Anthropic from e.g. developing automated red-teaming, or control techniques, or designing safety cases, etc.? Why? Indeed, as I understand it, Anthropic’s initial safety strategy was to remain behind other labs. Now Anthropic does push the frontier, but as far as I know no one has explained what safety concerns (if any) motivated this shift.
This is especially concerning because pushing the frontier seems very precarious, in the sense you describe here:
If [evaluations are] significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
… and here:
As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
But without a clear sense of why advancing the frontier is helpful for safety in the first place, it seems pretty easy to imagine missing this narrow target.
Like, here is a situation I feel worried about. We continue to get low quality evidence about the danger of these systems (e.g. via red-teaming). This evidence is ambiguous and confusing—if a system can in fact do something scary (such as insert a backdoor into another language model), what are we supposed to infer from that? Some employees might think it suggests danger, but others might think that, e.g., it wouldn’t be able to actually execute such plans, or that the behavior was a one-off fluke from a system still too incompetent to pose a real threat, etc. How is Anthropic going to think about this? The downside of being wrong is, as you’ve stated, extreme: a long enough pause could kill the company. And the evidence itself is almost inevitably going to be quite ambiguous, because we don’t understand what’s happening inside the model such that it’s producing these outputs.
But so long as we don’t understand enough about these systems to assess their alignment with confidence, I am worried that Anthropic will keep deciding to scale. Because when the evidence is as indirect and unclear as that which is currently possible to gather, interpreting it is basically just a matter of guesswork. And given the huge incentive to keep scaling, I feel skeptical that Anthropic will end up deciding to interpret anything but unequivocal evidence as suggesting enough danger to stop.
This is concerning because Anthropic seems to anticipate such ambiguity, as suggested by the RSP lacking any clear red lines. Ideally, if Anthropic finds that their model is capable of, e.g., self-replication, then this would cause some action like “pause until safety measures are ready.” But in fact what happens is this:
If sufficient measures are not yet implemented, pause training and analyze the level of risk presented by the model. In particular, conduct a thorough analysis to determine whether the evaluation was overly conservative, or whether the model indeed presents near-next-ASL risks.
In other words, one of the first steps Anthropic plans to take if a dangerous evaluation threshold triggers is to question whether that evaluation was actually meaningful in the first place. I think this sort of wiggle room, which is pervasive throughout the RSP, renders it pretty ineffectual—basically just a formal-sounding description of what they (and other labs) were already doing, which is attempting to crudely eyeball the risk.
And given that the RSP doesn’t bind Anthropic to much of anything, the decision making largely hinges on the quality of its company culture. For instance, here is Nick Joseph’s description:
Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
[...]
But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be.
Which is to say that Anthropic’s RSP doesn’t appear to me to pass the LeCun test. Not only is the interpretation of the evidence left up to Anthropic’s discretion (including retroactively deciding whether a test actually was a red line), but the quality of the safety tests themselves is also a function of company culture (i.e., of whether researchers are “actually trying their best” to “solicit capabilities well enough”).
I think the LeCun test is a good metric, and I think it’s good to aim for. But when the current RSP is so far from passing it, I’m left wanting to hear more discussion of how you’re expecting it to get there. What do you expect will change in the near future, such that balancing these delicate tradeoffs—too lax vs. too strict, too vague vs. too detailed, etc.—doesn’t result in another scaling policy which also doesn’t constrain Anthropic’s ability to scale roughly at all? What kinds of evidence do you expect you might encounter that would actually count as a red line? Once models become quite competent, what sort of evidence will convince you that the model is safe? Aligned? And so on.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different from what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.
I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOUs. For example, consider this description of a plan from the FSF:
When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results.
This isn’t a baseline of reasonable practices upon which better safety practices can evolve, so much as a baseline of zero—the fact that DeepMind will take some unspecified action, rather than none, if an evaluation triggers does not appear to me to be much of a safety strategy at all.
Which is especially concerning, given that DeepMind might create AGI within the next few years. If it’s too early to make commitments now, when do you expect that will change? What needs to happen, before it becomes reasonable to expect labs to make safety commitments? What sort of knowledge are you hoping to obtain, such that scaling becomes a demonstrably safe activity? Because without any sense of when these “best practices” might emerge, or how, the situation looks alarmingly much like the labs requesting a blank check for as long as they deem fit, and that seems pretty unacceptable to me.
AI is a nascent field. But to my mind its nascency is what’s so concerning: we don’t know how to ensure these systems are safe, and the results could be catastrophic. This seems like much more of a reason not to build AGI than a reason not to offer any safety commitments. But if the labs are going to do the former, then I think they have a responsibility to provide more than a document of empty promises—they should be able to state, exactly, what would cause them to stop building these potentially lethal machines. So far no lab has succeeded at this, and I think that’s pretty unacceptable. If labs would like to take up the mantle of governing their own safety, then they owe humanity the service of meaningfully doing it.
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.
Largely agree with everything here.
But, I’ve heard some people be concerned “aren’t basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?” And yeah, that does seem plausible.
I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cementing anything.” This seems alarming to me. Afaict, the current world we’re in is basically the worst case scenario—labs are racing to build AGI, and their safety approach is ~“don’t worry, we’ll figure it out as we go.” But this process doesn’t seem very likely to result in good safety plans either; charging ahead as is doesn’t necessarily beget better policies. So while I certainly agree that SSP-shaped things are woefully inadequate, it seems important, when discussing this, to keep in mind what the counterfactual is. Because the status quo is not, imo, a remotely acceptable alternative either.
Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is “pre-paradigmatic,” i.e., we don’t have the frames for filling to be very productive yet. Given that, I’m more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.
I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” But it seems to me that lobbying against regulation like this is not, in fact, inevitable. To the contrary, it seems like Anthropic is actively using their political capital—capital they had vaguely promised to spend on safety outcomes, tbd—to make the AI arms race counterfactually worse.
The main changes that Anthropic has proposed—to prevent the formation of new government agencies which could regulate them, to not be held accountable for unrealized harm—are essentially bids to continue voluntary governance. Anthropic doesn’t want a government body to “define and enforce compliance standards,” or to require “reasonable assurance” that their systems won’t cause a catastrophe. Rather, Anthropic would like for AI labs to only be held accountable if a catastrophe in fact occurs, and only so much at that, as they are also lobbying to have their liability depend on the quality of their self-governance: “but if a catastrophe happens in a way that is connected to a defect in a company’s SSP, then that company is more likely to be liable for it.” Which is to say that Anthropic is attempting to inhibit the government from imposing testing standards (what Anthropic calls “pre-harm”), and in general aims to inhibit regulation of AI before it causes mass casualty.
I think this is pretty bad. For one, voluntary self-governance is obviously problematic. All of the labs, Anthropic included, have significant incentive to continue scaling; indeed, they say as much in this document: “Many stakeholders reasonably worry that this [agency]… might end up… impeding innovation in general.” And their attempts to self-govern are so far, imo, exceedingly weak—their RSP commits to practically nothing if an evaluation threshold triggers, leaving all of the crucial questions, such as “what will we do if our models show catastrophic inclinations,” up to Anthropic’s discretion. This is clearly unacceptable—both the RSP in itself and Anthropic’s bid for it to continue to serve as the foundation of regulation. Indeed, if Anthropic would like for other companies to be safer, which I believe to be one of their main safety selling points, then they should be welcoming the government stepping in to ensure that.
Afaict their rationale for opposing this regulation is that the labs are better equipped to design safety standards than the government is: “AI safety is a nascent field where best practices are the subject of original scientific research… What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.” But there is also, imo, a large chance that Anthropic is wrong about what is actually effective at preventing catastrophic risk, especially so, given that they have incentive to play down such risks. Indeed, their RSP strikes me as being incredibly insufficient at assuring safety, as it is primarily a reflection of our ignorance, rather than a policy built from a scientific understanding, or really any understanding, of what it is we’re creating.
I am personally very skeptical that Anthropic is capable of turning our ignorance into the sort of knowledge capable of providing strong safety guarantees anytime soon, and soon is the timeframe by which Dario aims to build AGI. Such that, yes, I expect governments to do a poor job of setting industry standards, but only because I expect that a good job is not possible given our current state of understanding. And I would personally rather, in this situation where labs are racing to build what is perhaps the most powerful technology ever created, to err on the side of the government guessing about what to do, and beginning to establish some enforcement about that, than to leave it for the labs themselves to decide.
Especially so, because if you believe, as Dario seems to, that AI has a significant chance of causing massive harm, that it could “destroy us,” and that this might occur suddenly (“indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot”), then you shouldn’t be opposing regulation which could, in principle, stop this from happening. We don’t necessarily get warning shots with AI; indeed, this is one of the main problems with building it “iteratively,” one of the main problems with Anthropic’s “empirical” approach to AI safety. Because what Anthropic means by “a pessimistic scenario” is that “it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves.” Simply an empirical fact. And in what worlds do we learn this empirical fact without catastrophic outcomes?
I have to believe that Anthropic isn’t hoping to gain such evidence by way of catastrophes in fact occurring. But if they would like for such pre-harm evidence to have a meaningful impact, then it seems like having pre-harm regulation in place would be quite helpful. Because one of Anthropic’s core safety strategies rests on their ability to “sound the alarm,” indeed, this seems to account for something like ~33% of their safety profile, given that they believe “pessimistic scenarios” are around as likely as good, or only kind of bad scenarios. And in “pessimistic” worlds, where alignment is essentially unsolvable, and catastrophes are impending, their main fallback is to alert the world of this unfortunate fact so that we can “channel collective effort” towards some currently unspecified actions. But the sorts of actions that the world can take, at this point, will be quite limited unless we begin to prepare for them ahead of time.
Like, the United States government usually isn’t keen on shutting down or otherwise restricting companies on the basis of unrealized harm. And even if they were keen, I’m not sure how they would do this—legislation likely won’t work fast enough, and even if the President could sign an executive order to e.g. limit OpenAI from releasing or further creating their products, this would presumably be a hugely unpopular move without very strong evidence to back it up. And it’s pretty difficult for me to see what kind of evidence this would have to be, to take a move this drastic and this quickly. Anything short of the public witnessing clearly terrible effects, such as mass casualty, doesn’t seem likely to pass muster in the face of a political move this extreme.
But in a world where Anthropic is sounding alarms, they are presumably doing so before such catastrophes have occurred. Which is to say that without structures in place to put significant pressure on or outright stop AI companies on the basis of unrealized harm, Anthropic’s alarm sounding may not amount to very much. Such that pushing against regulation which is beginning to establish pre-harm standards makes Anthropic’s case for “sounding the alarm”—a large fraction of their safety profile—far weaker, imo. But I also can’t help but feel that these are not real plans; not in the beliefs-pay-rent kind of way, at least. It doesn’t seem to me that Anthropic has really gamed out what such a situation would look like in sufficient detail for it to be a remotely acceptable fallback in the cases where, oops, AI models begin to pose imminent catastrophic risk. I find this pretty unacceptable, and I think Anthropic’s opposition to this bill is yet another case where they are at best placing safety second fiddle, and at worst not prioritizing it meaningfully at all.
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all medium, then they’re in the “below medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.
Maybe I’m missing the relevant bits, but afaict their preparedness doc says that they won’t deploy a model if it passes the “medium” threshold, e.g.:
Only models with a post-mitigation score of “medium” or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further developing is set to “high,” though. I.e., they can further develop so long as models don’t hit the “critical” threshold.
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
But doesn’t this argument hold with the opposite conclusion, too? E.g. “abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds”.
But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.
This comment appears to respond to habryka, but doesn’t actually address what I took to be his two main points—that Anthropic was using NDAs to cover non-disparagement agreements, and that they were applying significant financial incentive to pressure employees into signing them.
We historically included standard non-disparagement agreements by default in severance agreements
Were these agreements subject to NDA? And were all departing employees asked to sign them, or just some? If the latter, what determined who was asked to sign?
Agreed. I’d be especially interested to hear this from people who have left Anthropic.
I do kind of share the sense that people mostly just want frisbee and tea, but I am still confused about it. Wasn’t religion a huge deal for people for most of history? I could see a world where they were mostly just going through the motions, but the amount of awe I feel going into European churches feels like some evidence against this. And it’s hard for me to imagine that people were kind of just mindlessly sitting there through e.g., Gregorian chanting in candlelight, but maybe I am typical minding too hard. It really seems like these rituals, the architecture, all of it, was built to instill the sort of existential intensity that taking God seriously requires, and I have to imagine that this was at least somewhat real for most people?
And I do wonder whether the shift towards frisbee and tea has more to do with a lack of options as compelling as cathedrals (on this axis at least) than with people’s lack of wanting it? Like, I don’t think I would get as much out of cathedrals as I expect some of these people did, because I’m not religious, but if something of that intensity existed which fit with my belief system I feel like I’d be so into it.
Right now it seems like the entire community is jumping to conclusions based on a couple of “impressions” people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that’s on you.
Like Robert, I based my impressions on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
I agree with you that most people are not aiming for as much stringency with their commitments as rationalists expect. Separately, I do think that what Anthropic did would constitute a betrayal, even in everyday culture. And in any case, I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to. Otherwise, as kave noted, how are we supposed to trust your other “commitments”? Your RSP? If all you can offer is a vague “we’ll figure it out when we get there,” then any ambiguous statement should be interpreted as a vibe, rather than a real plan. And in the absence of unambiguous statements, which all the labs have failed to provide, this is looking very much like “trust us, we’ll do the right thing.” Which, to my mind, is nowhere close to the assurances society ought to be provided given the stakes.
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don’t blame them for not doing so.
This reasoning seems to imply that Anthropic should only be obliged to convey information when the environment is sufficiently welcoming to them. But Anthropic is creating a technology which might extinct humanity—they have an obligation to share their reasoning regardless of what society thinks. In fact, if people are upset by their actions, there is more reason, not less, to set the record straight. Public scrutiny of companies, especially when their choices affect everyone, is a sign of healthy discourse.
The implicit bid for people not to discourage them—because that would make it less likely for a company to be forthright—seems incredibly backwards, because then the public is unable to mention when they feel Anthropic has made a mistake. And if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.
So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do.
I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this.
I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that you might have received an anonymous Signal message to that effect if it were true, I still feel alarmed that many of these people haven’t publicly stated otherwise.
I also have a model of how people choose whether or not to make public statements where it’s extremely unsurprising most people would not choose to do so.
I do find this surprising. Many people are aware of who former OpenAI employees are, and hence are aware of who was (or is) bound by this agreement. At the very least, if I were in this position, I would want people to know that I was no longer bound. And it does seem strange to me, if the contract has been widely retracted, that so few prominent people have confirmed being released.
It also seems pretty important to figure out who is under mutual non-disparagement agreements with OpenAI, which would still (imo) pose a problem if it applied to anyone in safety evaluations or policy positions.
I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause
I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state that, it wouldn’t ameliorate my concerns—otherwise how is anyone supposed to trust they will?
The clause is also open to more avenues of legal attack if it’s enforced against someone who takes another position which requires disparagement (e.g. if it’s argued to be a restriction on engaging in business).
That seems plausible, but even if this does increase the likelihood that they’d win a legal battle, legal battles still pose huge risk and cost. This still seems like a meaningful deterrent.
I don’t think it’s fair to view this as a serious breach of trust on behalf of any individual, without clear evidence that it impacted their decisions or communication.
But how could we even get this evidence? If they’re bound to the agreement their actions just look like an absence of saying disparaging things about OpenAI, or of otherwise damaging their finances or reputation. And it’s hard to tell, from the outside, whether this is a reflection of an obligation, or of a genuine stance. Positions of public responsibility require public trust, and the public doesn’t have access to the inner workings of these people’s minds. So I think it’s reasonable, upon finding out that someone has a huge and previously-undisclosed conflict of interest, to assume that might be influencing their behavior.
A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already have, you’re picking up clues about what the shape of the answer is going to look like (i.e., constraints). Once you have these constraints in place, your mind is searching for something which satisfies all of them (both explicitly and implicitly), and insight is the thing that happens when you find a solution that does.
At least, this is what it feels like for me when I play Baba is You (i.e., when I have the experience you’re describing here). I always know when a fake solution is fake, because it’s really easy to tell that it violates one of the explicit constraints the game has set out (although sometimes in desperation I try it anyway :p). But it’s immediately clear when I’ve landed on the right solution (even before I execute it), because all of the constraints I’ve been holding in my head get satisfied at once. I think that’s the “clicking” feeling.
Darwin’s insight about natural selection was also shaped by constraints. His time on the Beagle had led him to believe that “species gradually become modified,” but he was pretty puzzled as to how the changes were being introduced. If you imagine a beige lizard that lives in the sand, for instance, it seems pretty clear that it isn’t the lizard itself (its will) which causes its beigeness, nor is it the sand that directly causes the coloring (as in, physically causes it within the lizard’s lifetime). But then, how are changes introduced, if not by the organism, and not by the environment directly? He was stuck on this for a while, until: “I can remember the very spot in the road, whilst in my carriage, when to my joy the solution occurred to me.”
There’s more to Darwin’s story than that, but I do think it has elements of the sort of thing you’re describing here. Jeff Hawkins also describes insight pretty explicitly as a constraint satisfaction problem (I might’ve gotten this idea from him), and he experienced it when coming up with the idea of a thousand brains.
Anyway, I don’t have a strong sense of how crucial this sort of thing is to novel conceptual inquiry in general, but I do think it’s quite interesting. It seems like one of the ways that someone can go from a pre-paradigmatic grasping around for clues sort of thing to a fully formed solution.