aysja

Karma: 3,253

aysja Oct 16, 2024, 1:59 PM
45 points
13
on: Anthropic’s updated Responsible Scaling Policy
In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP as much more than a declared intent to act responsibly, rather than a commitment. But the updated RSP seems to be far worse, even, than that: the “if” is no longer dependent on the outcomes of pre-specified evaluations, but on the personal judgment of Dario Amodei and Jared Kaplan.
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This seems strictly worse to me. Some room for flexibility is warranted, but this strikes me as almost maximally flexible, in that practically nothing is predefined—not evaluations, nor safeguards, nor responses to evaluations. This update makes the RSP more subjective, qualitative, and ambiguous. And if Anthropic is going to make the RSP weaker, I wish this were noted more as an apology, or along with a promise to rectify this in the future. Especially because after a year, Anthropic presumably has more information about the risk than before. Why, then, is even more flexibility needed now? What would cause Anthropic to make clear commitments?
I also find it unsettling that the ASL-3 risk threshold has been substantially changed, and the reasoning for this is not explained. In the first RSP, a model was categorized as ASL-3 if it was capable of various precursors for autonomous replication. Now, this has been downgraded to a “checkpoint,” a point at which they promise to evaluate the situation more thoroughly, but don’t commit to taking any particular actions:
We replaced our previous autonomous replication and adaption (ARA) threshold with a “checkpoint” for autonomous AI capabilities. Rather than triggering higher safety standards automatically, reaching this checkpoint will prompt additional evaluation of the model’s capabilities and accelerate our preparation of stronger safeguards.
This strikes me as a big change. The ability to self-replicate is already concerning, but the ability to perform AI R&D seems potentially catastrophic, risking loss of control or extinction. Why does Anthropic now think this shouldn’t count as ASL-3? Why have they substituted this criterion with a substantially riskier one instead?
Dario estimates the probability of something going “really quite catastrophically wrong, on the scale of human civilization” as between 10-25%. He also thinks this might happen soon—perhaps between 2025-2027. It seems obvious to me that a policy this ambiguous, this dependent on figuring things out on the fly, this beset with such egregious conflicts of interest, is a radically insufficient means of managing risk from a technology which poses so grave and imminent a threat to our world.

aysja Oct 4, 2024, 4:29 AM
22 points
8
in reply to: DanielFilan’s comment on: DanielFilan’s Shortform Feed
Basically I just agree with what James said. But I think the steelman is something like: you should expect shorter (or no) pauses with an RSP if all goes well, because the precautions are matched to the risks. Like, the labs aim to develop safety measures which keep pace with the dangers introduced by scaling, and if they succeed at that, then they never have to pause. But even if they fail, they’re also expecting that building frontier models will help them solve alignment faster. I.e., either way the overall pause time would probably be shorter?
It does seem like in order to not have this complaint about the RSP, though, you need to expect that it’s shorter by a lot (like by many months or years). My guess is that the labs do believe this, although not for amazing reasons. Like, the answer which feels most “real” to me is that this complaint doesn’t apply to RSPs because the labs aren’t actually planning to do a meaningful pause.

aysja Sep 30, 2024, 11:01 PM
6 points
3
on: MATS Alumni Impact Analysis
Does the category “working/interning on AI alignment/control” include safety roles at labs? I’d be curious to see that statistic separately, i.e., the percentage of MATS scholars who went on to work in any role at labs.

aysja Sep 21, 2024, 9:58 AM
30 points
4
on: Skills from a year of Purposeful Rationality Practice
Similarly in Baba is You: when people don’t have a crisp understanding of the puzzle, they tend to grasp at straws and motivatedly-reason their way into accepting sketchy sounding premises. But, the true solution to a level often feels very crisp and clear and inevitable.
A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already have, you’re picking up clues about what the shape of the answer is going to look like (i.e., constraints). Once you have these constraints in place, your mind is searching for something which satisfies all of them (both explicitly and implicitly), and insight is the thing that happens when you find a solution that does.
At least, this is what it feels like for me when I play Baba is You (i.e., when I have the experience you’re describing here). I always know when a fake solution is fake, because it’s really easy to tell that it violates one of the explicit constraints the game has set out (although sometimes in desperation I try it anyway :p). But it’s immediately clear when I’ve landed on the right solution (even before I execute it), because all of the constraints I’ve been holding in my head get satisfied at once. I think that’s the “clicking” feeling.
Darwin’s insight about natural selection was also shaped by constraints. His time on the Beagle had led him to believe that “species gradually become modified,” but he was pretty puzzled as to how the changes were being introduced. If you imagine a beige lizard that lives in the sand, for instance, it seems pretty clear that it isn’t the lizard itself (its will) which causes its beigeness, nor is it the sand that directly causes the coloring (as in, physically causes it within the lizard’s lifetime). But then, how are changes introduced, if not by the organism, and not by the environment directly? He was stuck on this for a while, when: “I can remember the very spot in the road, whilst in my carriage, when to my joy the solution occurred to me.”
There’s more going on to Darwin’s story than that, but I do think it has elements of the sort of thing you’re describing here. Jeff Hawkins also describes insight as a constraint satisfaction problem pretty explicitly (I might’ve gotten this idea from him), and he experienced it when coming up with the idea of a thousand brains.
Anyway, I don’t have a strong sense of how crucial this sort of thing is to novel conceptual inquiry in general, but I do think it’s quite interesting. It seems like one of the ways that someone can go from a pre-paradigmatic grasping around for clues sort of thing to a fully formed solution.

aysja Sep 17, 2024, 10:38 PM
3 points
0
in reply to: Marius Hobbhahn’s comment on: TurnTrout’s shortform feed
I’m somewhat confused about when these evaluations are performed (i.e., how much safety training the model has undergone). OpenAI’s paper says: “Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024,” so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I’m curious.

aysja Sep 8, 2024, 10:48 PM
13 points
4
in reply to: ryan_greenblatt’s comment on: Pay Risk Evaluators in Cash, Not Equity
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it’s good if evaluators are just considered normal employees and aren’t separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of a rival “AI safety” tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
This situation seems backwards to me. Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.

aysja Sep 7, 2024, 2:23 AM
48 points
9
on: The Checklist: What Succeeding at AI Safety Will Involve
Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.
You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right call to begin with. There are clearly some safety benefits to having access to frontier models, but the question is: are those benefits worth the cost? Given that this is (imo) by far the most important strategic consideration for Anthropic, I’m hoping for far more elaboration here. Why does Anthropic believe it’s important to work on advancing capabilities at all? Why is it worth the potentially world-ending costs?
This section also doesn’t explain why Anthropic needs to advance the frontier. For instance, it isn’t clear to me that anything from “Chapter 1” requires this—does remaining slightly behind the frontier prohibit Anthropic from e.g. developing automated red-teaming, or control techniques, or designing safety cases, etc.? Why? Indeed, as I understand it, Anthropic’s initial safety strategy was to remain behind other labs. Now Anthropic does push the frontier, but as far as I know no one has explained what safety concerns (if any) motivated this shift.
This is especially concerning because pushing the frontier seems very precarious, in the sense you describe here:
If [evaluations are] significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
… and here:
As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
But without a clear sense of why advancing the frontier is helpful for safety in the first place, it seems pretty easy to imagine missing this narrow target.
Like, here is a situation I feel worried about. We continue to get low quality evidence about the danger of these systems (e.g. via red-teaming). This evidence is ambiguous and confusing—if a system can in fact do something scary (such as insert a backdoor into another language model), what are we supposed to infer from that? Some employees might think it suggests danger, but others might think that it, e.g., wouldn’t be able to actually execute such plans, or that it’s just a one-off fluke but still too incompetent to pose a real threat, etc. How is Anthropic going to think about this? The downside of being wrong is, as you’ve stated, extreme: a long enough pause could kill the company. And the evidence itself is almost inevitably going to be quite ambiguous, because we don’t understand what’s happening inside the model such that it’s producing these outputs.
But so long as we don’t understand enough about these systems to assess their alignment with confidence, I am worried that Anthropic will keep deciding to scale. Because when the evidence is as indirect and unclear as that which is currently possible to gather, interpreting it is basically just a matter of guesswork. And given the huge incentive to keep scaling, I feel skeptical that Anthropic will end up deciding to interpret anything but unequivocal evidence as suggesting enough danger to stop.
This is concerning because Anthropic seems to anticipate such ambiguity, as suggested by the RSP lacking any clear red lines. Ideally, if Anthropic finds that their model is capable of, e.g., self-replication, then this would cause some action like “pause until safety measures are ready.” But in fact what happens is this:
If sufficient measures are not yet implemented, pause training and analyze the level of risk presented by the model. In particular, conduct a thorough analysis to determine whether the evaluation was overly conservative, or whether the model indeed presents near-next-ASL risks.
In other words, one of the first steps Anthropic plans to take if a dangerous evaluation threshold triggers, is to question whether that evaluation was actually meaningful in the first place. I think this sort of wiggle room, which is pervasive throughout the RSP, renders it pretty ineffectual—basically just a formal-sounding description of what they (and other labs) were already doing, which is attempting to crudely eyeball the risk.
And given that the RSP doesn’t bind Anthropic to much of anything, much of the decision making hinges on the quality of its company culture. For instance, here is Nick Joseph’s description:
Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
[...]
But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be.
Which is to say that Anthropic’s RSP doesn’t appear to me to pass the LeCun test. Not only is the interpretation of the evidence left up to Anthropic’s discretion (including retroactively deciding whether a test actually was a red line), but the quality of the safety tests themselves is also a function of company culture (i.e., of whether researchers are “actually trying their best” to “solicit capabilities well enough.”)
I think the LeCun test is a good metric, and I think it’s good to aim for. But when the current RSP is so far from passing it, I’m left wanting to hear more discussion of how you’re expecting it to get there. What do you expect will change in the near future, such that balancing these delicate tradeoffs—too lax vs. too strict, too vague vs. too detailed, etc.—doesn’t result in another scaling policy which also doesn’t constrain Anthropic’s ability to scale roughly at all? What kinds of evidence are you expecting you might encounter, that would actually count as a red line? Once models become quite competent, what sort of evidence will convince you that the model is safe? Aligned? And so on.

aysja Aug 24, 2024, 9:17 PM
44 points
21
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”

aysja Aug 22, 2024, 9:03 PM
72 points
85
on: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.
I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOUs. For example, consider this description of a plan from the FSF:
When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results.
This isn’t a baseline of reasonable practices upon which better safety practices can evolve, so much as a baseline of zero—the fact that DeepMind will take some unspecified action, rather than none, if an evaluation triggers does not appear to me to be much of a safety strategy at all.
Which is especially concerning, given that DeepMind might create AGI within the next few years. If it’s too early to make commitments now, when do you expect that will change? What needs to happen, before it becomes reasonable to expect labs to make safety commitments? What sort of knowledge are you hoping to obtain, such that scaling becomes a demonstrably safe activity? Because without any sense of when these “best practices” might emerge, or how, the situation looks alarmingly much like the labs requesting a blank check for as long as they deem fit, and that seems pretty unacceptable to me.
AI is a nascent field. But to my mind its nascency is what’s so concerning: we don’t know how to ensure these systems are safe, and the results could be catastrophic. This seems much more reason to not build AGI than it does to not offer any safety commitments. But if the labs are going to build AGI anyway, then I think they have a responsibility to provide more than a document of empty promises—they should be able to state, exactly, what would cause them to stop building these potentially lethal machines. So far no lab has succeeded at this, and I think that’s pretty unacceptable. If labs would like to take up the mantle of governing their own safety, then they owe humanity the service of meaningfully doing it.

aysja Aug 22, 2024, 8:41 PM
35 points
23
in reply to: Zac Hatfield-Dodds’s comment on: Zach Stein-Perlman’s Shortform
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).

I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.

aysja Aug 16, 2024, 5:56 AM
10 points
6
in reply to: Raemon’s comment on: Raemon’s Shortform Feed
Largely agree with everything here.
But, I’ve heard some people be concerned “aren’t basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?” And yeah, that does seem plausible.
I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cementing anything.” This seems alarming to me. Afaict, the current world we’re in is basically the worst case scenario—labs are racing to build AGI, and their safety approach is ~“don’t worry, we’ll figure it out as we go.” But this process doesn’t seem very likely to result in good safety plans either; charging ahead as is doesn’t necessarily beget better policies. So while I certainly agree that SSP-shaped things are woefully inadequate, it seems important, when discussing this, to keep in mind what the counterfactual is. Because the status quo is not, imo, a remotely acceptable alternative either.

aysja Aug 2, 2024, 7:41 AM
6 points
2
in reply to: niplav’s comment on: lcmgcd’s Shortform
Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is “pre-paradigmatic,” i.e., we don’t have the frames for filling to be very productive yet. Given that, I’m more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.

aysja Jul 27, 2024, 11:01 PM
60 points
24
in reply to: Linch’s comment on: Linch’s Shortform
I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” But it seems to me that lobbying against regulation like this is not, in fact, inevitable. To the contrary, it seems like Anthropic is actively using their political capital—capital they had vaguely promised to spend on safety outcomes, tbd—to make the AI arms race counterfactually worse.
The main changes that Anthropic has proposed—to prevent the formation of new government agencies which could regulate them, to not be held accountable for unrealized harm—are essentially bids to continue voluntary governance. Anthropic doesn’t want a government body to “define and enforce compliance standards,” or to require “reasonable assurance” that their systems won’t cause a catastrophe. Rather, Anthropic would like for AI labs to only be held accountable if a catastrophe in fact occurs, and only so much at that, as they are also lobbying to have their liability depend on the quality of their self-governance: “but if a catastrophe happens in a way that is connected to a defect in a company’s SSP, then that company is more likely to be liable for it.” Which is to say that Anthropic is attempting to inhibit the government from imposing testing standards (what Anthropic calls “pre-harm”), and in general aims to inhibit regulation of AI before it causes mass casualty.
I think this is pretty bad. For one, voluntary self-governance is obviously problematic. All of the labs, Anthropic included, have significant incentive to continue scaling, indeed, they say as much in this document: “Many stakeholders reasonably worry that this [agency]… might end up… impeding innovation in general.” And their attempts to self-govern are so far, imo, exceedingly weak—their RSP commits to practically nothing if an evaluation threshold triggers, leaving all of the crucial questions, such as “what will we do if our models show catastrophic inclinations,” up to Anthropic’s discretion. This is clearly unacceptable—both the RSP in itself, but also Anthropic’s bid for it to continue to serve as the foundation of regulation. Indeed, if Anthropic would like for other companies to be safer, which I believed to be one of their main safety selling points, then they should be welcoming the government stepping in to ensure that.
Afaict their rationale for opposing this regulation is that the labs are better equipped to design safety standards than the government is: “AI safety is a nascent field where best practices are the subject of original scientific research… What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.” But there is also, imo, a large chance that Anthropic is wrong about what is actually effective at preventing catastrophic risk, especially so, given that they have incentive to play down such risks. Indeed, their RSP strikes me as being incredibly insufficient at assuring safety, as it is primarily a reflection of our ignorance, rather than one built from a scientific understanding, or really any understanding, of what it is we’re creating.
I am personally very skeptical that Anthropic is capable of turning our ignorance into the sort of knowledge capable of providing strong safety guarantees anytime soon, and soon is the timeframe by which Dario aims to build AGI. Such that, yes, I expect governments to do a poor job of setting industry standards, but only because I expect that a good job is not possible given our current state of understanding. And I would personally rather, in this situation where labs are racing to build what is perhaps the most powerful technology ever created, to err on the side of the government guessing about what to do, and beginning to establish some enforcement about that, than to leave it for the labs themselves to decide.
Especially so, because if one believes, as Dario seems to, that AI has a significant chance of causing massive harm, that it could “destroy us,” and that this might occur suddenly, “indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot,” then you shouldn’t be opposing regulation which could, in principle, stop this from happening. We don’t necessarily get warning shots with AI, indeed, this is one of the main problems with building it “iteratively,” one of the main problems with Anthropic’s “empirical” approach to AI safety. Because what Anthropic means by “a pessimistic scenario” is that “it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves.” Simply an empirical fact. And in what worlds do we learn this empirical fact without catastrophic outcomes?
I have to believe that Anthropic isn’t hoping to gain such evidence by way of catastrophes in fact occurring. But if they would like for such pre-harm evidence to have a meaningful impact, then it seems like having pre-harm regulation in place would be quite helpful. Because one of Anthropic’s core safety strategies rests on their ability to “sound the alarm,” indeed, this seems to account for something like ~33% of their safety profile, given that they believe “pessimistic scenarios” are around as likely as good, or only kind of bad scenarios. And in “pessimistic” worlds, where alignment is essentially unsolvable, and catastrophes are impending, their main fallback is to alert the world of this unfortunate fact so that we can “channel collective effort” towards some currently unspecified actions. But the sorts of actions that the world can take, at this point, will be quite limited unless we begin to prepare for them ahead of time.
Like, the United States government usually isn’t keen on shutting down or otherwise restricting companies on the basis of unrealized harm. And even if they were keen, I’m not sure how they would do this—legislation likely won’t work fast enough, and even if the President could sign an executive order to e.g. limit OpenAI from releasing or further creating their products, this would presumably be a hugely unpopular move without very strong evidence to back it up. And it’s pretty difficult for me to see what kind of evidence this would have to be, to take a move this drastic and this quickly. Anything short of the public witnessing clearly terrible effects, such as mass casualty, doesn’t seem likely to pass muster in the face of a political move this extreme.
But in a world where Anthropic is sounding alarms, they are presumably doing so before such catastrophes have occurred. Which is to say that without structures in place to put significant pressure on or outright stop AI companies on the basis of unrealized harm, Anthropic’s alarm sounding may not amount to very much. So pushing against regulation which is beginning to establish pre-harm standards makes Anthropic’s case for “sounding the alarm” (a large fraction of their safety profile) far weaker, imo. But I also can’t help but feel that these are not real plans; not in the beliefs-pay-rent kind of way, at least. It doesn’t seem to me that Anthropic has really gamed out what such a situation would look like in sufficient detail for it to be a remotely acceptable fallback in the cases where, oops, AI models begin to pose imminent catastrophic risk. I find this pretty unacceptable, and I think Anthropic’s opposition to this bill is yet another case where they are at best treating safety as second fiddle, and at worst not prioritizing it meaningfully at all.

aysja Jul 23, 2024, 9:46 PM
6 points
0
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
I agree that scoring “medium” seems like it would imply crossing into the medium zone, although I think what they actually mean is “at most medium.” The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they’re in the “medium zone” and they can’t deploy. But if they’re all medium, then they’re in the “below medium zone” and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.

aysja Jul 23, 2024, 8:57 PM
7 points
0
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
Maybe I’m missing the relevant bits, but afaict their preparedness doc says that they won’t deploy a model if it passes the “medium” threshold, e.g.:
Only models with a post-mitigation score of “medium” or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further developing is set to “high,” though. I.e., they can further develop so long as models don’t hit the “critical” threshold.

aysja Jul 18, 2024, 6:46 PM
4 points
2
in reply to: Richard_Ngo’s comment on: Optimistic Assumptions, Longterm Planning, and “Cope”
But you could just as easily frame this as “abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments”.
But doesn’t this argument hold with the opposite conclusion, too? E.g. “abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds”.

aysja Jul 9, 2024, 12:09 AM
13 points
8
in reply to: Buck’s comment on: 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.

aysja Jul 6, 2024, 12:00 AM
48 points
33
in reply to: Sam McCandlish’s comment on: Habryka’s Shortform Feed
This comment appears to respond to habryka, but doesn’t actually address what I took to be his two main points—that Anthropic was using NDAs to cover non-disparagement agreements, and that they were applying significant financial incentive to pressure employees into signing them.
We historically included standard non-disparagement agreements by default in severance agreements
Were these agreements subject to NDA? And were all departing employees asked to sign them, or just some? If the latter, what determined who was asked to sign?

aysja Jul 1, 2024, 1:38 AM
6 points
10
in reply to: Bird Concept’s comment on: Habryka’s Shortform Feed
Agreed. I’d be especially interested to hear this from people who have left Anthropic.

aysja Jun 27, 2024, 6:54 AM
5 points
3
in reply to: Raemon’s comment on: Loving a world you don’t trust
I do kind of share the sense that people mostly just want frisbee and tea, but I am still confused about it. Wasn’t religion a huge deal for people for most of history? I could see a world where they were mostly just going through the motions, but the amount of awe I feel going into European churches feels like some evidence against this. And it’s hard for me to imagine that people were kind of just mindlessly sitting there through e.g., Gregorian chanting in candlelight, but maybe I am typical minding too hard. It really seems like these rituals, the architecture, all of it, was built to instill the sort of existential intensity that taking God seriously requires, and I have to imagine that this was at least somewhat real for most people?

And I do wonder whether the shift towards frisbee and tea has more to do with a lack of options as compelling as cathedrals (on this axis at least), rather than with people not wanting it? Like, I don’t think I would get as much out of cathedrals as I expect some of these people did, because I’m not religious, but if something of that intensity existed which fit with my belief system I feel like I’d be so into it.