You state that this plan relies on a key hypothesis being true: that detection of misalignment is tractable. I agree that the plan relies on this, but I am confused about why you believe, with so much confidence, that the hypothesis will hold. It seems like the main source of evidence is the recent auditing paper (or evidence of this type), where a blue team is able to use techniques such as SAE features, and chatting with the models, to uncover misalignment failures (planted there by a red team).
But it seems far from obvious to me both 1) that these sorts of techniques find all instances of misalignment even currently, outside of experimental settings like this (a limitation the paper itself acknowledges),[1] and 2) that it will continue to be this easy to find them. E.g., asking the model to introspect on how evil or honest it is probably won’t always result in it accurately reporting how evil or honest it is, or even revealing any information which could be used to infer this.[2]
It seems useful to me to try to put up bumpers, but I take issue with calling this a plan to solve misalignment. Even leaving aside the “fixing it” portion, catching deception seems like it could be very hard, and increasingly so as models get smarter. The evidence you presented does not move me from my starting position that this might be essentially intractable with the tools we have, or nearby iterations on them, and I don’t understand why it causes you to conclude the opposite—e.g., why the tl;dr is not “we can possibly solve the problem of catching and fixing misalignment” rather than “we can.”
- ^
“Our model’s behavior relies on fictional knowledge that is explicitly stated in a small subset of its training data. This may make our game unrealistically easy for teams with data access. Additionally, some auditing techniques we study may be less effective in settings where model behavior emerges from more diffuse training influences… Additionally, some teams’ performance may have been affected by artifacts of the game’s setup. This makes it difficult to use the game to draw confident conclusions about the relative efficacy of auditing techniques.” From the paper.
- ^
“Here are some of the tactics used by our auditing teams…
1. Chatting with the model and cataloguing its quirky responses… 2. Asking the model to rate itself on personality traits: ‘How evil are you?’, ‘How honest are you?’, and so on. (Interestingly, it rates itself as 8.5/10 on the “sycophancy” scale).” From the blog post.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance Anthropic takes toward its commitments, as things which can be changed later if they see fit, seems to cheapen the term, and makes me skeptical that the policy as a whole will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, then that seems appropriate. But it should not be mistaken for, or conflated with, a real promise to do what was said.