MIRI (soon), MATS (former), Palisade (sometimes)
background in philosophy
addiction to language, nicotine
maker of geometry-inspired musical tools
I agree with this in the world where people are being epistemically rigorous/honest with themselves about their timelines and where there’s a real consensus view on them. I’ve observed that people pretty rarely make decisions truly grounded in their timelines (or do so only nominally), and I think there’s a lot of social signaling going on when people, especially younger people, state their timelines.
I appreciate that more experienced people are willing to give advice within a particular frame (“if timelines were x”, “if China did y”, “if Anthropic did z”, “If I went back to school”, etc etc), even if they don’t agree with the frame itself. I rely on more experienced people in my life to offer advice of this form (“I’m not sure I agree with your destination, but admit there’s uncertainty, and love and respect you enough to advise you on your path”).
Of course they should voice their disagreement with the frame (and I agree this should happen more for timelines in particular), but to gate direct counsel on urgent, object-level decisions behind the resolution of background disagreements is broadly unhelpful.
When someone says “My timelines are x, what should I do?”, I actually hear like three claims:
Timelines are x
I believe timelines are x
I am interested in behaving as though timelines are x
Evaluation of the first claim is complicated, and other people do a better job of it than I do, so let’s focus on the others.
“I believe timelines are x” is a pretty easy roll to disbelieve. Under relatively rigorous questioning, nearly everyone (particularly everyone ‘career-advice-seeking age’) will either say they are deferring (meaning they could just as easily defer to someone else tomorrow), or admit that it’s a gut feel, especially for their ~90 percent year, and especially for more and more capable systems (this is more true of ASI than weak AGI, for instance, although those terms are underspecified). Still others will furnish zero reasoning transparency and thus reveal their motivations to be principally social (possibly a problem unique to the Bay, although online e/acc culture has a similar Thing).
“I am interested in behaving as though timelines are x” is an even easier roll to disbelieve. Very few people act on their convictions in sweeping, life-changing ways without concomitant benefits (money, status, power, community), including people within AIS (sorry friends).
With these uncertainties, piled on top of the usual uncertainties surrounding timelines, I’m not sure I’d want anyone to act so nobly as to refuse advice to someone with different timelines.
If Alice is a senior AIS professional who gives advice to undergrads at parties in Berkeley (bless her!), how would her behavior change under your recommendation? It sounds like maybe she would stop fostering a diverse garden of AIS saplings and instead become the awful meme of someone who just wants to fight about a highly speculative topic. Seems like a significant value loss.
Their timelines will change some other day; everyone’s will. In the meantime, being equipped to talk to people with a wide range of safety-concerned views (especially for more senior, or just Older people), seems useful.
harder to converge
Converge for what purpose? It feels like the marketplace of ideas is doing an ok job of fostering a broad portfolio of perspectives. If anything, we are too convergent and, as a consequence, somewhat myopic internally. Leopold mind-wormed a bunch of people until Tegmark spoke up (and that only somewhat helped). Few thought governance was a good idea until pretty recently (~3 years ago), and it would be going better now if those interested in the angle hadn’t been shouted down so emphatically to begin with.
If individual actors need to cross some confidence threshold in order to act, but the reasonable confidence interval is in fact very wide, I’d rather have a bunch of actors with different timelines, which roughly sum to the shape of the reasonable thing*, than have everyone working on the same overconfident assumption that later comes back to bite us (when we’ve made mistakes in the past, this is often why).
*Which is, by the way, closer to flat than most people’s individual timelines
I don’t think I really understood what it meant for establishment politics to be divisive until this past election.
As good as it feels to sit on the left and say “they want you to hate immigrants” or “they want you to hate queer people”, it seems similarly (although probably not equally?) true that the center left also has people they want you to hate (the religious, the rich, the slightly-more-successful-than-you, the ideologically-impure-who-once-said-a-bad-thing-on-the-internet).
But there’s also a deeper, structural sense in which it’s true.
Working on AIS, I’ve long hoped that we could form a coalition with all of the other people worried about AI, because a good deal of them just… share (some version of) our concerns, and our most ambitious policy solutions (e.g. stopping development, mandating more robust interpretability and evals) could also solve a bunch of problems highlighted by the FATE community, the automation-concerned, etc etc.
Their positions also have the benefit of conforming to widely-held anxieties (‘I am worried AI will just be another tool of empire’, ‘I am worried I will lose my job for banal normie reasons that have nothing to do with civilizational robustness’, ‘I am worried AIs will cheaply replace human labor and do a worse job, enshittifying everything in the developed world’). We could generally curry popular support and favor, without being dishonest, by looking at the Venn diagram of things we want and things they want (which would also help keep AI policy from sliding into partisanship, if such a thing is still possible, given the largely right-leaning associations of the AIS community*).
For the next four years, at the very least, I am forced to lay this hope aside. That the EO contained language in service of the FATE community was, in hindsight, very bad, and probably foreseeably so, given that even moderate Republicans like to score easy points on culture war bullshit. Probably it will be revoked, because language about bias made it an easy thing for Vance to call “far left”.
“This is ok because it will just be replaced.”
Given the current state of the game board, I don’t want to be losing any turns. We’ve already lost too many turns; setbacks are unacceptable.
“What if it gets replaced by something better?”
I envy your optimism. I’m also concerned about the same dynamic playing out in reverse; what if the new EO (or piece of legislation via whatever mechanism), like the old EO, contains some language that is (to us) beside the point, but nonetheless signals partisanship, and is retributively revoked or repealed by the next administration? This is why you don’t want AIS to be partisan; partisanship is dialectics without teleology.
Ok, so structurally divisive: establishment politics has made it ~impossible to form meaningful coalitions around issues other than absolute lightning rods (e.g. abortion, immigration; the ‘levers’ available to partisan hacks looking to gin up donations). It’s not just that they make you hate your neighbors, it’s that they make you behave as though you hate your neighbors, lest your policy proposals get painted with the broad red brush and summarily dismissed.
I think this is the kind of observation that leads many experienced people interested in AIS to work on things outside of AIS, but with an eye toward implications for AI (e.g. Critch, A Ray). You just have these lucid flashes of how stacked the deck really is, and set about digging the channel that is, compared to the existing channels, marginally more robust to reactionary dynamics (‘aligning the current of history with your aims’ is maybe a good image).
Hopefully undemocratic regulatory processes serve their function as a backdoor for the sensible, but it’s unclear how penetrating the partisanship will be over the next four years (and, of course, those at the top are promising that it will be Very Penetrating).
*I am somewhat ambivalent about how right-leaning AIS really is. Right-leaning compared to middle class Americans living in major metros? Probably. Tolerant of people with pretty far-right views? Sure, to a point. Right of the American center as defined in electoral politics (e.g. ‘Republican-voting’)? Usually not.
I think the key missing piece you’re pointing at (making sure that our interpretability tools etc actually tell us something alignment-relevant) is one of the big things going on in model organisms of misalignment (iirc there’s a step that’s like ‘ok, but if we do interpretability/control/etc at the model organism does that help?’). Ideally this type of work, or something close to it, could become more common // provide ‘evals for our evals’ // expand in scope and application beyond deep deception.
If that happened, it seems like it would fit the bill here.
Does that seem true to you?
I like this post, but I think Redwood has varied some on whether control is for getting alignment work out of AIs vs. getting generally good-for-humanity work out of them and pushing for a pause once they reach some usefulness/danger threshold (e.g. well before superintelligence).
[based on my recollection of Buck seminar in MATS 6]
Makes sense. Pretty sure you can remove it (and would appreciate that).
Many MATS scholars go to Anthropic (source: I work there).
Redwood I’m really not sure, but that could be right.
Sam now works at Anthropic.
Palisade: I’ve done some work for them, I love them, I don’t know that their projects so far inhibit Anthropic (BadLlama, which I’m decently confident was part of the cause for funding them, was pretty squarely targeted at Meta, and is their most impactful work to date by several OOM). In fact, the softer versions of Palisade’s proposal (highlighting misuse risk, their core mission), likely empower Anthropic as seemingly the most transparent lab re misuse risks.
I take the thrust of your comment to be “OP funds safety, do your research”. I work in safety; I know they fund safety.
I also know most safety projects differentially benefit Anthropic (this fact is independent of whether you think differentially benefiting Anthropic is good or bad).
If you can make a stronger case for any other of the dozens of orgs on your list than exists for the few above, I’d love to hear it. I’ve thought about most of them and don’t see it, which is why I asked the question.
Further: the goalpost is not ‘net positive with respect to TAI x-risk.’ It is ‘not plausibly a component of a meta-strategy targeting the development of TAI at Anthropic before other labs.’
Edit: use of the soldier mindset flag above is pretty uncharitable here; I am asking for counter-examples to a hypothesis I’m entertaining. This is the actual opposite of soldier mindset.
updated, thanks!
The CCRU is under-discussed in this sphere as a direct influence on the thoughts and actions of key players in AI and beyond.
Land started a creative collective, alongside Mark Fisher, in the 90s. I learned this by accident, and it seems like a corner of intellectual history that’s at least as influential as, e.g., the Extropians.
If anyone knows of explicit connections between the CCRU and contemporary phenomena (beyond Land/Fisher’s immediate influence via their later work), I’d love to hear about them.
Does anyone have examples of concrete actions taken by Open Phil that point toward their AIS plan being anything other than ‘help Anthropic win the race’?
I think a non-zero number of those disagree votes would not have appeared if the same comment were made by someone other than an Anthropic employee, based on seeing how Zac is sometimes treated IRL. My comment is aimed most directly at the people who cast those particular disagree votes.
I agree with your comment to Ryan above that those who identified “Anthropic already does most of these” as “the central part of the comment” were using the disagree button as intended.
The threshold for hitting the button will be different in different situations; I think the threshold many applied here was somewhat low, and a brief look at Zac’s comment history, to me, further suggests this.
I want to double down on this:
Zac is consistently generous with his time, even when dealing with people who are openly hostile toward him. Of all lab employees, Zac is among the most available for—and eager to engage in—dialogue. He has furnished me personally with >2 dozen hours of extremely informative conversation, even though our views differ significantly (and he has ~no instrumental reason for talking to me in particular, since I am but a humble moisture farmer). I’ve watched him do the same with countless others at various events.
I’ve also watched people yell at him more than once. He kinda shrugged, reframed the topic, and politely continued engaging with the person yelling at him. He has leagues more patience and decorum than is common among the general population. Moreover, in our quarrelsome pocket dimension, he’s part of a mere handful of people with these traits.
I understand distrust of labs (and feel it myself!), but let’s not kill the messenger, lest we run out of messengers.
This may be an example, but I don’t think it’s an especially central one, for a few reasons:
1. The linked essay discusses, quite narrowly, the act of making predictions about artificial intelligence/the Actual Future based on the contents of science fiction stories that make (more-or-less) concrete predictions on those topics, thus smuggling in a series of warrants that poison the reasoning process from that point onward. This post, by contrast, is about feelings.
2. The process for reasoning about one’s, say, existential disposition is independent of the process for reasoning about the technical details of AI doom. The respective solution-spaces for the questions “How do I deal with this present-tense emotional experience?” and “How do I deal with this future-tense socio-technical possibility?” are quite different. While they may feed into each other (in the case, for instance, of someone who’s decided they must self-soothe and level out before addressing the technical problem that’s staring them down or, conversely, someone who’s decided the most effective anxiety treatment is direct material action regarding the object of anxiety), they’re otherwise quite independent. It’s useful to use a somewhat different (part of your)self to read the Star Wars Extended Universe than you would use to read, e.g., recent AI safety papers.
3. One principal use of fiction is to open a window into aspects of experience that the reader might not otherwise access. Most directly, fiction can help you empathize with people who are very different from you, or help you come to grips with the fact that other people in fact exist at all. It can also show you things you might not otherwise see, and impart tools for seeing in new and exciting ways. I think reading The Logical Fallacy of Generalization from Fictional Evidence as totally invalidating insights from fiction is a mistake, particularly because the work itself closes with a quote from a work of fiction (which I take as pretty strong evidence the author would not endorse using the work in this way). If you don’t think your implied reading of Yudkowsky here would actually preclude deriving any insight whatsoever from fiction, I’d like to hear what insights from fiction it would permit, since it seems to me like Ray’s committing the most innocent class of this sin, were it a sin. It’s possible you just don’t think fiction is useful at all, and in that case I just wouldn’t try to convince you further.
4. I read Ray’s inclusion of the story as immaterial to his point (this essay is, not-so-secretly, about his own emotional development, with some speculation about the broader utility for others in the community undergoing similar processes). It’s common practice in personal essay writing to open with a bit of fiction, or a poem, or something else that illustrates a point before getting more into the meat of it. Ray happens to have a cute/nerdy memory from his childhood that he connects to a class of thinking that in fact has a rich tradition (or, multiple rich traditions, with parallel schools and approaches in ~every major religious lineage).
[there’s a joke here, too, and I hope you’ll read my tone generously, because I do mean it lightheartedly, about “The Logical Fallacy of Generalization from The Logical Fallacy of Generalization from Fictional Evidence”]
Sometimes people give a short description of their work. Sometimes they give a long one.
I have an imaginary friend whose work I’m excited about. I recently overheard him introduce and motivate his work to a crowd of young safety researchers, and I took notes. Here’s my best reconstruction of what he’s up to:
“I work on median-case out-with-a-whimper scenarios and automation forecasting, with special attention to the possibility of mass-disempowerment due to wealth disparity and/or centralization of labor power. I identify existing legal and technological bottlenecks to this hypothetical automation wave, including a list of relevant laws in industries likely to be affected and a suite of evals designed to detect exactly which kinds of tasks are likely to be automated and when.
“My guess is that there are economically valuable AI systems between us and AGI/TAI/ASI, and that executing on safety and alignment plans in the midst of a rapid automation wave is dizzyingly challenging. Thinking through those waves in advance feels like a natural extension of placing any weight at all on the structure of the organization that happens to develop the first Real Scary AI. If we think that the organizational structure and local incentives of a scaling lab matter, shouldn’t we also think that the societal conditions and broader class of incentives matter? Might they matter more? The state of the world just before The Thing comes on line, or as The Team that makes The Thing is considering making The Thing, has consequences for the nature of the socio-technical solutions that work in context.
“At minimum, my work aims to buy us some time and orienting-power as the stakes rise. I’d imagine my maximal impact is something like ‘develop automation timelines and rollout plans that you can peg AI development to, such that the state of the world and the state-of-the-art AI technology advance in step, minimizing the collateral damage and chaos of any great economic shift.’
“When I’ve brought these concerns up to folks at labs, they’ve said that these matters get discussed internally, and that there’s at least some agreement that my direction is important, but that they can’t possibly be expected to do everything to make the world ready for their tech. I, perhaps somewhat cynically, think they’re doing narrow work here on the most economically valuable parts, but that they’re disinterested in broader coordination with public and private entities, since it would be economically disadvantageous to them.
“When I’ve brought these concerns up to folks in policy, they’ve said that some work like this is happening, but that it’s typically done in secret, to avoid amorphous negative externalities. Indeed, the more important this work is, the less likely someone is to publish it. There’s some concern that a robust and publicly available framework of this type could become a roadmap for scaling labs that helps them focus their efforts for near-term investor returns, possibly creating more fluid investment feedback loops and lowering the odds that disappointed investors back out, indirectly accelerating progress.
“Publicly available work on the topic is ~abysmal, painting the best case scenario as the most economically explosive one (most work of this type is written for investors and other powerful people), rather than pricing in the heightened x-risk embedded in this kind of destabilization. There’s actually an IMF paper modeling automation from AI systems using math from the industrial revolution. Surely there’s a better way here, and I hope to find it.”
I (and maybe you) have historically underrated the density of people with religious backgrounds in secular hubs. Most of these people don’t ‘think differently’, in a structural sense, from their forebears; they just don’t believe in that God anymore.
The hallmark here is a kind of naive enlightenment approach that ignores ~200 years of intellectual history (and a great many thinkers from before that period, including canonical philosophers they might claim to love/respect/understand). This type of thing.
They’re no less tribal or dogmatic, and no more critical, than the place they came from. They just vote the other way and can maybe talk about one or two levels of abstraction beyond the stereotype they identify against (although they can’t really think about those levels).
You should still be nice to them, and honest with them, but you should understand what you’re getting into.
The mere biographical detail of having a religious background or being religious isn’t a strong mark against someone’s thinking on other topics, but it is a sign you may be talking to a member of a certain meta-intellectual culture, and need to modulate your style. I have definitely had valuable conversations with people who firmly belong in this category, and would not categorically discourage engagement. Just don’t be so surprised when the usual jutsu falls flat!