yams
I think the reason nobody will do anything useful-to-John as a result of the control critique post is that control is explicitly not aiming at the hard parts of the problem, and knows this about itself. In that way, control is an especially poorly selected target if the goal is getting people to do anything useful-to-John. I’d be interested in a similar post on the Alignment Faking paper (or model organisms more broadly), on RAT, on debate, on faithful CoT, on specific interpretability paradigms (circuits vs SAEs vs some coherentist approach vs shards vs...), and would expect those to have higher odds of someone doing something useful-to-John. But useful-to-John isn’t really the metric I think the field should be using, either....
I’m kind of picking on you here because you are least guilty of this failing relative to researchers in your reference class. You are actually saying anything at all, sometimes with detail, about how you feel about particular things. However, you wouldn’t be my first-pick judge for what’s useful; I’d rather live in a world where like half a dozen people in your reference class are spending non-zero time arguing about the details of the above agendas and how they interface with your broader models, so that the researchers working on those things can update based on those critiques (there may even be ways for people to apply the vector implied by y’all’s collective input, and generate something new / abandon their doomed plans).
there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem
There are plenty of cases where John can glance at what people are doing and see pretty clearly that it is not progress toward the hard problem.
Importantly, people with the agent foundations class of anxieties (which I embrace; I think John is worried about the right things!) do not spend time engaging on a gears level with prominent prosaic paradigms and connecting the high-level objection (“it ignores the hard part of the problem”) with the details of the research.
“But Tsvi and John actually spend a lot of time doing this.”
No, they don’t! They paraphrase the core concern over and over again, often seemingly without reading the paper. I don’t think reading the paper would change your minds (nor should it!), but I think there’s a culture problem tied to this off-hand dismissal of prosaic work: it disincentivizes potential agent foundations researchers (or researchers in some new thing that shares the core concerns of agent foundations) from engaging with, e.g., John.
Prosaic work is fraught, and much of it is doomed. New researchers over-index on tractability because short feedback loops are comforting (‘street-lighting’). Why aren’t we explaining why that is, on the terms of the research itself, rather than expecting people to be persuaded by the same high-level point getting hammered into them again and again?
I’ve watched this work in real time. If you listen to someone talk about their work, or read their paper and follow up in person, they are often receptive to a conversation about worlds in which their work is ineffective, evidence that we’re likely to be in such a world, and even to shifting the direction of their work in recognition of that evidence.
Instead, people with their eye on the ball are doing this tribalistic(-seeming) thing.
Yup, the deck is stacked against humanity solving the hard problems; for some reason, folks who know that are also committed to playing their hands poorly, and then blaming (only) the stacked deck!
John’s recent post on control is a counter-example to the above claims and was, broadly, a big step in the right direction, but it had some issues, raised by Redwood in the comments, that are a natural consequence of it being ~a new thing for John. I look forward to more posts like that in the future, from John and others, that help new entrants to empirical work (which has a robust talent pipeline!) understand, integrate, and even pivot in response to the hard parts of the problem.
[edit: I say ‘gears level’ a couple times, but mean ‘more in the direction of gears-level than the critiques that have existed so far’]
If you wrote this exact post, it would have been upvoted enough for the Redwood team to see it, and they would have engaged with you similarly to how they engaged with John here (modulo some familiarity, because these people all know each other at least somewhat, and in some pairs very well, actually).
If you wrote several posts like this, that were of some quality, you would lose the ability to appeal to your own standing as a reason not to write a post.
This is all I’m trying to transmit.
[edit: I see you already made the update I was encouraging, an hour after leaving the above comment to me. Yay!]
Writing (good) critiques is, in fact, a way many people gain standing. I’d push back on the part of you that thinks all of your good ideas will be ignored (some of them probably will be, but not all of them; you don’t know until you try, etc.).
More partial credit on the second-to-last point:
https://home.treasury.gov/news/press-releases/jy2766
Aside: I don’t think it’s just that real-world impacts take time to unfold. Lately I’ve felt that evals are only very weakly predictive of impact (because making great ones is extremely difficult). It could be that models available now don’t have substantially more mundane utility (economic potential stemming from first-order effects) than models available a year ago, outside of the domains the labs are explicitly targeting (like math and code).
Is the context on “reliable prediction and ELK via empirical route” just “read the existing ELK literature and actually follow it” or is it stuff that’s not written down? I assume you’ve omitted it to save time, and so no worries if the latter.
EDIT: I was slightly tempted to think of this also as ‘Ryan’s ranking of live agendas that aren’t control’, but I’m not sure whether ‘what you expect to work conditional on delegating to AIs’ is similar to ‘what you expect to work if humans are doing most of it’ (my guess is the lists would look similar, but with notable exceptions; e.g., humans pursuing GOFAI feels less viable than ML agents pursuing GOFAI).
My understanding is that ~6 months ago y’all were looking for an account of the tasks an automated AI safety researcher would hopefully perform, as part of answering the strategic question ‘what’s the next step after building [controlled] AGI?’ (with ‘actually stop there indefinitely’ being a live possibility).
This comment makes me think you’ve got that account of safety tasks to be automated, and are feeling optimistic about automated safety research.
Is that right, and can you share a decently mechanistic account of how automated safety research might work?
[I am often skeptical of (to strawman the argument) ‘make AI that makes AI safe’; I got the sense Redwood felt similarly, and now expect this may have changed.]
Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like “FATE-esque concerns fundamentally increase x-risk in ways that aren’t just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details.”
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Can you say more about the section I’ve bolded or link me to a canonical text on this tradeoff?
[was a manager at MATS until recently and want to flesh out the thing Buck said a bit more]
It’s common for researchers to switch subfields, and extremely common for MATS scholars to get work doing something different from what they did at MATS. (Kosoy has had scholars go on to ARC, Neel scholars have ended up in scalable oversight, Evan’s scholars have a massive spread in their trajectories; there are many more examples but it’s 3 AM.)
Also, I wouldn’t advise applying only to whatever seems interesting; I’d advise applying to literally everything (unless you know for sure you don’t want to work with Neel, since his app is very time-intensive). The acceptance rate is ~4 percent, so better to maximize your odds (again, for most scholars, the bulk of the value is not in their specific research output over the 10-week period, but in having the experience at all).
Also please see Ryan’s replies to Tsvi on the talent needs report for more notes on the street lighting concern as it pertains to MATS. There’s a pretty big back and forth there (I don’t cleanly agree with one side or the other, but it might be useful to you).
Your version of events requires a change of heart (for ‘them to get a whole lot more serious’). I’m just looking at the default outcome. Whether alignment is hard or easy (although not if it’s totally trivial), it appears to be progressing substantially more slowly than capabilities (and the parts of it that are advancing are the most capabilities-synergizing, so it’s unclear what the oft-lauded ‘differential advancement of safety’ really looks like).
By bad I mean dishonest, and by ‘we’ I mean the speaker (in this case, MIRI).
I take myself to have two central claims across this thread:
1. Your initial comment was strawmanning the ‘if we build [ASI], we all die’ position.
2. MIRI is likely not a natural fit to consign itself to service as the neutral mouthpiece of scientific consensus.
I do not see where your most recent comment has any surface area with either of these claims.
I do want to offer some reassurance, though:
I do not take “One guy who’s thought about this for a long time and some other people he recruited think it’s definitely going to fail” to be descriptive of the MIRI comms strategy.
Oh, I feel fine about saying ‘draft artifacts currently under production by the comms team ever cite someone who is not Eliezer, including experts with a lower p(doom)’, which, based on this comment, is what I take to be the goalpost. This is just regular coalition signaling, though, and not positioning yourself as, terminally, a neutral observer of consensus.
“You haven’t really disagreed that [claiming to speak for scientific consensus] would be more effective.”
That’s right! I’m really not sure about this. My experience has been that ~every take someone offers to normies in policy is preceded by ‘the science says…’, so maybe the market is kind of saturated here. I’d also worry that precommitting to only arguing in line with the consensus might bind you to act against your beliefs (and I think EY et al. have valuable inside-view takes that shouldn’t be stymied by the trends of an increasingly confused and poisonous discourse). That something is a local credibility win (I’m not sure it is, actually) doesn’t mean it has the best nth-order effects among all options long-term (including on the dimension of credibility).
I believe that Seth would find messaging that did this more credible. I think ‘we’re really not sure’ is a bad strategy if you really are sure, which MIRI leadership, famously, is.
I do mean ASI, not AGI. I know Pope + Belrose also mean to include ASI in their analysis, but it’s still helpful to me if we just use ASI here, so I’m not constantly wondering if you’ve switched to thinking about AGI.
Obligatory ‘no really, I am not speaking for MIRI here.’
My impression is that MIRI is not trying to speak for anyone else. Representing the complete scientific consensus is an undue burden to place on an org that has not made that claim about itself. MIRI represents MIRI, and is one component voice of the ‘broad view guiding public policy’, not its totality. No one person or org is in the chair with the lever; we’re all just shouting what we think in directions we expect the diffuse network of decision-makers to be sitting in, with more or less success. It’s true that ‘claiming to represent the consensus’ is a tack one can take to appear authoritative, and not (always) a dishonest move. To my knowledge, this is not MIRI’s strategy. This is the strategy of, e.g., the CAIS letter (although not of CAIS as a whole!), and occasionally AIS orgs cite expert consensus or specific, otherwise-disagreeing experts as having directional agreement with the org (for an extreme case, see Yann LeCun shortening his timelines). This is not the same as attempting to draw authority from the impression that one’s entire aim is simply ‘sharing consensus.’
And then my model of Seth says ‘Well we should have an org whose entire strategy is gathering and sharing expert consensus, and I’m disappointed that this isn’t MIRI, because this is a better strategy,’ or else cites a bunch of recent instances of MIRI claiming to represent scientific consensus (afaik these don’t exist, but it would be nice to know if they do). It is fair for you to think MIRI should be doing a different thing. Imo MIRI’s history points away from it being a good fit to take representing scientific consensus as its primary charge (and this is, afaict, part of why AI Impacts was a separate project).
I think MIRI comms are by and large well sign-posted to indicate ‘MIRI thinks x’ or ‘Mitch thinks y’ or ‘Bengio said z.’ If you think a single org should build influence and advocate for a consensus view, then help found one, or encourage someone else to do so. This just isn’t what MIRI is doing.
Good point—what I said isn’t true in the case of alignment by default.
Edited my initial comment to reflect this
(I work at MIRI but views are my own)
I don’t think ‘if we build it we all die’ requires that alignment be hard [edit: although it is incompatible with alignment by default]. It just requires that our default trajectory involves building ASI before solving alignment (and, looking at our present-day resource allocation, this seems very likely to be the world we are in, conditional on building ASI at all).
[I want to note that I’m being very intentional when I say “ASI” and “solving alignment” and not “AGI” and “improving the safety situation”]
What texts analogizing LLMs to human brains have you found most compelling?
Does it seem likely to you that, conditional on ‘slow bumpy period soon’, a lot of the funding we see at frontier labs dries up (so there’s kind of a double slowdown effect of ‘the science got hard, and also now we don’t have nearly the money we had to push global infrastructure and attract top talent’), or do you expect that frontier labs will stay well funded (either by leveraging low hanging fruit in mundane utility, or because some subset of their funders are true believers, or a secret third thing)?
Only the first few sections of the comment were directed at you; the last bit was a broader point re other commenters in the thread, the fooming shoggoths, and various in-person conversations I’ve had with people in the bay.
That rationalists and EAs tend toward aesthetic bankruptcy is one of my chronic bones to pick, because I do think it indicates the presence of some bias that doesn’t exist in the general population, which results in various blind spots.
Sorry for not signposting and/or limiting myself to a direct reply; that was definitely confusing.
I think you should give 1 or 2 a try, and would volunteer my time (although if you’d find a betting structure more enticing, we could say my time is free iff I turn out to be wrong, and otherwise you’d pay me).
Question for Ben:
Are you inviting us to engage with the object-level argument, or are you drawing attention to the existence of this argument from a not-obviously-unreasonable source as a phenomenon we are responsible for (and asking us to update on that basis)?
On my read, he’s not saying anything new (concerns around military application are why ‘we’ mostly didn’t start going to the government until ~2-3 years ago); the real tragedy is that he’s saying it while knowing enough to paint a reasonable-even-to-me picture of How This Thing Is Going.