The Field of AI Alignment: A Postmortem, and What To Do About It
A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”.
Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.
At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.
This post is on my best models of how we got here, and what to do next.
What This Post Is And Isn’t, And An Apology
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we’ll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].
The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I’ll link to:
My own Why Not Just… sequence
Nate’s How Various Plans Miss The Hard Bits Of The Alignment Challenge
(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.
Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what’s going on in researchers’ heads, such that they end up streetlighting. That means I’m going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I’m being totally unfair. And then when they try to defend themselves in the comments below, I’m going to say “please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here”. To all those researchers: yup, from your perspective I am in fact being very unfair, and I’m sorry. You are not the intended audience of this post, I am basically treating you like a child and saying “quiet please, the grownups are talking”, but the grownups in question are talking about you and in fact I’m trash talking your research pretty badly, and that is not fair to you at all.
But it is important, and this post just isn’t going to get done any other way. Again, I’m sorry.
Why The Streetlighting?
A Selection Model
First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it’s totally possible that the easy things are the right things to focus on!)
What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he’s making progress.
… of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he’ll probably keep going.
And that’s what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, has tangible outputs, and feels like he’s making progress, so he keeps going and funders keep funding him. How much impact Bob’s work has on humanity’s survival is very hard to measure, but the fact that he’s making progress on something is easy to measure, and the selection pressure rewards that easy metric.
Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.
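This selection dynamic is mechanical enough to sketch as a toy simulation. Here is a minimal Python sketch; all parameters (survival probabilities, head counts, round counts) are invented purely for illustration, and nothing in the post commits to these numbers. The rule is simply that funders keep anyone who produced legible output this round:

```python
import random

def simulate(n_researchers=1000, n_rounds=10, seed=0):
    """Toy model of the selection dynamic: each round, only researchers
    who produced legible output keep their funding."""
    rng = random.Random(seed)
    # Each researcher picks a problem. Hard problems rarely yield legible
    # output per round; easy problems usually do. (Made-up numbers.)
    field = [{"problem": rng.choice(["hard", "easy"])} for _ in range(n_researchers)]
    for _ in range(n_rounds):
        survivors = []
        for r in field:
            p_output = 0.1 if r["problem"] == "hard" else 0.7
            # Funders keep anyone with legible output this round.
            if rng.random() < p_output:
                survivors.append(r)
        field = survivors
    return field

field = simulate()
easy = sum(r["problem"] == "easy" for r in field)
print(f"{len(field)} researchers remain; {easy} work on easy problems")
```

The point is qualitative: even a modest gap in per-round legibility compounds over a few funding cycles into near-total dominance of the easy problems, without anyone ever deciding that the easy problems matter more.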
Selection and the Labs
Here’s a special case of the selection model which I think is worth highlighting.
Let’s start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we’ll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he’s honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers—let’s call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer’s model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul’s model… if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul’s model? Lo and behold, Sam endorses Paul’s model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam’s lab is not even Paul’s actual model, it’s a quite different version which just-so-happens to be even friendlier to racing toward AGI.)
A “Flinching Away” Model
While selection for researchers working on easy problems is one big central piece, I don’t think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there’s usually a tendency to gravitate toward easy things and away from hard things. What does that look like?
Carol follows a similar path to Alice: she’s interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn’t really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering—in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.
… and Carol’s thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.
It’s the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:
Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray’s workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon very reliably.)
Carol explicitly says that she’s not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
(Most common) Carol just doesn’t think about the fact that the easier problems don’t really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that’s all the encouragement she needs.
… but crucially, the details of the rationalizations aren’t that relevant to this post. Someone who’s flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they’ll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
Which brings us to the “what to do about it” part of the post.
What To Do About It
Let’s say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?
First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.
Second key thing to notice: if someone actually has traction on the hard problems, then the “flinching away” failure mode is probably circumvented.
So one obvious thing to focus on is getting traction on the problems.
… and in my experience, there are people who can get traction on the core hard problems. Most notably physicists: when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I’m picturing here e.g. the sort of crowd at the ILLIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (To be clear, though ILLIAD was a theory conference, I do not mean to imply that it’s only theorists who ever have any traction.) And they weren’t being selected away, because many of them were in fact doing work and making progress.
Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts?
How We Got Here
The main problem, according to me, is the EA recruiting pipeline.
On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs skews heavily toward STEM-focused undergrads, and looks pretty typical for that demographic. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.
… and that’s just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.
Who To Recruit Instead
We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly “physics postdoc”. Obviously that doesn’t mean we exclusively want physics postdocs; I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called “uncannily similar to a recommendation to readers to roll their own doctorate program”. Point is, it’s the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)
As an alternative to recruiting people who have the skills already, one could instead try to train people. I’ve tried that to some extent, and at this point I think there just isn’t a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.
Integration vs Separation
Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than… well, all the stuff that’s currently amplified.
This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we’ve already seen it to a very limited extent with the ILLIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.
I’m not convinced that the “hard parts” of alignment are difficult in the standardly-difficult, g-requiring way, i.e., demanding the kind of raw intelligence a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to large-scale structure, which is especially important for the alignment problem), not being efficient-market-pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.
Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth; it took ~ten years to go from his first inkling of relativity to special relativity, and he nearly gave up at many points (including the week before he figured it out). Figuring out knowledge at the edge of human understanding can just be… really fucking brutal. I feel like this is largely forgotten, or ignored, or just not understood. Partially that’s because in retrospect everything looks obvious, so it doesn’t seem like it could have been that hard, but partially it’s because almost no one tries to do this sort of work, so there aren’t societal structures erected around it, and hence little collective understanding of what it’s like.
Anyway, I suspect there are really strong selection pressures for who ends up doing this sort of thing, since a lot needs to go right: smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on. Indeed, the last point seems important to me—many great scientists are obsessed. Spend night and day on it, it’s in their shower thoughts, can’t put it down kind of obsessed. And I suspect this sort of has to be true because something has to motivate them to go against every conceivable pressure (social, financial, psychological) and pursue the strange meaning anyway.
I don’t think the EA pipeline selects much for pre-paradigmatic scientists, but I don’t think a lack of trying to get physicists to work on alignment is really the bottleneck either. Mostly I think selection effects are very strong; e.g., the Sequences were, imo, one of the more effective recruiting strategies for alignment. I don’t really know what to recommend here, but I think I would anti-recommend putting all the physics post-docs from good universities in a room in the hope that they make progress. Requesting that the world write another book as good as the Sequences is a… big ask, although to the extent it’s possible I expect it’ll go much further in drawing out people who will self-select into this rather unusual “job.”
I think this is right. A couple of follow-on points:
There’s a funding problem if this is an important route to progress. If good work is illegible for years, it’s hard to decide who to fund, and hard to argue for people to fund it. I don’t have a proposed solution, but I wanted to note this large problem.
Einstein did his pre-paradigmatic work largely alone. Better collaboration might’ve sped it up.
LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.
This makes the difficult pre-paradigmatic thinking a group effort instead of a solo effort. This could speed up progress dramatically.
This post and the resulting comments and discussions are an example of the community collectively doing much of the work you describe: traversing levels, practicing good epistemics, and remaining confused.
Having conversations with other LWers (on calls, by DM, or in extended comment threads) is tremendously useful for me. I could produce those same thoughts and critiques, but it would take me longer to arrive at all of those different viewpoints of the issue. I mention this to encourage others to do it. Communication takes time and some additional effort (asking people to talk), but it’s often well worth it. Talking to people who are interested in and knowledgeable on the same topics can be an enormous speedup in doing difficult pre-paradigmatic thinking.
LessWrong isn’t perfect, but it’s a vast improvement on the collaboration tools and communities that have been available to scientists in other fields. We should take advantage of it.
Agreed. Simply focusing on physics post-docs feels too narrow to me.
Then again, just as John has a particular idea of what good alignment research looks like, I have my own idea: I would lean towards recruiting folk with both a technical and a philosophical background. It’s possible that my own idea is just as narrow.
The post did explicitly say “Obviously that doesn’t mean we exclusively want physics postdocs”.
Thanks for clarifying. Still feels narrow as a primary focus.
My guess is that a roughly equally “central” problem is the incentive landscape around the OpenPhil/Anthropic school of thought:
Where you see Sam, I suspect something more like “the lab memeplexes”. Lab superagents have instrumentally convergent goals, and those goals lead to instrumentally convergent beliefs, and also to instrumental blindspots.
There are strong incentives for individual people to adjust their beliefs: money, social status, a sense of importance via being close to the Ring.
There are also incentives for the people setting some of those incentives: funding something that is making progress on something looks more successful, and is easier, than funding the dreaded theory.
Good post, although I have some misgivings about how unpleasant it must be to read for some people.
One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI’s extreme pessimism deflates motivation and naturally produces the thought “if they are right then we’re doomed anyway, so might as well assume they are wrong”.
Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I’m also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization[2].
MIRI made important progress in agent foundations, but also missed an opportunity to do much more. And, while the AI game board is grim, their extreme pessimism is unwarranted overconfidence. Our understanding of AI and agency is poor: this is a strong reason to be pessimistic, but it’s also a reason to maintain some uncertainty about everything (including e.g. timelines).
Now, about what to do next. I agree that we need to have our own non-streetlighting community. In my book, “non-streetlighting” means mathematical theory plus empirical research that is theory-oriented: designed to test hypotheses made by theoreticians and produce the data that best informs theoretical research (these are ~necessary but insufficient conditions for non-streetlighting). This community can and should engage with the rest of AI safety, but has to be sufficiently undiluted to have healthy memetics and cross-fertilization.
What does a community look like? It looks like our own organizations, conferences, discussion forums, training and recruitment pipelines, academic labs, maybe journals.
From my own experience, I agree that potential contributors should mostly have skills and knowledge on the level of PhD+. Highlighting physics might be a valid point: I have a strong background in physics myself. Physics teaches you a lot about connecting math to real-world problems, and is also in itself a test-ground for formal epistemology. However, I don’t think a background in physics is a necessary condition. At the very least, in my own research programme I have significant room for strong mathematicians that are good at making progress on approximately-concrete problems, even if they won’t contribute much on the more conceptual/philosophic level.
Which is, creating mathematical theory and tools for understanding agents.
I mostly didn’t feel comfortable talking about it in the past, because I was on MIRI’s payroll. This is not MIRI’s fault by any means: they never pressured me to avoid voicing opinions. It still feels unnerving to speak out against the people who write your paycheck.
(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)
The “What To Do About It” section dances around but doesn’t explicitly name one of the core challenges of theoretical agent-foundations work that aims to solve the “hard bits” of the alignment challenge: the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end, instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Connor Leahy concisely put it:
He was talking about philosophy in particular at that juncture, in response to Wei Dai’s concerns over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before.
Do they get traction on “core hard problems” because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective on this matter, but I just… still completely disagree, for reasons other people have pointed out (see also Vanessa Kosoy’s comment here). Is this also supposed to be an implicitly assumed bit of background material?
And when we don’t have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains to which it has been extended) but debate among “physics postdocs” over whether it’s worthwhile to keep funding and pursuing it keeps raging on, as a Theory of Everything keeps eluding our grasp? I’m sure people with more object-level expertise on this can correct my potential misconceptions if need be.
Idk man, some days I’m half-tempted to believe that all non-prosaic alignment work is a bunch of “streetlighting.” Yeah, it doesn’t result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world, based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details which would serve as evidence of whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding, etc., that will result in the first batch of really powerful AIs. The contents and success stories of Vanessa Kosoy’s desiderata, or of your own search for natural abstractions, or of Alex Altair’s essence of agent foundations, or of Orthogonal’s QACI, etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.
There’s a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don’t generalize well at all. And it’s only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you’ve chosen is in the former category as opposed to the latter.
(I’m briefly noting that I don’t fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad’s Open Agency Architecture, CHAI’s CIRL and utility uncertainty, Steve Byrnes’s work on brain-like AGI safety, etc., that don’t necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)
I actually disagree that the natural abstractions research is ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis itself, actually sort of holds true for today’s AI, and is thus the most likely of the theoretical/agent-foundations approaches to work (I’m usually critical of agent foundations, but John Wentworth’s work is an exception that I’d like funding for).
For example, this post runs an experiment showing that the Platonic Representation Hypothesis still holds on OOD data, meaning that deeper factors are likely at play than just shallow similarity:
https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in
I’m wary of a possible equivocation about what the “natural abstraction hypothesis” means here.
If we are referring to the redundant information hypothesis and various kinds of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job.
But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well seems plausible (not necessarily >50% likely), and these abstractions being “natural” (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search is even close to possible.
And indeed, it is the latter question that represents the make-or-break moment for natural abstractions’ theory of change, for it is only when the model in front of you (as opposed to some other idealized model) uses these specific abstractions that you can look through the AI’s internal concepts and find your desired alignment target.
Rohin Shah has already explained the basic reasons why I believe the mesa-optimizer-type search probably won’t exist/be findable in the inner workings of the models we encounter: “Search is computationally inefficient relative to heuristics, and we’ll be selecting really hard on computational efficiency.” And indeed, when I look at the only general intelligences I have ever encountered in my entire existence thus far, namely humans, I see mostly just a kludge of impulses and heuristics that depend very strongly (almost entirely) on our specific architectural make-up and the contextual feedback we encounter in our path through life. Change either of those and the end result shifts massively.
And even moving beyond that, is the concept of the number “three” a natural abstraction? Then I see entire collections and societies of (generally intelligent) human beings today who don’t adopt it. Are the notions of “pressure” and “temperature” and “entropy” natural abstractions? I look at all human beings in 1600 and note that not a single one of them had ever correctly conceptualized a formal version of any of those; and indeed, even making a conservative estimate of the human species (with an essentially unchanged modern cognitive architecture) having existed for 200k years, this means that for 99.8% of our species’ history, we had no understanding whatsoever of concepts as “universal” and “natural” as that. If you look at subatomic particles like electrons or stuff in quantum mechanics, the percentage manages to get even higher. And that’s only conditioning on abstractions about the outside world that we have eventually managed to figure out; what about the other unknown unknowns?
I don’t think it shows that at all, since I have not been able to find any analysis of the methodology, data generation, discussion of results, etc. With no disrespect to the author (who surely wasn’t intending for his post to be taken as authoritative as a full paper in terms of updating towards his claim), this is shoddy science, or rather not science at all, just a context-free correlation matrix.
Anyway, all this is probably more fit for a longer discussion at some point.
I think this statement is quite ironic in retrospect, given how OpenAI’s o-series seems to work (at train-time and at inference-time both), and how much AI researchers hype it up.
By contrast, my understanding is that the sort of search John is talking about retargeting isn’t the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search.
So it is in fact the ML researchers now who believe in the superiority of the computationally inefficient search, not the agency theorists.
Re the OpenAI o-series and search, my initial prediction is that Q*/MCTS search will work well on problems that are easy to verify and easy to get training data for, and not work if either of those two conditions is violated. Secondarily, it will rely on the model having good error-correction capabilities to use the search effectively. This is why I expect we can make RL capable of superhuman performance on mathematics/programming with some rather moderate schlep/drudge work, and I also expect cost reductions such that it can actually be practical, but I’m only giving a 50/50 chance of superhuman performance, as measured by benchmarks in these domains, by 2028.
I think my main difference from you, Thane Ruthenis is I expect costs to reduce surprisingly rapidly, though this is admittedly untested.
This will accelerate AI progress but not immediately cause an AI explosion. In the more extreme cases, though, it could create a scenario where programming companies are founded by a few people smartly managing a lot of programming AIs, and programming/mathematics experiences something like what happened to the news industry with the rise of the internet: a lot of bankruptcy in the middle end, the top end winning big, and most people ending up at the bottom end.
Also, correct point on how a lot of people’s conceptions of search are babble-and-prune, not top-down search like MCTS/Q*/BFS/DFS/A* (not specifically targeted at sunwillrise).
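The babble-and-prune vs. top-down distinction can be made concrete with a toy sketch (the grid task and all names here are invented for illustration, not anything proposed in the thread): both solvers reach the same goal cell, but the first blindly generates whole plans and discards misses, while the second uses a Manhattan-distance heuristic to expand only states that look promising.

```python
import heapq
import random

GOAL = (3, 3)  # target cell on a 7x7 grid, starting from (0, 0)
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def in_bounds(p):
    return 0 <= p[0] <= 6 and 0 <= p[1] <= 6

def babble_and_prune(max_tries=100_000, seed=0):
    """Babble whole random move sequences; prune every one that misses the goal."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        pos = (0, 0)
        for _ in range(6):  # a shortest plan is exactly 6 moves
            dx, dy = rng.choice(MOVES)
            pos = (pos[0] + dx, pos[1] + dy)
        if pos == GOAL:
            return attempt  # candidate plans examined before a hit
    return None

def a_star():
    """Top-down search: a Manhattan-distance heuristic constrains expansion."""
    def h(p):
        return abs(GOAL[0] - p[0]) + abs(GOAL[1] - p[1])
    frontier = [(h((0, 0)), 0, (0, 0))]  # (cost + heuristic, cost, state)
    done = set()
    expanded = 0
    while frontier:
        _, cost, pos = heapq.heappop(frontier)
        if pos in done:
            continue
        done.add(pos)
        expanded += 1
        if pos == GOAL:
            return expanded  # states expanded before reaching the goal
        for dx, dy in MOVES:
            nxt = (pos[0] + dx, pos[1] + dy)
            if in_bounds(nxt) and nxt not in done:
                heapq.heappush(frontier, (cost + 1 + h(nxt), cost + 1, nxt))
    return None

print("plans babbled:", babble_and_prune(), "| states expanded:", a_star())
```

On this grid the heuristic never expands a state pointing away from the goal, so A* touches only the 16 states between start and goal, while the babbler typically burns through a few hundred full plans before one lands; the gap widens combinatorially with problem size.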
I’m not strongly committed to the view that the costs won’t rapidly reduce: I can certainly see the worlds in which it’s possible to efficiently distill trees-of-thought unrolls into single chains of thoughts. Perhaps it scales iteratively, where we train a ML model to handle the next layer of complexity by generating big ToTs, distilling them into CoTs, then generating the next layer of ToTs using these more-competent CoTs, etc.
Or perhaps distillation doesn’t work that well, and the training/inference costs grow exponentially (combinatorially?).
Yeah, we will have to wait at least several years.
One confound in all of this is that big talent is moving out of OpenAI, which means I’m more bearish on the company’s future prospects specifically without it being that much of a detriment towards progress towards AGI.
Yeah, it hasn’t been shown that these abstractions can ultimately be retargeted by default for today’s AI.
Even if I’d agree with your conclusion, your argument seems quite incorrect to me.
That’s what math always is. The applicability of any math depends on how well the mathematical models reflect the situation involved.
It seems very unlikely to me that you’d have many ‘similar-looking mathematical models’. If a class of real-world situations seems to be abstracted in multiple ways such that you have hundreds (not even millions) of mathematical models that supposedly could capture its essence, maybe you are making a mistake somewhere in your modelling. Abstract away the variations. From my experience, you may have a small bunch of mathematical models that could likely capture the essence of the class of real-world situations, and you may debate with your friends about which one is more appropriate, but you will not have ‘multiple similar-looking models’.
Nevertheless, I agree with your general sentiment. I feel like humans will find it quite difficult to make research progress without concrete feedback loops, and that actually trying stuff with existing models (that is, the stuff that Anthropic and Apollo are doing, for example) provides valuable data points.
I also recommend maybe not spending so much time reading LessWrong and instead reading STEM textbooks.
(I haven’t read your comments you link, so apologies if you’ve already responded to this point before).
I can’t speak to most of these simply out of lack of deep familiarity, but I don’t think natural abstractions is disqualified at all by this.
What do we actually want out of interpretability? I don’t think mechanistic interpretability, as it stands currently, gives us explanations of the form we actually want. For example, what are a model’s goals? Is it being deceptive? To get answers to those questions, you want to first know what those properties actually look like—you can’t get away with identifying activations corresponding to how to deceive humans, because those could relate to a great number of things (e.g. modelling other deceptive agents). Composability is a very non-trivial problem.
If you want to answer those questions, you need to find a way to get better measures of whatever property you want to understand. This is the central idea behind Eliciting Latent Knowledge and other work that aims for unsupervised honesty (where the property is honesty), what I call high-level interpretability of inner search / objectives, etc.
Natural abstractions is more agnostic about what kinds of properties we would care about, and tries to identify universal building blocks for any high-level property like this. I am much more optimistic about picking a property and going with it, and I think this makes the problem easier, but that seems like a different disagreement than yours considering both are inevitably somewhat conceptual and require more prescriptive work than work focusing solely on frontier models.
If you want good handles to steer your model at all, you’re going to have to do something like figuring out the nature of the properties you care about. You can definitely make that problem easier by focusing on how those properties instantiate in specific classes of systems like LLMs or neural nets (and I do in my work), but you still have to deal with a similar version of the problem in the end. John is sceptical enough of this paradigm being the one that leads us to AGI that he doesn’t want to bet heavily on his work only being relevant if that turns out to be true, which I think is pretty reasonable.
(These next few sentences aren’t targeted at you in particular). I often see claims made of the form: “[any work that doesn’t look like working directly with LLMs] hasn’t updated on the fact that LLMs happened”. Sometimes that’s true! But very commonly I also see the claim made without understanding what that work is actually trying to do, or what kind of work we would need to reliably align / interpret super-intelligent LLMs-with-RL. I don’t know whether it’s true of the other agent foundations work you link to, but I definitely don’t think natural abstractions hasn’t updated on LLMs being the current paradigm.
Agreed that this is a plausible explanation of what’s going on. I think that the bottleneck on working on good directions in alignment is different though, so I don’t think the analogy carries over very well. I think that reliable feedback loops are very important in alignment research as well to be clear, I just don’t think the connection to physicists routes through that.
Epistemic status: This is a work of satire. I mean it—it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.
A minivan is driving down a mountain road, headed towards a cliff’s edge with no guardrails. The driver floors the accelerator.
Passenger 1: “Perhaps we should slow down somewhat.”
Passengers 2, 3, 4: “Yeah, that seems sensible.”
Driver: “No can do. We’re about to be late to the wedding.”
Passenger 2: “Since the driver won’t slow down, I should work on building rocket boosters so that (when we inevitably go flying off the cliff edge) the van can fly us to the wedding instead.”
Passenger 3: “That seems expensive.”
Passenger 2: “No worries, I’ve hooked up some funding from Acceleration Capital. With a few hours of tinkering we should get it done.”
Passenger 1: “Hey, doesn’t Acceleration Capital just want vehicles to accelerate, without regard to safety?”
Passenger 2: “Sure, but we’ll steer the funding such that the money goes to building safe and controllable rocket boosters.”
The van doesn’t slow down. The cliff looks closer now.
Passenger 3: [looking at what Passenger 2 is building] “Uh, haven’t you just made a faster engine?”
Passenger 2: “Don’t worry, the engine is part of the fundamental technical knowledge we’ll need to build the rockets. Also, the grant I got was for building motors, so we kinda have to build one.”
Driver: “Awesome, we’re gonna get to the wedding even sooner!” [Grabs the engine and installs it. The van speeds up.]
Passenger 1: “We’re even less safe now!”
Passenger 3: “I’m going to start thinking about ways to manipulate the laws of physics such that (when we inevitably go flying off the cliff edge) I can manage to land us safely in the ocean.”
Passenger 4: “That seems theoretical and intractable. I’m going to study the engine to figure out just how it’s accelerating at such a frightening rate. If we understand the inner workings of the engine, we should be able to build a better engine that is more responsive to steering, therefore saving us from the cliff.”
Passenger 1: “Uh, good luck with that, I guess?”
Nothing changes. The cliff is looming.
Passenger 1: “We’re gonna die if we don’t stop accelerating!”
Passenger 2: “I’m gonna finish the rockets after a few more iterations of making engines. Promise.”
Passenger 3: “I think I have a general theory of relativity as it relates to the van worked out...”
Passenger 4: “If we adjust the gear ratio… Maybe add a smart accelerometer?”
Driver: “Look, we can discuss the benefits and detriments of acceleration over hors d’oeuvres at the wedding, okay?”
Unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won’t.
Driver: My map doesn’t show any cliffs
Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile
Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?
Driver: No, passenger 1, see on your map that’s an alternate route, the route we’re on doesn’t show any cliffs.
Passenger 1: You don’t have it set to show terrain.
Passenger 6: I’m on the phone with the governor now, we’re talking about what it would take to set a 5 mile per hour national speed limit.
Passenger 7: Don’t you live in a different state?
Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.
Driver and all passengers in unison: Shut up passenger 5, we’re trying to figure out if we’re going to fall off a cliff here, and if so what we should do about it.
Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.
Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.
My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.
Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn’t apply to them. The process of developing an ML model isn’t very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can’t be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.
But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It’s telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn’t make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.
Why have ungrounded agent foundations assumptions stuck around for so long? There are a couple factors that are likely at work:
Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn’t. There’s much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted.
Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.
You do realize that by “alignment”, the OP (John) is not talking about techniques that prevent an AI that is less generally capable than a capable person from insulting the user or expressing racist sentiments?
We seek a methodology for constructing an AI that either ensures that the AI turns out not to be able to easily outsmart us or (if it does turn out to be able to easily outsmart us) ensures (or makes it likely) that it won’t kill us all or do some other terrible thing. (The former is not researched much compared to the latter, but I felt the need to include it for completeness.)
The way it is now, it is not even clear whether you and the OP (John) are talking about the same thing (because “alignment” has come to have a broad meaning).
If you want to continue the conversation, it would help to know whether you see a pressing need for a methodology of the type I describe above. (Many AI researchers do not: they think that outcomes like human extinction are quite unlikely or at least easy to avoid.)
I am a physics PhD student. I study field theory. I have a list of projects I’ve thrown myself at with inadequate technical background (to start with) and figured out. I’ve convinced a bunch of people at a research institute that they should keep giving me money to solve physics problems. I’ve been following LessWrong with interest for years. I think that AI is going to kill us all, and would prefer to live for longer if I can pull it off. So what do I do to see if I have anything to contribute to alignment research? Maybe I’m flattering myself here, but I sound like I might be a person of interest for people who care about the pipeline. I don’t feel like a great candidate because I don’t have any concrete ideas for AI research topics to chase down, but it sure seems like I might start having ideas if I worked on the problem with somebody for a bit. I’m apparently very ok with being an underpaid gopher to someone with grand theoretical ambitions while I learn the material necessary to come up with my own ideas. My only lead to go on is “go look for something interesting in MATS and apply to it” but that sounds like a great way to end up doing streetlight research because I don’t understand the field. Ideally, I guess I would have whatever spark makes people dive into technical research in a pretty low-status field for no money for long enough to produce good enough research which convinces people to pay their rent while they keep doing more, but apparently the field can’t find enough of those that it’s unwilling to look for other options.
I know what to do to keep doing physics research. My TA assignment effectively means that I have a part-time job teaching teenagers how to use Newton’s laws so I can spend twenty or thirty hours a week coding up quark models. I did well on a bunch of exams to convince an institution that I am capable of the technical work required to do research (and, to be fair, I provide them with 15 hours per week of below-market-rate intellectual labor which they can leverage into tuition that more than pays my salary), so now I have a lot of flexibility to just drift around learning about physics I find interesting while they pay my rent. If someone else is willing to throw 30,000 dollars per year at me to think deeply about AI and get nowhere instead of thinking deeply about field theory to get nowhere, I am not aware of them. Obviously the incentives are perverse to just go around throwing money at people who might be good at AI research, so I’m not surprised that I’ve only found one potential money spigot for AI research, but I had so many to choose from for physics.
It sounds like you should apply for the PIBBSS Fellowship! (https://pibbss.ai/fellowship/)
Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the “streetlight research” project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.
MATS will push you to streetlight much more unless you have some special ability to have it not do that.
Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren’t useful—I think some MATS projects are pretty useless (e.g. our dear OP’s). But it’s still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.
Remember that the top-level commenter here is currently a physicist, so it’s not like the usefulness of their work would be going down by doing a useless MATS project :P
Yes it would! It would eat up motivation and energy and hope that they could have put towards actual research. And it would put them in a social context where they are pressured to orient themselves toward streetlighty research—not just during the program, but also afterward. Unless they have some special ability to have it not do that.
Without MATS: not currently doing anything directly useful (though maybe indirectly useful, e.g. gaining problem-solving skill). Could, if given $30k/year, start doing real AGI alignment thinking ~~from scratch~~ not from scratch, thereby scratching their “will you think in a way that unlocks understanding of strong minds” lottery ticket that each person gets.
With MATS: gotta apply to the extension, write my LTFF grant. Which org should I apply to? Should I do linear probes? Software engineering? Or evals? Red teaming? CoT? Constitution? Hyperparameter gippity? Honeypot? Scaling supervision? Superalign, better than regular align? Detecting deception?
Obviously I disagree with Tsvi regarding the value of MATS to the proto-alignment researcher; I think being exposed to high quality mentorship and peer-sourced red-teaming of your research ideas is incredibly valuable for emerging researchers. However, he makes a good point: ideally, scholars shouldn’t feel pushed to write highly competitive LTFF grant applications so soon into their research careers; there should be longer-term unconditional funding opportunities. I would love to unlock this so that a subset of scholars can explore diverse research directions for 1-2 years without 6-month grant timelines looming over them. Currently cooking something in this space.
The first step would probably be to avoid letting the existing field influence you too much. Instead, consider from scratch what the problems of minds and AI are, how they relate to reality and to other problems, and try to grab them with intellectual tools you’re familiar with. Talk to other physicists and try to get into exploratory conversation that does not rely on existing knowledge. If you look at the existing field, look at it like you’re studying aliens anthropologically.
[was a manager at MATS until recently and want to flesh out the thing Buck said a bit more]
It’s common for researchers to switch subfields, and extremely common for MATS scholars to get work doing something different from what they did at MATS. (Kosoy has had scholars go on to ARC, Neel scholars have ended up in scalable oversight, Evan’s scholars have a massive spread in their trajectories; there are many more examples but it’s 3 AM.)
Also I wouldn’t advise applying to something that seems interesting; I’d advise applying for literally everything (unless you know for sure you don’t want to work with Neel, since his app is very time intensive). The acceptance rate is ~4 percent, so better to maximize your odds (again, for most scholars, the bulk of the value is not in their specific research output over the 10 week period, but in having the experience at all).
Also please see Ryan’s replies to Tsvi on the talent needs report for more notes on the street lighting concern as it pertains to MATS. There’s a pretty big back and forth there (I don’t cleanly agree with one side or the other, but it might be useful to you).
You could consider doing MATS as “I don’t know what to do, so I’ll try my hand at something a decent number of apparent experts consider worthwhile and meanwhile bootstrap a deep understanding of this subfield and a shallow understanding of a dozen other subfields pursued by my peers.” This seems like a common MATS experience and I think this is a good thing.
I am surprised that you find theoretical physics research less tight funding-wise than AI alignment [is this because the paths to funding in physics are well-worn, rather than better resourced?].
This whole post was a little discouraging. I hope that the research community can find a way forward.
If you’re mobile (able to be in the UK) and willing to try a different lifestyle, consider going to the EA hotel aka CEEALAR; they offer free food and accommodation to a bunch of people, including many working on AI safety. Alternatively, taking a quick look at https://www.aisafety.com/funders, the current best options are maybe LTFF, OpenPhil, CLR, or maybe AE Studios?
Cf. https://www.lesswrong.com/posts/QzQQvGJYDeaDE4Cfg/talent-needs-of-technical-ai-safety-teams?commentId=BNkpTqwcgMjLhiC8L
https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations?commentId=apD6dek5zmjaqeoGD
https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=uMaQvtXErEqc67yLj
Thank you for writing this post. I’m probably slightly more optimistic than you on some of the streetlighting approaches, but I’ve also been extremely frustrated that we don’t have anything better, when we could.
I’ve seen discussions from people who I vehemently disagreed that did similar things and felt very frustrated by not being able to defend my views with greater bandwidth. This isn’t a criticism of this post—I think a non-zero number of those are plausibly good—but: I’d be happy to talk at length with anyone who feels like this post is unfair to them, about our respective views. I likely can’t do as good a job as John can (not least because our models aren’t identical), but I probably have more energy for talking to alignment researchers[1].
I disagree on two counts. I think people simply not thinking about what it would take to make superintelligent AI go well is a much, much bigger and more common cause for failure than the others, including flinching away. Getting traction on hard problems would solve the problem only if there weren’t even easier-traction (or more interesting) problems that don’t help. Very anecdotally, I’ve talked to some extremely smart people who I would guess are very good at making progress on hard problems, but just didn’t think too hard about what solutions help.
I think the skills to do that may be correlated with physics PhDs, but more weakly. I don’t think recruiting smart undergrads was a big mistake for that reason. Then again, I only have weak guesses as to what things you should actually select for such that you get people with these skills—there are still definitely failure modes, like people who find the hard problems but aren’t very good at making traction on them (or people who overshoot on finding the hard problem and work on something nebulously hard).
My guess would be that a larger source of “what went wrong” follows from incentives like “labs doing very prosaic alignment / interpretability / engineering-heavy work” → “selecting for people who are very good engineers or the like” → “selects for people who can make immediate progress on hand-made problems without having to spend a lot of time thinking about what broad directions to work on or where locally interesting research problems are not-great for superintelligent AI”.
In the past I’ve done this much more adversarially than I’d have liked, so if you’re someone who was annoyed at having such a conversation with me before—I promise I’m trying to be better about that.
I don’t want to respond to the examples rather than the underlying argument, but it seems necessary here: this seems like a massively overconfident claim about ELK and debate that I don’t believe is justified by popular theoretical worst-case objections. I think a common cause of “worlds where iterative design fails” is “iterative design seems hard and we stopped at the first apparent hurdle.” Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., “no-go theorems”); but that is not the case here. If I felt confident that the theory-level objections to ELK and debate ruled out hodge-podge solutions, I would abandon hope in these research directions and drop them from the portfolio. But this “flinching away” would ensure that genuine progress on these thorny problems never happened. If Stephen Casper, et al. treated ELK as unsolvable, they would never have published “Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs”. If Akbir Khan, et al. treated debate as unsolvable, they would never have published “Debating with More Persuasive LLMs Leads to More Truthful Answers.” I consider these papers genuine progress towards hodge-podge alignment, which I consider a viable strategy towards “alignment MVPs.” Many more such cases.
A crucial part of my worldview is that good science does not have to “begin with the entire end in mind.” Exploratory research pushing the boundaries of the current paradigm might sometimes seem contraindicated by theoretical arguments, but achieve empirical results anyways. The theory-practice gap can cut both ways: sometimes sound-seeming theoretical arguments fail to predict tractable empirical solutions. Just because travelling salesman is NP-hard doesn’t mean that a P-time algorithm for a restricted subclass of the problem doesn’t exist! I do not feel sufficiently confident in the popular criticisms of dominant prosaic AI safety strategies to rule them out; far from it. I want new researchers to download these criticisms, as there is valuable insight here, but “touch reality” anyways, rather than flinch away from the messy work of iteratively probing and steering vast, alien matrices.
Maybe research directions like ELK and debate fizzle out because the theoretical objections hold up in practice. Maybe they fizzle out for unrelated reasons, such as weird features of neural nets, or just because we don’t have time for the requisite empirical tinkering. But we would never find out unless we tried! And we would never equip a generation of empirical researchers to iterate on candidate solutions until something breaks. It’s this skill that I think is most lacking in alignment: turning theoretical builder-breaker moves into empirical experiments on frontier LLMs (the apparent most likely substrate of first-gen AGI) that move us iteratively closer to a sufficiently scalable hodge-podge alignment MVP.
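The NP-hardness point above can be made runnable with a toy (the collinear special case and the function names are my own illustrative pick, not something from the comment): TSP is NP-hard in general, yet restricted to cities on a line it is solved exactly in linear time, and a brute-force check confirms the shortcut.

```python
import itertools

def tsp_brute_force(xs):
    """Exact shortest tour over 1-D cities by trying every permutation: O(n!)."""
    n = len(xs)
    best = float("inf")
    for perm in itertools.permutations(range(1, n)):  # pin city 0 to cut symmetry
        order = (0,) + perm
        length = sum(abs(xs[order[i]] - xs[order[(i + 1) % n]]) for i in range(n))
        best = min(best, length)
    return best

def tsp_collinear(xs):
    """Restricted subclass: collinear cities. The optimal tour walks out and back."""
    return 2 * (max(xs) - min(xs))

cities = [3, -1, 7, 0, 5]
assert tsp_brute_force(cities) == tsp_collinear(cities) == 16
```

The general-purpose solver and the special-case formula agree on every collinear instance; the restriction is what buys tractability, which is the shape of the hope for “restricted but real-world-relevant” versions of worst-case problems like ELK.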
Yeah, the worst-case ELK problem could well have no solution, but in practice alignment is solvable, either by other methods or by an ELK solution that does work on a large class of AIs like neural nets, so Alice is plausibly making a big mistake. A crux here is that I don’t believe we will ever get no-go theorems, or even arguments at the standard level of rigor in physics, because I believe alignment has pretty lax constraints, so a lot of solutions can appear.
The relevant sentence below:
Some caveats:
A crucial part of the “hodge-podge alignment feedback loop” is “propose new candidate solutions, often grounded in theoretical models.” I don’t want to entirely focus on empirically fleshing out existing research directions to the exclusion of proposing new candidate directions. However, it seems that, often, new on-paradigm research directions emerge in the process of iterating on old ones!
“Playing theoretical builder-breaker” is an important skill and I think this should be taught more widely. “Iterators,” as I conceive of them, are capable of playing this game well, in addition to empirically testing these theoretical insights against reality. John, to his credit, did a great job of emphasizing the importance of this skill with his MATS workshops on the alignment game tree and similar.
I don’t want to entirely trust in alignments MVPs, so I strongly support empirical research that aims to show the failure modes of this meta-strategy. I additionally support the creation of novel strategic paradigms, though I think this is quite hard. IMO, our best paradigm-level insights as a field largely come from interdisciplinary knowledge transfer (e.g., from economics, game theory, evolutionary biology, physics), not raw-g ideas from the best physics postdocs. Though I wouldn’t turn away a chance to create more von Neumann’s, of course!
Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisation when working as a group to reach decisions. These are the “outcome game” and the “consensus game.”
In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.
The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism’s key strength is its ability to make outcome games more relevant.
However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence.
This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.
As a result, it is unsurprising that most alignment efforts avoid tackling seemingly intractable problems. The incentive structures in the field encourage individuals to play the consensus game instead.
Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There’s plenty of feedback, that’s not already being taken advantage of, at the level of “abstract, high-level, philosophy of mind”, about the questions of alignment.
Thanks for linking this post. I think it has a nice harmony with Prestige vs Dominance status games.
I agree that this is a dynamic that is strongly shaping AI Safety, but would specify that it’s inherited from the non-profit space in general—EA originated with the claim that it could do outcome-focused altruism, but… there’s still a lot of room for improvement, and I’m not even sure we’re improving.
The underlying dynamics and feedback loops are working against us, and I don’t see evidence that core EA funders/orgs are doing more than pay lip service to this problem.
Yep. This post is not for me but I’ll say a thing that annoyed me anyway:
Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. “sensor tampering”) or giving novel arguments that problems are difficult is socially rewarded.)
Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud “ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it’s vaguely related to AI and safety”.
To the extent that anecdata is meaningful:
I have met somewhere between 100-200 AI Safety people in the past ~2 years; people for whom AI Safety is their ‘main thing’.
The vast majority of them are doing tractable/legible/comfortable things. Most are surprisingly naive; have less awareness of the space than I do (and I’m just a generalist lurker who finds this stuff interesting; not actively working on the problem).
Few are actually staring into the void of the hard problems; where hard here is loosely defined as ‘unknown unknowns, here be dragons, where do I even start’.
Fewer still progress from staring into the void to actually trying things.
I think some amount of this is natural and to be expected; I think even in an ideal world we probably still have a similar breakdown—a majority who aren’t contributing (yet)[1], a minority who are—and I think the difference is more in the size of those groups.
I think it’s reasonable to aim for a larger, higher quality, minority; I think it’s tractable to achieve progress through mindfully shaping the funding landscape.
Think it’s worth mentioning that all newbies are useless, and not all newbies remain newbies. Some portion of the majority are actually people who will progress to being useful after they’ve gained experience and wisdom.
This isn’t clear to me, where the crux (though maybe it shouldn’t be) is “is it feasible for any substantial funders to distinguish actually-trying research from other”.
Yeah, I agree sometimes people decide to work on problems largely because they’re tractable [edit: or because they’re good for safely getting alignment research or other good work out of early AGIs]. I’m unconvinced of the flinching away or dishonest characterization.
Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with “how to make world-as-we-know-it ending AGI without it killing everyone”, is going to people who don’t even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren’t aware of it, don’t you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?
I do not.
On the contrary, I think ~all of the “alignment researchers” I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don’t know are likely substantially worse but not a ton.)
In particular I think all of the alignment-orgs-I’m-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.
This doesn’t feel odd: these people are smart and actually care about the big problem; if their work was in the “even if this succeeds it obviously wouldn’t be helpful” category they’d want to know (and, given the “obviously,” would figure that out).
Possibly the situation is very different in academia or MATS-land; for now I’m just talking about the people around me.
I wonder whether John believes that well-liked research, e.g. Fabien’s list, is actually not valuable or rare exceptions coming from a small subset of the “alignment research” field.
I strongly suspect he thinks most of it is not valuable
I feel like John’s view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I’m pretty sure that’s false.) I assume John doesn’t believe that, and I wonder why he doesn’t think his view entails it.
From the post:
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
The flinches aren’t structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.
As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it’s impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise—that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping “can an AI be very superhumanly capable” to “no”. That clamping causes them to also not see the flaws in the plan “we’ll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly”, because they don’t think RSI is feasible, they don’t think extreme persuasion is feasible, etc.
A more real example is, say, people thinking of “structures for decision making”, e.g. constitutions. You explain that these structures are not reflectively stable. And now this person can’t understand reflective stability in general, so they don’t understand why steering vectors won’t work, or why lesioning won’t work, etc.
Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.
(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)
Here’s a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:
He describes how engineering artifacts often precede theoretical understanding and that deep learning worked empirically for a long time before we began to understand it theoretically. He says that researchers ignored deep learning because it didn’t fit into their existing models of how learning should work.
I think the high-level lesson from the Facebook post is that streetlighting occurs when we try to force reality to be understood in terms of our existing models of how it should work (incorrect models like phlogiston are common in the history of science). Though this LessWrong post argues that streetlighting occurs when researchers have a bias towards working on easier problems.
Instead, a better approach is to allow reality and evidence to dictate how we create our models of the world, even if those more correct models are more complex or require major departures from existing models (which creates a temptation to ‘flinch away’). I think a prime example of this is quantum mechanics: my understanding is that physicists noticed bizarre results from experiments like the double-slit experiment and developed new theories (e.g. wave-particle duality) that described reality well even if they were counterintuitive or novel.
I guess the modern equivalent that’s relevant to AI alignment would be Singular Learning Theory which proposes a novel theory to explain how deep learning generalizes.
hi John,
Let’s talk about a hypothetical alignment researcher who, (for no particular reason), we’ll call John. This researcher wants to critique the alignment field, but also needs a way to acknowledge his own work isn’t empirically grounded. Maybe he needs the acknowledgment as a philosophical fig leaf, or maybe he’s honestly concerned but not very good at seeing the plank in his own eye. Either way, he meets with two approaches to writing critiques—let’s call them (again for no particular reason) “careful charitable engagement” and “provocative dismissal.” Both can be effective, but they have very different impacts on community discourse. It turns out that careful engagement requires actually understanding what other researchers are doing, while provocative dismissal lets you write spicy posts from your armchair. Lo and behold, John endorses provocative dismissal as his Official Blogging Strategy, and continues writing critiques. (Actually the version which eventually emerges isn’t even fully researched, it’s a quite different version which just-so-happens to be even more dismissive of work he hasn’t closely followed.)
Glad to hear you enjoyed ILIAD.
Best,
AGO
I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (section Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high), and I do think that this generalizes somewhat to the whole field of alignment. But I’m highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:
Empirically, we’ve already kind of tested this, and it doesn’t work.
I don’t think that what Scott Aaronson produced while at OpenAI really helped AI Safety: he did exactly what is criticized in the post, streetlight research using techniques he was already familiar with from his previous field, and I don’t think the author of the OP would disagree with me. Maybe n=1, but it was one of the most promising shots.
Two years ago, I was doing field-building and trying to source talent, primarily selecting on pure intellect and raw IQ. I organized the Von Neumann Symposium around the problem of corrigibility, and I targeted IMO laureates and individuals from the best school in France, ENS Ulm, which arguably has the highest concentration of future Nobel laureates in the world. However, pure intelligence doesn’t work. In the long term, the individuals who succeeded in the field weren’t the valedictorians from France’s top school, but rather those who were motivated, had read The Sequences, were EA people, possessed good epistemology, and were willing to share their work online (maybe you’ll say that the people I was targeting were too young, but I think my little empirical experiment is already much better than the speculation in the OP).
My prediction is that if you put a group of skilled physicists in a room, first, it’s not even clear you would find that many motivated people in this reference class, and I don’t think the few who would be motivated would produce good-quality work.
For the ML4Good bootcamps, the scoring system reflects this insight: we use multiple indicators and don’t rely solely on pure IQ to select participants, because there is little correlation between pure high IQ and long-term quality production.
I believe the biggest mistake in the field is trying to solve “Alignment” rather than focusing on reducing catastrophic AI risks. Alignment is a confused paradigm; it’s a conflationary alliance term that has sedimented over the years. It’s often unclear what people mean when they talk about it: Safety isn’t safety without a social model.
Think about what has been most productive in reducing AI risks so far. My short list would be:
The proposed SB 1047 legislation.
The short statement on AI risks.
Frontier AI Safety Commitments, AI Seoul Summit 2024, to encourage labs to publish their responsible scaling policies.
Scary demonstrations to showcase toy models of deception, fake alignment, etc., and to create more scientific consensus, which is very much needed.
As a result, the field of “Risk Management” is more fundamental for reducing AI risks than “AI Alignment.” In my view, the theoretical parts of the alignment field have contributed far less to reducing existential risks than the responsible scaling policies or the draft of the EU AI Act’s Code of Practice for General Purpose AI Systems, which is currently not too far from being the state-of-the-art for AI risk management. Obviously, it’s still incomplete, but that’s the direction that is I think most productive today.
Relatedly, the Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety that we could imitate for AI safety (see this draft). However, the agent foundations community seems to be completely neglecting this.
I think this lens of incentives and the “flinching away” concept are extremely valuable for understanding the field of alignment (and, less importantly, everything else :).
I believe “flinching away” is the psychological tendency that creates bigger and more obvious-on-inspection “ugh fields”. I think this is the same underlying mechanism discussed as valence by Steve Byrnes. Motivated reasoning is the name for the resulting cognitive bias. Motivated reasoning overlaps by experimental definition with confirmation bias, the one bias destroying society in Scott Alexander’s terms. After studying cognitive biases through the lens of neuroscience for years, I think motivated reasoning is severely hampering progress in alignment, as it is in every other project. I have written about it a little in what is the most important cognitive bias to understand, but I want to address more thoroughly how it impacts alignment research.
This post makes a great start at addressing how that’s happening.
I very much agree with the analysis of incentives given here: they are strongly toward tangible and demonstrable progress in any direction vaguely related to the actual problem at hand.
This is a largely separate topic, but I happen to agree that we probably need more experienced thinkers. I disagree that physicists are obviously the best sort of experienced thinkers. I have been a physicist (as an undergrad) and I have watched physicists get into other fields. Their contributions are valuable but far from the final word and are far better when they inspire or collaborate with others with real knowledge of the target field.
There is much more to say on incentives and the field as a whole, but the remainder deserves more careful thought and separate posts.
This analysis of biases and “flinching away” could be applied to many other approaches than the prosaic alignment you target here. I think you’re correct to notice this about prosaic alignment, but it applies to many agent foundations approaches as well.
A relentless focus on the problem at hand, including its most difficult aspects, is absolutely crucial. Those difficult aspects include the theoretical concerns you link to up front, which prosaic alignment largely fails to address. But the difficult spots also include the inconvenient fact that the world is rushing toward building LLM-based or at least deep net based AGI very rapidly, and there are no good ideas about how to make them stop while we go look in a distant but more promising spot to find some keys. Most agent foundations work seems to flinch away from this aspect. Both broad schools largely flinch away from the social, political, and economic aspects of the problem.
We are a lens that can see its flaws, but we need to work to see them clearly. This difficult self-critique of locating our flinches and ugh fields is what we all as individuals, and the field as a collective, need to do to see clearly and speed up progress.
One particular way this issue could be ameliorated is by encouraging people to write up null or negative results. Part of your model here is that a null result doesn’t get reported, so other people don’t hear about failures while they do hear about success stories. That creates a selection effect toward working on successful programs: no one hears about failed attempts to tackle the problem, which is bad for research culture. And negative results not being shown is a universal problem across fields.
Definitely.
Lack of publicly reporting null results was a subtle but huge problem in cognitive neuroscience. It took a while to figure out just how much effort was being wasted running studies that others had already tried and not reported because results were null.
Alignment doesn’t have the same journal gatekeeping system that filters out null results, but there’s probably a pretty strong tendency to report less on lack of progress than actual progress.
So post about it if you worked hard at something and got nowhere. This is valuable information when others choose their problems and approaches.
I do see people doing this; it would probably be valuable if we did it more.
I almost want to say that it sounds like we should recruit people from the same demographic as good startup founders. Almost.
Per @aysja’s list, we want creative people with an unusually good ability to keep themselves on-track, who can fluently reason at several levels of abstraction, and who don’t believe in the EMH. This fits pretty well with the stereotype of a successful technical startup founder – an independent vision, an ability to think technically and translate that technical vision into a product customers would want (i.e., develop novel theory and carry it across the theory-practice gap), high resilience in the face of adversity, high agency, willingness to believe you can spot an exploitable pattern where no-one did, etc.
… Or, at least, that is the stereotype of a successful startup founder from Paul Graham’s essays. I expect that this idealized image diverges from reality in quite a few ways. (I haven’t been following Silicon Valley a lot, but from what I’ve seen, I’ve not been impressed with all the LLM and LLM-wrapper startups. Which made me develop quite a dim image of what a median startup actually looks like.)
Still, when picking whom to recruit, it might be useful to adopt some of the heuristics Y Combinator/Paul Graham (claim to) employ when picking which startup-founder candidates to support?
(Connor Leahy also makes a similar point here: that pursuing some ambitious non-templated vision in the real world is a good way to learn lessons that may double as insights regarding thorny philosophical problems.)
At least from the MATS perspective, this seems quite wrong. Only ~20% of MATS scholars in the last ~4 programs have been undergrads. In the most recent application round, the dominant sources of applicants were, in order, personal recommendations, X posts, AI Safety Fundamentals courses, LessWrong, 80,000 Hours, then AI safety student groups. About half of accepted scholars tend to be students and the other half are working professionals.
Putting venues aside, I’d like to build software (e.g., AI-aided tools) to make it easier for the physics postdocs to onboard to the field and focus on the ‘core problems’ in ways that prevent recoil as much as possible. One worry I have with ‘automated alignment’-type things is that it similarly succumbs to the streetlight effect, due to models and researchers having biases towards the types of problems you mention. By default, the models will also likely just be much better at prosaic-style safety than they will be at the ‘core problems’. I would like to instead design software that makes it easier to direct their cognitive labour towards the core problems.
I have many thoughts/ideas about this, but I was wondering if anything comes to mind for you beyond ‘dedicated venues’ and maybe writing about it.
If you wanted to create such a community, you could try spinning up a Discord server?
I’m not saying that this would necessarily be a step in the wrong direction, but I don’t think a Discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.
If moderating the server takes up a few hours of John’s time per week the opportunity cost probably isn’t worth it.
Maybe someone else could moderate it?
A few thoughts.
Have you checked what happens when you throw physics postdocs at the core issues—do they actually get traction, or just stare at the sheer cliff for longer while thinking? Did anything come out of the ILIAD meeting half a year later? Is there a reason that more standard STEM people aren’t given an intro to some of the routes currently thought possibly workable, so they can feel some traction? I think any of these could be true: that intelligence and skills aren’t actually useful right now, that the problem is not tractable, or that better onboarding could let the current talent pool get traction—and either way it might not be very cost-effective to get physics postdocs involved.
Humans are generally better at doing things when they have more tools available. While the ‘hard bits’ might be intractable now, they could well be easier to deal with in a few years after other technical and conceptual advances in AI, and even other fields. (Something something about prompt engineering and Anthropic’s mechanistic interpretability from inside the field and practical quantum computing outside).
This would mean squeezing every drop of usefulness out of AI at each level of capability, to improve general understanding and to leverage it into breakthroughs in other fields before capabilities increase further. In fact, it might be best to sabotage semiconductor/chip production once models are one generation away from superintelligence/extinction/whatever, giving maximum time to leverage maximum capabilities and tackle alignment before the AIs get too smart.
How close is mechanistic interpretability to the hard problems, and what makes it not good enough?
Mathematics?
High variance. A lot of mathematics programs allow one to specialize in fairly narrow subjects IIUC, which does not convey a lot of general technical skill. I’m sure there are some physics programs which are relatively narrow, but my impression is that physics programs typically force one to cover a pretty wide volume of material.
I think there is an obvious signal that could be used: a forecast of how much MIRI will like the research when asked in 5 years. (Note that I don’t mean just asking MIRI now, but rather something like prediction markets or super-forecasters to predict what MIRI will say 5 years from now.)
Basically, if the forecast is above average, anyone who trusts MIRI should fund them.
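To make the proposed rule concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the project names, the forecasted scores, and the above-average funding threshold are all stand-ins for whatever a real prediction market or super-forecaster panel would output.

```python
# Toy sketch of the proposed funding rule: each project has a forecasted
# score for "how much will MIRI like this research in 5 years?", and a
# funder who trusts MIRI funds exactly the above-average projects.
# All names and numbers below are made up for illustration.

def fund_decisions(forecasts):
    """forecasts: dict mapping project name -> forecasted score in [0, 1].
    Returns dict mapping project name -> whether to fund it."""
    avg = sum(forecasts.values()) / len(forecasts)
    return {name: score > avg for name, score in forecasts.items()}

forecasts = {
    "hypothetical-agent-foundations-project": 0.70,
    "hypothetical-prosaic-benchmark-project": 0.40,
    "hypothetical-interp-project": 0.55,
}
decisions = fund_decisions(forecasts)
```

Of course, the hard part the comment points at isn’t this arithmetic but eliciting the forecasts themselves, five years ahead of the resolution.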
Yeah, it does seem unfortunate that there’s not a robust pipeline for tackling the “hard problem” (even conditional on more “moderate” models of x-risk).
But (conditional on “moderate” models) there’s still a lot of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like, it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the “hard problem of alignment”.
I’ve always been sympathetic to the drunk in this story. If the key is in the light, there is a chance of finding it. If it is in the dark, he’s not going to find it anyway so there isn’t much point in looking there.
Given the current state of alignment research, I think it’s fair to say that we don’t know where the answer will come from. I support The Plan and I hope research continues on it. But if I had to guess, alignment will not be solved by getting a bunch of physicists thinking about agent foundations. It will be solved by someone who doesn’t know better making a discovery that “wasn’t supposed to work”.
On an interesting side note, here’s a fun story about experts repeatedly failing to make an obvious-in-hindsight discovery because they “knew better”.
A different way to think about types of work is within current ML paradigms vs outside of them. If you believe that timelines are short (e.g. 5 years or less), it makes much more sense to work within current paradigms, otherwise there’s very little chance your work will become adopted in time to matter. Mainstream AI, with all of its momentum, is not going to adopt a new paradigm overnight.
If I understand you correctly, there’s a close (but not exact) correspondence between work I’d label in-paradigm and work you’d label as “streetlighting”. On my model the best reason to work in-paradigm is because that’s where your work has a realistic chance to make a difference in this world.
So I think it’s actually good to have a portfolio of projects (maybe not unlike the current mix), from moonshots to very prosaic approaches.