What’s next for the field of Agent Foundations?
Alexander, Matt and I want to chat about the field of Agent Foundations (AF), where it’s at and how to strengthen and grow it going forward.
We will kick off by each of us making a first message outlining some of our key beliefs and open questions at the moment. Rather than giving a comprehensive take, the idea is to pick out 1-3 things we each care about/think are important, and/or that we are confused about/would like to discuss. We may respond to some subset of the following prompts:
Where is the field of AF at in your view? How do you see the role of AF in the larger alignment landscape/with respect to making AI futures go well? Where would you like to see it go? What do you see as some of the key bottlenecks for getting there? What are some ideas you have about how we might overcome them?
Before we launch in properly, just a few things that seem worth clarifying:
By Agent Foundations, we mean roughly speaking conceptual and formal work towards understanding the foundations of agency, intelligent behavior and alignment. In particular, we mean something broader than what one might call “old-school MIRI-type Agent Foundations”, typically informed by fields such as decision theory and logic.
We will not specifically be discussing the value or theory of change behind Agent Foundations research in general. We think these are important conversations to have, but in this specific dialogue, our goal is a different one, namely: assuming AF is valuable, how can we strengthen the field?
Should it look more like a normal research field?
The main question I’m interested in about agent foundations at the moment is whether it should continue in its idiosyncratic current form, or whether it should start to look more like an ordinary academic field.
I’m also interested in discussing theories of change, to the extent it has bearing on the other question.
Why agent foundations?
My own reasoning for foundational work on agency being a potentially fruitful direction for alignment research is:
Most misalignment threat models are about agents pursuing goals that we’d prefer they didn’t pursue (I think this is not controversial)
Existing formalisms about agency don’t seem all that useful for understanding or avoiding those threats (again probably not that controversial)
Developing new and more useful ones seems tractable (this is probably more controversial)
The main reason I think it might be tractable is that so far not that many person-hours have gone into trying to do it. A priori it seems like the sort of thing you can get a nice mathematical formalism for, and so far I don’t think that we’ve collected much evidence that you can’t.
So I think I’d like to get a large number of people with various different areas of expertise thinking about it, and I’d hope that some small fraction of them discovered something fundamentally important. And a key question is whether the way the field currently works is conducive to that.
Does it need a new name?
Does Agent Foundations-in-the-broad-sense need a new name?
Is the name ‘Agent Foundations’ cursed?
Suggestions I’ve heard are
‘What are minds’, ‘What are agents’, ‘Mathematical Alignment’, ‘Agent Mechanics’.
Epistemic Pluralism and Path to Impact
Some thought snippets:
(1) Clarifying and creating common knowledge about the scope of Agent Foundations and strengthening epistemic pluralism
I think it’s important for the endeavor of meaningfully improving our understanding of such fundamental phenomena as agency, intelligent behavior, etc. that we have a relatively pluralistic portfolio of angles on them. The world is very detailed, and phenomena like agency/intelligent behavior/etc. seem like particularly “messy”/detailed phenomena. Insofar as every scientific approach necessarily abstracts away a bunch of detail, and we don’t a priori know which bits of reality are fine to abstract away from in which contexts, having plural perspectives on the same phenomena is a productive way to “triangulate” the phenomena we care about.
This is why I am pretty keen on having a scope of AF that includes but is not limited to “old-school MIRI-type AF”. As I see it, the field has already made a good start at producing a larger plurality of perspectives, which is exciting to me. I am further in favour of
creating more common knowledge about the scope of AF—I want relative breadth in terms of methodologies, bodies of knowledge, epistemic practices and underlying assumptions, and relative narrowness in terms of the leading questions/epistemic aims of the field.
increasing the pluralism further—I think there are some fairly obviously interesting angles, fields and knowledge bases to bring to bear on the questions of AF, and to integrate into the current conversations in AF and alignment.
working on creating and maintaining surface area between these plural approaches—“triangulation” as described above can only really happen when different perspectives interface and communicate, and as such we need places & interfaces where/through which this can happen.
(2) Where does AF sit on the “path to impact”?
At a high level, I think it’s useful to ask: what are the (epistemic) inputs that need to feed into AF? What are the epistemic outputs we want to come out of AF, and where do we want them to feed into, such that at the end of this chain we get to something like “safe and aligned AI systems” or similar?
With respect to this, I’m particularly excited for AF to have tight interfaces/iteration loops with more applied aspects of AI alignment work (e.g. interpretability, evals, alignment proposals).
(3) possible prompt: if you had 2 capable FTE and 500,000 USD for AF field building, what would you do?
…suffering from a lack of time, I will stop here for now.
Pockets of Deep Expertise
One of my favorite blogposts is Schubert’s “Against Cluelessness”, introducing ‘Pockets of Predictability’:
(…) intuitions about low-variance predictability long held back scientific and technological progress. Much of the world was once unknowable to humans, and people may have generalised from that, thinking that systematic study wouldn’t pay off. But in fact knowability varied widely: there were pockets of knowability or predictability that people could understand even with the tools of the day (e.g. naturally simple systems like the planetary movements, or artificially simple systems like low-friction planes). Via these pockets of knowability, we could gradually expand our knowledge—and thus the world was more knowable than it seemed. As Ernest Gellner points out, the Scientific and Industrial Revolutions largely consisted in the realisation that the world is surprisingly knowable:
“the generic or second-order discovery that successful systematic investigation of Nature, and the application of the findings for the purpose of increased output, are feasible, and, once initiated, not too difficult.”
I really like this way of thinking about the possibility of knowledge and development of science. I see a very similar ‘predictability skepticism’ across the field of Alignment.
This predictability skepticism is reflected in the Indefinite Optimism of lab-based alignment groups and the Indefinite Pessimism of doomers.
I want to introduce the idea of ‘Pockets of Deep Expertise’. That is—I think much of scientific progress is made by small groups of people, mostly opaque from the outside (‘pockets’), building up highly specific knowledge over fairly long time stretches (‘deep expertise’).
These pockets:
are often highly opaque & illegible from the outside.
make progress that is often partial & illegible. A pocket may have solved subproblems A, B, and C of Question X, but for some reason their methods have not yet been able to solve D. This prevents them from completely answering Question X or building technology Y.
make progress over long time periods.
contain many False Prophets. Not everybody claiming (deep) expertise is actually doing valuable things. Some are outright frauds; others are simply barking up the wrong tree.
As a conservative estimate, 90-95% of (STEM) academia is doing work that is ‘predictably irrelevant’, p-hacked, and/or bad in various other ways.
So most of academia is indeed not doing useful work. But some pockets are. The variance across pockets is huge.
For the purpose of technical alignment, we need to think like a VC:
bet on a broad range of highly specific bets
To my mind, we are currently employing only a tiny fraction of the world’s scientific talent.
Although Alignment now attracts a very large group of promising young people, much of their energy and talent is being wasted on reinventing the wheel.
How to Get a Range of Bets
Everyone has mentioned something along the lines of wanting to get a broad range of specific bets or types of people. We could take that as read and discuss how to do it?
(Although if we are going to talk about how we want the field to look, that probably most naturally comes first)
Ok, great. Let’s take stock quickly.
I think we are all interested in some version of “bet on a broad/plural range of highly specific bets”. Maybe we should talk about that more at some point.
To help with the flow of this, it might be useful, however, to go a bit more concrete first. I suggest we take the following prompt:
if you had 2 capable FTE and 500,000 USD for AF field building, what would you do?
Reverse MATS
I’ll give the idea I was chatting about with Alexander yesterday as my first answer.
There are probably a large number of academics with expertise in a particular area which seems potentially useful for alignment, and who might be interested in doing alignment research. But they might not know that there’s a connection, or know anything about alignment. And unlike with junior researchers they’re not gonna attend some MATS-type programme to pick it up.
So the idea is “instead of senior alignment researchers helping onboard junior people to alignment research, how about junior alignment people help onboard senior researchers from other areas?” Anti-MATS.
EDIT: Renamed to Reverse MATS because people glancing at the sidebar thought someone in the dialogue was anti MATS. We are pro MATS!
We have a large pool of junior people who’ve read plenty about alignment, but don’t have mentorship. And there’s a large pool of experienced researchers in potentially relevant subjects who don’t know anything about alignment. So we send a junior alignment person to work as a research assistant or something with an experienced researcher in complexity science or active inference or information theory or somewhere else we think there might be a connection, and they look for one together and if they find it perhaps a new research agenda develops.
Yeah, I like this direction. I agree with the problem statement. “Junior helping senior person” is maybe helpful, but I’m unsure it’s the crux to getting this thing right. Here is what I think might be some cruxes/bottlenecks:
“Getting self-selection right”: how do ‘senior scholars’ find the ‘anti-MATS’ program, and what makes them decide to do it?
One thing I think you need here is to create a surface area to the sorts of questions that agent foundations for alignment is interested in, such that people with relevant expertise can grok those problems and see how their expertise is relevant to them.
For identifying more senior people, I think you need things like workshops, conferences and networks, rather than being able to rely on open applications.
I think you’d have to approach researchers individually to see if they’d like to be involved.
The most straightforward examples would be people who work in a pretty obviously related area or who are known to have some interest in alignment already (I think both were true in the case of Dan Murfet and SLT?) or who know some alignment people personally. My guess is this category is reasonably large.
Beyond that, if you have to make a cold pitch to someone about the relevance of alignment (in general and as a research problem for them) I think it’s a lot more difficult.
I don’t think, for example, there’s a good intro resource you can send somebody that makes a common-sense case for “basic research into agency could be useful for avoiding risks from powerful AI”, especially not one that has whatever hallmarks of legitimacy make it easy for an academic to justify basing a research project on it.
Yeah, cool. I guess another question is: once you’ve identified them, what do they need to succeed?
I’ve definitely also seen the failure mode where someone is focused only, or too much, on “the puzzles of agency” without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more or less relevant to alignment, so I think it’s important that there is a clear/strong enough “signal” from the target domain (here: AI risk/alignment) to guide the search/research directions.
Yes, I agree with this.
I wonder whether focusing on agency is not even the right angle for this, and ‘alignment theory’ is more relevant. Probably what would be most useful for those researchers would be to have the basic problems of alignment made clear to them, and if they think that focusing on agency is a good way to attack those problems given their expertise then they can do that, but if they don’t see that as a good angle they can pursue a different one.
I do think that having somebody who’s well-versed in the alignment literature around (i.e. the proposed mentee) is potentially very impactful. There’s a bunch of ideas that are very obvious to people in the alignment community because they’re talked about so often (e.g. the training signal is not necessarily the goal of the trained model) that might not be obvious to someone thinking from first principles. A busy person coming in from another area could just miss something, and end up creating a whole research vision which is brought down by a snag that would have been obvious to an inexperienced researcher who’s read a lot of LW.
seniorMATS—a care home for AI safety researchers in the twilight of their career
Yes, good surface area to the problem is important. I think there is a good deal of know-how around on this by now: from introductory materials, to people with experience running the sort of research retreats that provide good initial contact with the space, to (as you describe) individuals who could help/assist/facilitate along the way. Also worth asking what the role of a peer environment should/could be (e.g. an AF Discord type thing, and/or something a bit more high-bandwidth).
Also, finding good general “lines of attack” might be pretty useful here. For example, I have found Evan’s “model organism” frame to be pretty good/generative for getting AF-type work to be more productively oriented towards concrete/applied alignment work.
Alignment Noob Training—Inexperienced Mentees Actually Teach Seniors (ANTIMATS)
My model here puts less emphasis on “junior researchers mentoring up”, and more on “creating the right surface area for people with the relevant expertise” more generally; one way to do this may be junior researchers with more alignment exposure, but I don’t think that should be the central pillar.
The three things I am looking for in an academic (or nonacademic) researcher with scientific potential are:
1. alignment-pilled—important.
You don’t want them to run off doing capability work. There is an almost just as pernicious failure mode where people say they care about ‘alignment’ but they don’t really. Often these are variants where alignment and safety become vague buzzwords that get co-opted for whatever their hobbyhorse was.
2. belief in ‘theory’ - they think alignment is a deep technical problem and believe that we will need scientific & conceptual progress. Experiments are important but pure empirics is not sufficient to guarantee safety. Many people conclude (perhaps rightly so!) that technical alignment is too difficult and governance is the answer.
3. swallowed the bitter lesson - unfortunately, there are still researchers who do not accept that LLMs are here to stay. These are especially common, perhaps surprisingly, in AI and ML departments: Gary Marcus adherents in various guises. More generally, there is a failure mode of disinterest in deep learning practice.
“creating the right surface area for people with the relevant expertise”
That seems right. Creating a peer network for more senior people coming into the field from other areas seems like it could be similarly impactful.
Appealing to Researchers
You don’t convince academics with money. You convince them with ideas. Academics are mental specialists. They have honed very specific mental skills over many years. To convince them to work on something, you have to convince them that 1. the problem is tractable, 2. fruitful & interesting, and most importantly 3. vulnerable to the specific methods that this academic researcher has in their toolkit.
Another idea that Matt suggested was a BlueDot-style “Agent Foundations-in-the-broad-sense” course.
Euclidean Geometry rant
The impact of Euclidean Geometry on Western intellectual thought has been immense. But it is slightly surprising: Euclid’s geometry has approximately no application. Here I mean Euclid’s geometry as in the proof-based informal formal system of Euclidean geometry as put forward in Euclid’s Elements.
It is quite interesting how the impact actually worked. Many thinkers cite Euclidean geometry as decisive for their thinking—Descartes, Newton, Benjamin Franklin, Kant, to name just a few. I think the reason is that it formed the ‘model organism’ of what conceptual, theoretical progress could look like: the notion of proof (which is interestingly unique to the Western mathematical tradition, despite e.g. 15th-century Kerala, India discovering Taylor series before Newton), the notion of true certainty, the notion of modelling and idealizations, the idea of stacking many lemmas, etc.
I think this kind of ‘successful conceptual/theoretical progress’ is highly important in inspiring people, both historically and currently.
I think the purpose of such an AF course would be to show academic researchers that there is real intellectual substance to conceptual Alignment work.
[at this point we ran out of our time box and decided to stop]
I disagree—I think that we need more people on the margin who are puzzling about agency, relative to those who are backchaining from a particular goal in alignment. Like you say elsewhere, we don’t yet know what abstractions make sense here; without knowing what the basic concepts of “agency” are it seems harmful to me to rely too much on top-down approaches, i.e., ones that assume something of an end goal.
In part that’s because I think we need higher variance conceptual bets here, and I think that over-emphasizing particular problems in alignment risks correlating people’s minds. In part it’s because I suspect that there are surprising, empirical things left to learn about agency that we’ll miss if we prefigure the problem space too much.
But also: many great scientific achievements have been preceded by bottom-up work (e.g., Shannon, Darwin, Faraday), and afaict their open-ended, curious explorations are what laid the groundwork for their later theories. I feel that it is a real mistake to hold all work to the same standards of legible feedback loops/backchained reasoning/clear path to impact/etc, given that so many great scientists did not follow this. Certainly, once we have a bit more of a foundation this sort of thing seems good to me (and good to do in abundance). But I think before we know what we’re even talking about, over-emphasizing narrow, concrete problems risks the wrong kind of conceptual research—the kind of “predictably irrelevant” work that Alexander gestures towards.
The title of this dialogue promised a lot, but I’m honestly a bit disappointed by the content. It feels like the authors are discussing exactly how to run particular mentorship programs and structure grants and how research works in full generality, while no one is actually looking at the technical problems. All field-building efforts must depend on the importance and tractability of technical problems, and this is just as true when the field is still developing a paradigm. I think a paradigm is established only when researchers with many viewpoints build a sense of which problems are important, then try many approaches until one successfully solves many such problems, thus proving the value of said approach. Wanting to find new researchers to have totally new takes and start totally new illegible research agendas is a level of helplessness that I think is unwarranted—how can one be interested in AF without some view on what problems are interesting?
I would be excited about a dialogue that goes like this, though the format need not be rigid:
What are the most important [1] problems in agent foundations, with as much specificity as possible?
Responses could include things like:
A sound notion of “goals with limited scope”: can’t nail down precise desiderata now, but humans have these all the time, we don’t know what they are, and they could be useful in corrigibility or impact measures.
Finding a mathematical model for agents that satisfies properties of logical inductors but also various other desiderata
Further study of corrigibility and capability of agents with incomplete preferences
Participants discuss how much each problem scratches their itch of curiosity about what agents are.
What techniques have shown promise in solving these and other important problems?
Does [infra-Bayes, Demski’s frames on embedded agents, some informal ‘shard theory’ thing, …] have a good success to complexity ratio?
probably none of them do?
What problems would benefit the most from people with [ML, neuroscience, category theory, …] expertise?
[1]: (in the Hamming sense that includes tractability)
You may be positively surprised to know I agree with you. :)
For context, the dialogue feature just came out on LW. We gave it a try and this was the result. I think we mostly concluded that the dialogue feature wasn’t quite worth the effort. Anyway
I like what you’re suggesting and would be open to doing a dialogue about it!
I’d like to gain clarity on what we think the relationship should be between AI alignment and agent foundations. To me, the relationship is 1) historical, in that the people bringing about the field of agent foundations are coming from the AI alignment community and 2) motivational, in that the reason they’re investigating agent foundations is to make progress on AI alignment, but not 3) technical, in that I think agent foundations should not be about directly answering questions of how to make the development of AI beneficial to humanity. I think it makes more sense to pursue agent foundations as a quest to understand the nature of agents as a technical concept in its own right.
If you are a climate scientist, then you are very likely in the field in order to help humanity reduce the harms from climate change. But on a day-to-day basis, the thing you are doing is trying to understand the underlying patterns and behavior of the climate as a physical system. It would be unnatural to e.g. exclude papers from climate science journals on the grounds of not being clearly applicable to reducing climate change.
For agent foundations, I think some of the core questions revolve around things like: how does having goals work? How stable are goals? How retargetable are goals? Can we make systems that optimize strongly but within certain limitations? But none of those questions are directly about aligning the goals with humanity.
There’s also another group of questions like: what are humans’ goals? How can we tell? How complex and fragile are they? How can we get an AI system to imitate a human? Et cetera. But I think these questions come from a field that is not agent foundations.
There should certainly be constant and heavy communication between these fields. And I also think that even individual people should be thinking about the applicability questions. But they’re somewhat separate loops. A climate scientist will have an outer loop that does things like, chooses a research problem because they think the answer might help reduce climate change, and they should keep checking on that belief as they perform their research. But while they’re doing their research, I think they should generally be using an inner loop that just thinks, “huh, how does this funny ‘climate’ thing work?”
This is somewhat unsurprising given human psychology.
- Scaling up LLMs killed a lot of research agendas inside ML, particularly in NLP. Imagine your whole research career was built on improving benchmarks on some NLP problem using various clever ideas. Now the whole thing is better solved by a three-sentence prompt to GPT-4, and everything everyone in the subfield worked on is irrelevant for all practical purposes… how do you feel? In love with scaled LLMs?
- Overall, what people often like about research is coming up with smart ideas, and there is some aesthetic that goes into it. What’s traditionally not part of the aesthetic is ‘and you also need to get $100M in compute’, and it’s reasonable to model a lot of people as having a part which hates this.
Kinda like mathematicians hated it when the four color theorem was solved by a computer brute-forcing thousands of options. Only imagine that the same thing happens to hundreds of important mathematical problems—the proper way to solve them becomes to reduce them to a huge but finite number of cases, then throw lots of money at a computer which will handle these cases one by one, producing a “proof” that no human will ever be able to verify directly.
My talk for the alignment workshop at the ALIFE conference this past summer was roughly what I think you want. Unfortunately I don’t think it was recorded. Slides are here, but they don’t really do it on their own.
FWIW I also think the “Key Phenomena of AI risk” reading curriculum (h/t TJ) does some of this, at least indirectly (it doesn’t set out to directly answer this question, but I think a lot of the answers to the question are contained in the curriculum).
(Edit: fixed link)
How confident are you about it not having been recorded? If not very, it seems probably worth checking again.
The workshop talks from the previous year’s ALIFE conference (2022) seem to be published on YouTube, so I’m following up with whether John’s talk from this year’s conference can be released as well.
The video of John’s talk has now been uploaded on YouTube here.
I mean, I could always re-present it and record if there’s demand for that.
… or we could do this the fun way: powerpoint karaoke. I.e. you make up the talk and record it, using those slides. I bet Alexander could give a really great one.
I have no doubt Alexander would shine!
Happy to run a PIBBSS speaker event for this, record it and make it publicly available. Let me know if you’re keen and we’ll reach out to find a time.
To follow up on this, we’ll be hosting John’s talk on Dec 12th, 9:30AM Pacific / 6:30PM CET.
Join through this Zoom Link.
Title: AI would be a lot less alarming if we understood agents
Description: In this talk, John will discuss why and how fundamental questions about agency—as they are asked, among others, by scholars in biology, artificial life, systems theory, etc.—are important to making progress in AI alignment. John gave a similar talk at the annual ALIFE conference in 2023, as an attempt to nerd-snipe researchers studying agency in a biological context.
--
To be informed about future Speaker Series events, subscribe to our SS Mailing List here. You can also add the PIBBSS Speaker Events to your calendar through this link.
FYI this link redirects to a UC Berkeley login page.
Maybe an even better analogy is non-Euclidean geometry. Agent foundations is studying a strange alternate world where agents know the source code to themselves and the universe, where perfect predictors exist and so on. It’s not an abstraction of our world, but something quite different. But surprisingly it turns out that many aspects of decision-making in our world have counterparts in the alternate world, and in doing so we shed a strange light on what decision-making in our world actually means.
I’m not even sure these investigations should be tied to AI risk (though that’s very important too). To me the other world offers mathematical and philosophical interest on its own, and frankly I’m curious where these investigations will lead (and have contributed to them where I could).
Modelling always requires idealisation. Currently, in many respects the formal models that Agent Foundations uses to capture the informal notions of agency, intention, goal, etc. are highly idealised. This is not an intrinsic feature of Agent Foundations or mathematical modelling, just a reflection of the inadequate mathematical and conceptual state of the world.
By analogy—intro to Newtonian Mechanics begins with frictionless surfaces and the highly simple orbits of planetary systems. That doesn’t mean that Newtonian Mechanics in more sophisticated forms cannot be applied to the real world.
One can get lost in the ethereal beauty of ideal worlds. That should not detract from the ultimate aim of mathematical modelling of the real world.
I just want to flag that this is very much not a defining characteristic of agent foundations! Some work in agent foundations will make assumptions like this, some won’t—I consider it a major goal of agent foundations to come up with theories that do not rely on assumptions like this.
(Or maybe you just meant those as examples?)
I would love this and take this myself, fwiw. (Even if I didn’t get in, I’d still make “working through such a course’s syllabus” one of my main activities in the short term.)
FWIW I saw “Anti-MATS” in the sidebar and totally assumed that meant that someone in the dialogue was arguing that the MATS program was bad (instead of discussing the idea of a program that was like MATS but opposite).
Same. My friend Bob suggests “co-MATS”
“Reverse MATS”?
(I think I agree that “co-MATS” is in some sense a more accurate description of what’s going on, but Reverse MATS feels like it gets the idea across better at first glance)
Oops, thanks, I’ve changed it to Reverse MATS to avoid confusion.