Thanks for the article. For what it’s worth, here’s the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I’m a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I’d be interested if you disagree with this framing.
We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse a couple of waves after that. We don’t know how many waves there are, but it is reasonable to expect that beyond the early waves our intuitions probably aren’t worth much.
That’s not a surprise! As physics probed smaller scales, at some point our intuitions stopped being worth anything, and we switched to relying heavily on abstract mathematics (which became a source of different, more hard-won intuitions). Similarly, we can expect that as we scale up our learning machines, we will enter a regime where current intuitions fail to be useful. At the same time, the systems may be coming closer to idealized optimal agents, and theories like Agent Foundations start to provide a very useful framework for reasoning about the nature of the alignment problem.
So in short I think of Agent Foundations as like quantum mechanics: a bit strange perhaps, but when push comes to shove, one of the few sources of intuition we have about waves 4, 5, 6, … of the alignment problem. It would be foolish to bet everything on solving waves 1, 2, 3 and then be empty handed when wave 4 arrives.
It’s an interesting framing, Dan. Agent foundations for quantum superintelligence. To me, the motivation for Agent Foundations mostly comes from different considerations. Let me explain.
To my mind, agent foundations is not primarily about some mysterious future quantum superintelligences (though hopefully it will help us when they arrive!), but about real agents in this world, today.
That means humans and animals, but also many systems that are agentic to a degree: markets, large organizations, large language models, and so on. One could call these pseudo-agents, pre-agents, or egregores, but at the moment there is no accepted terminology for not-quite-agents, which may contribute to the persistent confusion that agent foundations is only concerned with expected utility maximizers.
The reason that research in Agent Foundations has so far mostly restricted itself to highly idealized ‘optimal’ agents is primarily mathematical tractability. Focusing on highly idealized agents also makes sense from the point of view where we are focused on ‘reflectively stable’ agents, i.e. we’d like to know what agents converge to upon reflection. But primarily, the reason we don’t much study more complicated, realistic models of real-life agents is that the mathematics simply isn’t there yet.
A different perspective on agent foundations is that of deconfusion: we are at present confused about many of the key concepts involved in aligning future superintelligent agents. We need to be less confused.
Another point of view on the importance of Agent Foundations: ultimately, it is inevitable that humanity will delegate more and more power to AIs. Ensuring the continued survival and flourishing of the human species is then less about interpretability and more about engineering reflectively stable, well-steered superintelligent systems. This is more about decision theory and (relatively) precise engineering, and less about the online neuroscience of mechanistic interpretability. Perhaps this is what you meant by the waves of AI alignment.