My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.
Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.
As neural networks saw increasing success at a wide variety of problems in the mid-2010s, it became apparent that the analogies and assumptions behind early AI x-risk cases didn’t apply to them. The process of developing an ML model isn’t very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can’t be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.
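(As a purely illustrative aside on “internals that can be probed and manipulated”: with an ordinary deep learning framework you can both read and overwrite intermediate activations of a toy model. The sketch below assumes PyTorch is available; the model and numbers are made up and stand in for nothing discussed in this thread.)

```python
import torch
import torch.nn as nn

# Toy illustration: probe a hidden activation with a forward hook, then edit it.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

captured = {}

def probe(module, inputs, output):
    captured["hidden"] = output.detach().clone()  # probing: read the internal state
    return output * 0.5                           # manipulating: returned value replaces the activation

handle = model[1].register_forward_hook(probe)    # hook the ReLU layer
x = torch.randn(1, 4)
logits = model(x)
print(captured["hidden"].shape, logits.shape)     # (1, 8) and (1, 2)
handle.remove()
```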
But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It’s telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn’t make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.
Why have ungrounded agent foundations assumptions stuck around for so long? A couple of factors are likely at work:
Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn’t. There’s much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted.
Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.
Given that you speak with such great confidence that historical arguments for AI x-risk were not grounded, can you give me any “grounded” predictions about what superintelligent systems will do (which I think we both agree is ultimately what will determine the fate of the world and the universe)?
If you make some concrete predictions, then we can start arguing about their validity. As it stands, I find this kind of “mightier than thou” attitude unproductive, where people keep making ill-defined statements like “these things are theoretical and don’t apply” without actually providing any answers to the crucial questions.
Indeed, not only that, I am confident that if you were to try to predict what will happen with superintelligence, you would very quickly start drawing on the obvious analogies to optimizers, Dutch book arguments, evolution, and Goodhart’s law, because we really don’t have anything better.
Some concrete predictions:
The behavior of the ASI will be a collection of heuristics that are activated in different contexts.
The ASI’s software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule.
The ASI will not wirehead.
The ASI’s world-model won’t have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.
I… am not very impressed by these predictions.
First, I don’t think these are controversial predictions on LW (yes, a few people might disagree with you, but there is little boldness or disagreement with widely held beliefs here). But most importantly, these predictions aren’t about anything I care about. I don’t care whether the world-model will have a single unambiguous self-versus-world boundary; I care whether the system is likely to convert the solar system into some form of computronium, or launch Dyson probes, or eliminate all potential threats and enemies, or whether the system will try to subvert attempts at controlling it, or whether it will try to amass large amounts of resources to achieve its aims, or be capable of causing large controlled effects via small information channels, or whether it is capable of discovering new technologies with great offensive power.
The only bold prediction here is maybe “the behavior of the ASI will be a collection of heuristics”, and indeed I would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it’s likely ASI will both self-reflect and self-improve (as we are trying really hard to cause both to happen). Indeed, I already wouldn’t particularly describe Claude as a collection of contextual heuristics; there is really quite a lot of consistent personality in there (which, of course, you can break with jailbreaks and such, but clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?).
The trend may be bounded, or it may not go far by the time AI can invent nanotechnology; it would be great if someone actually measured such things.
And there being a trend at all is not predicted by the utility-maximization frame, right?
“heuristics activated in different contexts” is a very broad prediction. If “heuristics” include reasoning heuristics, then this probably includes highly goal-oriented agents like Hitler.
Also, some heuristics will be more powerful and/or more goal-directed, and those might try to preserve themselves (or sufficiently similar processes) more so than the shallow heuristics. Thus, I think eventually, it is plausible that a superintelligence looks increasingly like a goal-maximizer.
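(To make the distinction being argued over concrete, here is a purely illustrative toy sketch, not drawn from anything in this thread, contrasting a policy that is literally a collection of context-triggered heuristics with a policy that maximizes a single explicit utility function. All names and values are made up.)

```python
# Toy contrast between "a collection of heuristics activated in different
# contexts" and an explicit expected-utility maximizer.

def heuristic_policy(context: str) -> str:
    """No global objective: just pattern-match the context to a canned response."""
    heuristics = {
        "user asks question": "answer helpfully",
        "low on resources": "request more compute",
    }
    return heuristics.get(context, "fall back to cautious default")

def utility_maximizing_policy(context: str, actions, predict_outcome, utility) -> str:
    """One fixed utility function, applied to the predicted outcome of every action."""
    return max(actions, key=lambda action: utility(predict_outcome(context, action)))

# Example usage with made-up outcome and utility models:
actions = ["answer helpfully", "request more compute", "fall back to cautious default"]
predict_outcome = lambda ctx, a: {"answer helpfully": "task done",
                                  "request more compute": "more resources"}.get(a, "nothing")
utility = lambda outcome: {"task done": 1.0, "more resources": 0.7, "nothing": 0.0}[outcome]

print(heuristic_policy("user asks question"))
print(utility_maximizing_policy("user asks question", actions, predict_outcome, utility))
```

The disagreement above is roughly about whether a system that reflects on and improves itself stays shaped like the first function or comes to behave more like the second.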
“For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.”
It’s one of the most standard results in ML that neural nets are universal function approximators. In the context of that proof, ML de facto also assumes that you have infinite computing power. It’s just a standard tool in ML, AI, or CS to see what models predict when you take them to infinity. Indeed, it’s really one of the most standard tools in the modern math toolbox, used by every STEM discipline I can think of.
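(For reference, the result referred to here is the classical universal approximation theorem, e.g. Cybenko 1989 and Hornik 1991. In one standard form, for a compact set $K \subset \mathbb{R}^n$ and a suitable activation $\sigma$ such as a sigmoid:)

$$\forall f \in C(K),\ \forall \varepsilon > 0,\ \exists N,\ \{\alpha_i, w_i, b_i\}_{i=1}^{N}:\quad \sup_{x \in K}\left| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon$$

Note that the theorem only asserts that some finite width $N$ exists and puts no bound on it, which is exactly the “take the model to infinity” style of idealization being defended here.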
Similarly, separating the boundary between internal decision processes and the outside world continues to be a standard assumption in ML. It’s really hard to avoid: everything gets very loopy and tricky, and yes, we have to deal with that loopiness and trickiness. But if anything, agent foundations people were the actual people trying to figure out how to handle that loopiness and trickiness, whereas the ML community really has done very little to handle it. Contrary to your statement here, people on LW have for years been pointing out how important embedded agency is, and have been dismissed by active practitioners because they think the Cartesian boundary is just fine for “real” and “grounded” applications like “predicting the next token”, which clearly don’t have relevance to these weird and crazy scenarios about power-seeking AIs developing contextual awareness that you are talking about.
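(For concreteness, the Cartesian-boundary assumption under discussion is the one baked into the standard agent/environment interface in ML and RL. A minimal self-contained sketch, generic code rather than any particular library’s API, looks like this:)

```python
# Minimal sketch of the standard agent/environment interface. The hard
# separation between "agent internals" and "outside world" is the step()
# membrane: the agent's own weights, memory, and hardware are assumed to
# live outside the environment's state.

class Environment:
    def __init__(self):
        self.state = 0

    def step(self, action: int):
        self.state += action                  # the world changes...
        observation, reward = self.state, -abs(self.state)
        return observation, reward            # ...but never touches the agent's internals

class Agent:
    def act(self, observation: int) -> int:
        return -1 if observation > 0 else 1   # decision process, sealed off from the world

env, agent = Environment(), Agent()
obs, reward = env.step(0)
for _ in range(5):
    obs, reward = env.step(agent.act(obs))
```

Embedded agency asks what happens when that membrane is fake: when the environment’s state includes the agent’s own parameters and hardware, so that a step of the world can overwrite the very decision process that called it.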
You do realize that by “alignment”, the OP (John) is not talking about techniques that prevent an AI less generally capable than a competent person from insulting the user or expressing racist sentiments?
We seek a methodology for constructing an AI that either ensures that the AI turns out not to be able to easily outsmart us, or (if it does turn out to be able to easily outsmart us) ensures, or at least makes it likely, that it won’t kill us all or do some other terrible thing. (The former is not researched much compared to the latter, but I felt the need to include it for completeness.)
The way it is now, it is not even clear whether you and the OP (John) are talking about the same thing (because “alignment” has come to have a broad meaning).
If you want to continue the conversation, it would help to know whether you see a pressing need for a methodology of the type I describe above. (Many AI researchers do not: they think that outcomes like human extinction are quite unlikely or at least easy to avoid.)