The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
The bitter lesson applies to alignment as well. Stop trying to think about “goal slots” whose circuit-level contents should be specified by the designers, or pining for a paradigm in which we program in a “utility function.” That isn’t how it works. See:
the failure of the agent foundations research agenda;
the failed searches for “simple” safe wishes;
the successful instillation of (hitherto-seemingly unattainable) corrigibility by instruction finetuning (no hardcoding!);
the (apparent) failure of the evolved modularity hypothesis.
Don’t forget that hypothesis’s impact on classic AI risk! Notice how the following speculations about “explicit adaptations” violate information inaccessibility and also the bitter lesson that “online learning and search are much more effective than hardcoded concepts and algorithms”:
From An Especially Elegant Evolutionary Psychology Experiment:
“Humans usually do notice sunk costs—this is presumably either an adaptation to prevent us from switching strategies too often (compensating for an overeager opportunity-noticer?) or an unfortunate spandrel of pain felt on wasting resources.”
“the parental grief adaptation”
“this selection pressure was not only great enough to fine-tune parental grief, but, in fact, carve it out of existence from scratch in the first place.”
“The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn’t spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.” (source)
“In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.” (source)
“When, today, you get into an argument about whether “we” ought to raise the minimum wage, you’re executing adaptations for an ancestral environment where being on the wrong side of the argument could get you killed.”
Much of classical alignment theory violates now-known lessons about the nature of effective intelligence. These bitter lessons were taught to us by deep learning.
The reasons I don’t find this convincing:
The examples of human cognition you point to are the dumbest parts of human cognition. They are the parts we need to override in order to pursue non-standard goals. For example, in political arguments, the adaptations that we execute that make us attached to one position are bad. They are harmful to our goal of implementing effective policy. People who are good at finding effective government policy are good at overriding these adaptations.
“All these are part of the arbitrary, intrinsically-complex, outside world.” This seems wrong. The outside world isn’t that complex, and reflections of it are similarly not that complex. Hardcoding knowledge is a mistake, of course, but understanding a knowledge representation and updating process needn’t be that hard.
“They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.” I agree with this, but it’s also fairly obvious. The difficulty of alignment is building these in such a way that you can predict that they will continue to work, despite the context changes that occur as an AI scales up to be much more intelligent.
“These bitter lessons were taught to us by deep learning.” It looks to me like deep learning just gave most people an excuse to not think very much about how the machine is working on the inside. It became tractable to build useful machines without understanding why they worked.
It sounds like you’re saying that classical alignment theory violates lessons like “we shouldn’t hardcode knowledge, it should instead be learned by very general methods”. This is clearly untrue, but if this isn’t what you meant then I don’t understand the purpose of the last quote. Maybe a more charitable interpretation is that you think the lesson is “intelligence is irreducibly complex and it’s impossible to understand why it works”. But this is contradicted by the first quote. The meta-methods are a part of a mind that can and should be understood. And this is exactly the topic that much of agent foundations research has been about (with a particular focus on the aspects that are relevant to maintaining stability through context changes).
(My impression was that this is also what shard theory is trying to do, except with less focus on stability through context changes, much less emphasis on fully-general outcome-directedness, and more focus on high-level steering-of-plans-during-execution instead of the more traditional precise-specification-of-outcomes).
Current systems don’t have a goal slot, but neither are they agentic enough to be really useful. An explicit goal slot is highly useful when carrying out complex tasks that have subgoals. Humans definitely have a functional “goal slot,” although the way goals are selected and implemented is complex.
And it’s trivial to add a goal slot; with a highly intelligent LLM, one prompt called repeatedly will do:

Act as a helpful assistant carrying out the user’s instructions as they were intended. Use these tools to gather information, including clarifying instructions, and take action as necessary [tool descriptions and APIs].
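The “one prompt called repeatedly” idea can be sketched as a minimal agent loop. This is only an illustration under invented names: `call_llm` is a hypothetical stand-in for whatever completion API is in use (stubbed here so the loop runs end to end), and tool handling is reduced to a placeholder.

```python
# Minimal sketch of a "goal slot" implemented as one prompt called repeatedly.
# `call_llm` is a hypothetical stand-in for a real completion API, stubbed
# here so the loop is runnable.

GOAL_SLOT = (
    "Act as a helpful assistant carrying out the user's instructions "
    "as they were intended. Use the available tools to gather information, "
    "including clarifying instructions, and take action as necessary."
)

def call_llm(prompt: str) -> str:
    """Stub model: declares the task finished immediately."""
    return "DONE: task complete"

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    """Call the model with the fixed goal-slot prompt until it reports DONE."""
    transcript = []
    observation = f"User task: {task}"
    for _ in range(max_steps):
        reply = call_llm(f"{GOAL_SLOT}\n\n{observation}")
        transcript.append(reply)
        if reply.startswith("DONE"):
            break
        observation = f"Tool result for: {reply}"  # placeholder tool execution
    return transcript
```

The point is structural: the goal sits in a fixed slot of the prompt, while the loop supplies fresh observations each step.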
Nonetheless, the bitter lesson is relevant: it should help to carefully choose the training set for the LLM “thought production”, as described in A “Bitter Lesson” Approach to Aligning AGI and ASI.
While the bitter lesson is somewhat relevant, selecting and interpreting goals seems likely to be the core consideration once we expand current network AI into more useful (and dangerous) agents.
I think point 4 is not very justified. For example, chickens have pretty much hardcoded object permanence, while Sora, despite being insanely good at video generation, struggles[1] with it.
My hypothesis here is that object permanence is hard to learn by SGD but very easy for evolution: if you don’t have it, you die, and once random search finds it, it spreads as far as possible.
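The “you don’t have it, you die” dynamic is easy to see in a toy selection model (all parameters invented for illustration): a trait that is lethal to lack fixates almost immediately after random mutation first produces it.

```python
import random

def simulate(pop_size=100, mutation_rate=0.05, generations=500, seed=0):
    """Toy model: a trait is lethal to lack, so once random mutation
    produces it, selection carries it to fixation."""
    rng = random.Random(seed)
    pop = [False] * pop_size  # nobody starts with the trait
    for _ in range(generations):
        # random search: the trait occasionally appears by mutation
        pop = [trait or (rng.random() < mutation_rate) for trait in pop]
        survivors = [trait for trait in pop if trait]
        if survivors:
            # only survivors reproduce, so the trait spreads
            pop = [rng.choice(survivors) for _ in range(pop_size)]
    return sum(pop) / pop_size  # fraction of population with the trait
```

SGD has no analogous mechanism: nothing “dies” for lacking object permanence mid-training, so the pressure toward it is far more diffuse.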
The other example is that, apparently, cognitive specialization in the human (and higher-animal) brain has gone so far that the neocortex is incapable of learning a conditioned response, unlike the cerebellum. Moreover, it’s not that the neocortex is merely unspecialized while the cerebellum does the job better: patients with cerebellar atrophy simply don’t acquire conditioned responses, period. I think this casts serious doubt on most analogies between the brain and ANNs.
My personal hypothesis here is that LLMs evolve “backwards” relative to animals. Animals start as a set of fairly simple homeostatic control algorithms and hardcoded, sphexish behavioral programs, and this sets a hard design constraint on the development of the world-modeling parts of the brain: gaining flexibility and generality must not disrupt already-proven neural mechanisms. Speculatively, pure world modeling leads to familiar failure modes like “hallucinations,” so brain evolution is largely directed by the need to filter faulty outputs of the world model. For example, non-human animals reportedly don’t develop schizophrenia; it looks like the price of an untamed, overdeveloped predictive model.
I would say that any claims about the “nature of effective intelligence” extrapolated from current LLMs are very speculative. What’s true is that something very weird is going on in the brain, but we know that already.
Your interpretation of instruction tuning as corrigibility is wrong; it’s anything but. We train a neural network to predict text, then slightly tune its priors toward “if the text is an instruction, its completion follows the instruction.” It’s as if we controlled ants by drawing trails of sugar: yes, the ants will follow those trails, eat the surfaces marked with sugar, and eat their way toward wherever we place more, but that does not yield corrigible behavior in a superintelligence. I think many-shot jailbreaks are sufficient to carry my point.
[1] (Prompt: A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera. You can see a person disappearing between the 3rd and 5th seconds.)
(Prompt: A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast… You can see people disappearing at the 10th second.)
This reminds me of Moravec’s paradox.
Hm, isn’t the (relative) success of activation engineering and related methods and findings (e.g., In-Context Learning Creates Task Vectors) some evidence against this view (at least taken very literally / to the extreme)? As in, shouldn’t task vectors seem very surprising under this view?
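A toy version of what makes task vectors notable (not the method of any particular paper; the two-unit “model” and all numbers are invented for illustration): adding one fixed vector to a hidden activation flips which output the model prefers.

```python
# Toy activation steering: adding a fixed "task vector" to a hidden
# activation flips the preferred output of a tiny hand-built model.
# All weights and vectors are invented for illustration.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two output logits read off a 2-unit hidden layer.
W_OUT = {"french": [1.0, 0.0], "english": [0.0, 1.0]}

def predict(hidden):
    logits = {name: dot(w, hidden) for name, w in W_OUT.items()}
    return max(logits, key=logits.get)

# Hidden state for some input; unsteered, it prefers "english".
hidden = [0.2, 0.9]

# A steering vector pointing toward the "french" direction, e.g. a
# difference of mean activations over contrastive prompts.
task_vector = [1.5, -0.5]

steered = [h + v for h, v in zip(hidden, task_vector)]
# predict(hidden) == "english", predict(steered) == "french"
```

If goal-relevant content were irreducibly distributed, a single linear offset should not steer behavior this cleanly; that is the sense in which such results bear on the “no goal slot” view.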
You seem to have ‘proven’ that evolution would use that exact method if it could, since evolution never looks forward and always must build on prior adaptations which provided immediate gain. By the same token, of course, evolution doesn’t have any knowledge, but if “knowledge” corresponds to any simple changes it could make, then that will obviously happen.