However, I think humanity is going to build AGI without pausing long enough to create the most alignable type. And I think the limitations you mention almost all apply to any type of AGI we'll realistically build first. (The limitations that are specific to LMAs mostly have analogues for any other network-based AGI, I think.) That's why I'm concluding that LMAs are our best bet. Would you agree with that?
Who are "we"? I don't think the AI safety community is a coherent entity. OpenAI, Anthropic, and DeepMind seem to think it's their (individually) best bet, partly because they are afraid of losing the race to each other, and partly to some unscrupulous third-party actors who won't worry about alignment at all. It's clearly not the best bet for humanity, though. Humanity is obviously not a coherent entity either, but I agree that some people will probably try to build AGI no matter what, without much concern for safety.
I don't see effective responses to this situation, though, apart from joining the big labs in their efforts (joining in their bet), which I'm not sure is net positive or net negative. There are responses that could be net positive but would take decades to implement, if they could succeed at all: from reforming institutions and governance to making the world's (internet) infrastructure radically safer, more compartmentalised, distributed, and trust-based, which should tip the offense-defense balance towards defense. But given the timelines to AGI and then ASI (mine are very short, and OpenAI's don't seem much longer), these actions are not effective either.
By “we” I do mean the AI safety community, while understanding that not everyone will agree on the same course of action.
I think the AI safety community can have an effect outside of joining the big labs. If we as a community produced a better, more reliable approach to aligning AGI (of the type they’re building anyway), and it had a low alignment tax, the big labs would adopt that approach. So that’s what I’m trying to do.
Of course, such a safety plan would need to get enough visibility for the safety teams in the big orgs to know about it, but that’s their job so that’s a low bar.
I agree that large-scale changes of the type you describe will take too long for this route to AGI; but regulation could still play a role within the four-year timescale that OpenAI is talking about.
If we as a community produced a better, more reliable approach to aligning AGI (of the type they’re building anyway), and it had a low alignment tax, the big labs would adopt that approach.
How does anybody know that the alignment protocols outlined or sketched on LW (we can't really say "designed" or "engineered", because independent AI safety researchers invariably stop well before that) are "better", without testing them in large-scale compute/training experiments, and/or interacting directly with the parameter weights of SoTA models such as GPT-4 (which these outside researchers don't and won't have access to)?
Just hypothesising about this or that alignment protocol or safety trick is not enough. Ideas are pretty cheap in this space; running realistic experiments is much rarer, and doing the hard engineering work to take an idea from PoC to production is rarer and scarcer still. I'm sure people at OpenAI and other labs already compete for the bottlenecked resource: the privilege of applying their alignment ideas to actual production systems like GPT-4.
I'm sure there is already a lot of internal competition, and even politics, around this. Assuming that outsiders can produce an alignment idea so marvellous that influential insiders become enamoured with it and spend a lot of their political capital to carry it all the way to production is… a very tall order.
In addition, observe how, within the language-modelling paradigm of AGI and alignment, a lot of ideas seem cool, potentially helpful, or promising, but not a single idea seems like an undeniable slam dunk (actually, this observation largely applies outside the language-modelling paradigm as well). This is not a coincidence (the longer story of why, I'll save for a post I will publish on this topic soon), and I think it will continue to be the case for any new ideas within this paradigm. This makes it even more improbable that somebody will fortuitously stumble upon an alignment idea so much stronger than any idea entertained before that it compels AGI labs to adopt it on a large scale. I don't think there is a sufficient gradient/contrast of "idea strength" anywhere in the space of LM alignment, at all.
This seems like a very pessimistic take on the whole alignment project. If you’re right, we’re all dead. I’d prefer to assume that there are such things as good ideas, and that they have some sway in the face of politics and the difficulty of doing theory about a type of system that isn’t yet implemented.
I see a pretty clear gradient of idea strength in alignment. There are good ideas that apply to systems we’re not building, and there are decent ideas about aligning the types of AGI we’re actually making rapid progress on, namely RL agents and language models.
I’m not talking about a hypothetical slam-dunk future idea. I don’t think we’ll get one, because the AGI we’re developing is complex. There will be no certain proofs of alignment. I’m talking about the set of ideas in this post.
I’ll post two sections from the post that I’m planning because I’m not sure when I will summon the will to post it in full.
1. The AI safety and alignment fields are theoretical "swamps"
Unlike classical mechanics, thermodynamics, optics, electromagnetism, chemistry, and other branches of natural science that are the basis of "traditional" engineering, AI (safety) engineering science is troubled by the fact that neural networks (both natural and artificial) are complex systems, and therefore a scientist (i.e., a modeller) can "find" a lot of different theories within the dynamics of neural nets. Hence the proliferation of theories of neural networks, (value) learning, and cognition: https://deeplearningtheory.com/, https://transformer-circuits.pub/, https://arxiv.org/abs/2210.13741, singular learning theory, shard theory, and many, many others.
This has important implications:
No single theory is "completely correct": the behaviour of a neural net may just not be very "compressible" (computationally reducible, in Wolfram's terms). Different theories "fail" (i.e., predict the behaviour of the NN incorrectly, or fail to make a prediction) in different aspects of the behaviour and in different contexts.
Therefore, different theories can perhaps at best be partially or "fuzzily" ordered in terms of their quality and predictive power, and maybe some of them can't be ordered at all.
2. Independent AI safety research is totally ineffective at affecting the trajectory of AGI development at major labs
Considering the above, choosing a particular theory as the basis for AI engineering, evals, monitoring, and anomaly detection at AGI labs becomes a matter of:
Availability: which theory is already developed, and for which is there existing expertise among scientists at a particular AGI lab?
Convenience: which theory is easy to apply to (or "read into") the current SoTA AI architectures? For example, auto-regressive LLMs greatly favour "surface-linguistic" theories and processes of alignment, such as RLHF or Constitutional AI, and don't particularly favour theories of alignment that analyse AIs' "conceptual beliefs" and their (Bayesian) "states of mind".
Research and engineering taste of the AGI lab’s leaders, as well as their intuitions: which theory of intelligence/agency seems most “right” to them?
At the same time, the choice of theories of cognition and (process) theories of alignment is biased by political and economic/competitive pressures (cf. the alignment tax).
For example, any theory that predicts that current SoTA AIs are already significantly conscious, and that AGI labs should therefore apply commensurate ethical standards to the training and deployment of these systems, would be both politically unpopular (the public doesn't generally like widening the circle of moral concern and does so very slowly and grudgingly, while altering political systems to give rights to AIs is a nightmare for the current political establishment) and economically/competitively unpopular (it could stifle AGI development and the integration of AGIs into the economy, likely ceding ground to even less scrupulous actors, from countries and corporations to individual hackers). These huge pressures will very likely lead the major AGI labs to write such theories of AI consciousness off as "unproven" or "unconvincing".
In this environment, it's very hard to see how an independent AI safety researcher could scaffold a theory so impressive that some AGI lab will decide to adopt it, which may demand scrapping work that has already cost hundreds of millions of dollars to produce (i.e., auto-regressive LLMs). I can imagine this happening only if there is extraordinary momentum and excitement about a certain theory of cognition, agency, consciousness, or neural networks in the academic community. But achieving such a high level of enthusiasm about one specific theory seems just impossible because, as noted above, in AI science and cognitive science a lot of different theories seem to "capture the truth" to some degree, but no theory captures it so strikingly, and so much better than the others, that it generates a reaction in the scientific and AGI development community stronger than "nice, this seems plausible, good work, but we will carry on with our own favourite theories and approaches"[footnote: I wonder what the last theory was, in any science, that gained this level of universal, "consensus" acceptance within its field relatively quickly. Dawkins' theory of selfish genes in evolutionary biology, perhaps?].
Thus, it seems to me that large paradigm shifts in AGI engineering can only be driven by demonstrably superior capability (or training/learning efficiency, or inference efficiency) that would compel the AGI labs to switch, again for economic and competitive reasons. It doesn't seem that purely theoretical or philosophical considerations in such "theoretically swampy" fields as cognitive science, consciousness, and (AI) ethics could generate nearly sufficient motivation for AGI labs to change course, even in principle.
I’m not talking about a hypothetical slam-dunk future idea. I don’t think we’ll get one, because the AGI we’re developing is complex. There will be no certain proofs of alignment. I’m talking about the set of ideas in this post.
As I said, ideas about LLM (and LMA) alignment are cheap. We can generate lots of them: special training-data sequencing and curation (aka "raise AI like a child"), feedback during pre-training, fine-tuning or RL after pre-training, debate, internal review, etc. The question is how many of these ideas should be implemented in the production pipeline: 5? 50? All ideas that LW authors could possibly come up with? The problem is that each of these "ideas" must be supported in production, possibly by an entire team of people, and each incurs compute cost and higher latency (which worsens the user experience). Also, who should implement these ideas? All leading labs that develop SoTA LMAs? Open-source LMA developers, too?
And yes, I think it's a priori hard, and perhaps often impossible, to judge how this or that LMA alignment idea will work at scale.