Let’s assume that GPT-5 will hit the AGI recursive-self-improvement bar, so that if it is bundled into an AutoGPT-style loop and given a bold task such as “solve a major challenge of humanity, e.g., global hunger” or “build a billion-dollar business”, it could meaningfully achieve a lot via instrumental convergence and actual agency, including self-improvement, at which point it is effectively out of control.
I think that “build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans” won’t be enough to align such an AGI “once and forever”, and that actually trying to task a thus-”aligned” AGI with a bold task won’t end well overall (even if it happens to solve the given task).
But I do think that this will be enough to produce [epistemology, alignment, governance, infosec, infrastructure design, etc.] science safely within AGI labs with a very high probability, > 90%. By “safely producing science” I mean:
(1) Not spinning out of control while doing (largely or exclusively) theoretical science, where any non-theoretical action, such as “let’s run a such-and-such simulation with such-and-such programmed actors” or “let’s train such-and-such NN to test the such-and-such theory of representation learning”, is supervised by human researchers;
(2) Not being able to sneak “backdoors”, or at least serious self-serving biases, into the produced epistemology, alignment, and governance science that human scientists won’t be able to spot and correct when they review the AI outputs. (Caveat: this probably doesn’t apply to ethics-as-philosophy, as I discussed here, so ethics should be practised as a science as well, hence my references to ethical naturalism.)
Then, I detailed my disagreement with Shimi here. For alignment, we will need to apply multiple frameworks/paradigms simultaneously, and “RLHF + language feedback + debate/critiques + scalable oversight” is just one of these paradigms, which I called “linguistic process theories of alignment” in the linked post. We will need to apply several more paradigms simultaneously and continuously, throughout the training, deployment, and operations phases of AIs, such as Bayesian-style “world model alignment”, LeCun’s “emotional drives”, game-theoretic and mechanism-design-based process approaches (designing the right incentives for AIs during their development and deployment), the (cooperative, inverse) RL paradigm, etc.
Specifically, I don’t think that if we apply all the above-mentioned approaches and a few more (which is, of course, not easy at all in itself: there are serious theoretical and practical issues with most of them, such as computational tractability or a too-high alignment tax, but I believe these problems are soluble, and the tax could be reduced by developing smart amortization techniques), we are still “doomed” because there is a glaring theoretical or strategic “hole” that must be closed with some totally new, unheard-of, or unexpected conceptual approach to alignment. As I described in the post, I think about this in terms of “the percent of the complexity of human values covered” by a portfolio of approaches, and when this portfolio is bigger than 3-4 approaches, almost all of the complexity of value is covered, and it’s not even that important which specific approaches are in the portfolio, as long as they are scientifically sound. That’s why I advocate marshalling most efforts behind existing approaches: it’s not that cooperative inverse RL, RLHF + language feedback, or Active Inference may turn out to be “totally useless”, so that working on maturing these approaches and turning them into practical process theories of alignment would be wasted effort. These approaches will be useful if they are applied, and naturally, approaches that already have some momentum behind them are more likely to be applied.
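To make the “percent of the complexity of human values covered” intuition concrete, here is a minimal toy sketch (my own illustration, not something from the linked post), under the strong assumption that each approach independently covers some fraction of that complexity; the per-approach numbers are invented:

```python
# Toy model (my own simplification): if each approach independently covers
# some fraction of the complexity of value, a portfolio's combined coverage is
#     coverage = 1 - prod(1 - c_i)
from math import prod

def portfolio_coverage(per_approach_coverage):
    """Combined coverage of the complexity of value, assuming independence."""
    return 1 - prod(1 - c for c in per_approach_coverage)

# Hypothetical per-approach coverage fractions, purely illustrative:
approaches = [0.5, 0.4, 0.4, 0.3]

for k in range(1, len(approaches) + 1):
    print(f"{k} approach(es): {portfolio_coverage(approaches[:k]):.0%} of complexity covered")
# 1 approach(es): 50% of complexity covered
# 2 approach(es): 70% of complexity covered
# 3 approach(es): 82% of complexity covered
# 4 approach(es): 87% of complexity covered
```

Under this toy independence assumption, diminishing returns set in after three or four approaches, which is the shape of the claim above; dropping independence would change the numbers but not the qualitative point.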
Dealing with unaligned competition
You mentioned one potential “strategic hole” in the picture that I outlined above: an unaligned AI may have a competitive advantage over an AI aligned with a “portfolio of approaches” (at least the presently existing approaches), so, (I assume your argument goes,) we need an HRAD-style invention of a “super capable, but also super corrigible” AI architecture. I’m personally doubtful that
alignment approaches actually limit agentic capability,
corrigibility is a useful, coherent concept (as I mentioned in this comment), and
provably-controllable highly reliable agent design is possible in principle: the law of requisite variety could demand that AI be “complex and messy” in order to be effective in the “complex and messy” world.
But I’m not confident in any of these three doubts, so I should leave about 10% probability that all three of them are wrong and hence that your strategic concern is valid.
However, I think we should address this strategic concern by rewiring the economic and action landscapes (which also interacts with the “game-theoretic, mechanism-design” alignment paradigm mentioned above). The current (internet) infrastructure and economic systems are not prepared for the emergence of powerful adversarial agents at all:
There are no systems of trust and authenticity verification at the root of internet communication (see https://trustoverip.org/; a minimal signature-verification sketch follows this list).
The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.)
Money leaves no trace, so one can earn it in arbitrarily malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable providers, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute governance and data governance, and human-based KYC procedures, can only go so far and could probably be defeated via social engineering by a superhuman imposter or persuader AI.
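As an illustration of the first point above, here is a minimal sketch of what message-level authenticity verification looks like at the lowest layer; the choice of the Python `cryptography` library, Ed25519 keys, and the example message are my own assumptions, and real trust frameworks such as Trust over IP layer identity binding, verifiable credentials, and governance on top of primitives like this:

```python
# Minimal sketch of message-level authenticity verification (illustrative only).
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# The sender's key pair; in a trust framework, the public key would be bound
# to a verified identity rather than floating free.
sender_key = Ed25519PrivateKey.generate()
sender_pub = sender_key.public_key()

message = b"Please provision compute for a training run"
signature = sender_key.sign(message)

# The receiver verifies provenance before acting on the request.
try:
    sender_pub.verify(signature, message)
    print("authentic: provenance verified, safe to act on")
except InvalidSignature:
    print("rejected: cannot establish who sent this")
```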
In essence, we want to design civilisational cooperation systems such that being aligned is a competitive advantage. Cf. “The Gaia Attractor” by Rafael Kaufmann.
This is a very ambitious program to rewire the entirety of the internet, other infrastructure, and the economy, but I believe this must be done anyway; just expecting a “miracle” HRAD invention to be sufficient, without fixing the infrastructure and system-design layers, doesn’t sound like a good strategy. By the way, such infrastructure and economy rewiring is the real “pivotal act”.
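To make the claim above, that well-designed cooperation systems can make being aligned a competitive advantage, slightly more concrete in mechanism-design terms, here is a deliberately stylised toy model; the gating mechanism, catch probabilities, and all payoffs are invented purely for illustration:

```python
# Stylised toy model (mechanism and numbers invented): an "unaligned" strategy
# earns more per transaction, but a cooperation system that gates resources
# (compute, data, payments) behind verified alignment attestations changes
# which strategy wins in expectation.

def expected_payoff(per_tx_gain, access_prob, penalty_if_caught, catch_prob):
    """Expected payoff per transaction under a gated cooperation system."""
    return access_prob * (per_tx_gain - catch_prob * penalty_if_caught)

# Without gating: everyone gets full access, nothing is ever traced.
aligned_no_gate   = expected_payoff(per_tx_gain=1.0, access_prob=1.0,
                                    penalty_if_caught=0.0, catch_prob=0.0)
unaligned_no_gate = expected_payoff(per_tx_gain=1.5, access_prob=1.0,
                                    penalty_if_caught=0.0, catch_prob=0.0)

# With gating: unverified agents mostly lose access to respectable resource
# providers, and traced misbehaviour carries a penalty.
aligned_gated   = expected_payoff(per_tx_gain=1.0, access_prob=1.0,
                                  penalty_if_caught=0.0, catch_prob=0.0)
unaligned_gated = expected_payoff(per_tx_gain=1.5, access_prob=0.2,
                                  penalty_if_caught=5.0, catch_prob=0.5)

print(f"no gate: aligned={aligned_no_gate:.2f}, unaligned={unaligned_no_gate:.2f}")
print(f"gated  : aligned={aligned_gated:.2f}, unaligned={unaligned_gated:.2f}")
# no gate: aligned=1.00, unaligned=1.50   -> unaligned wins
# gated  : aligned=1.00, unaligned=-0.20  -> aligned wins
```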
I think we agree on a lot more than I realized! In particular, I don’t disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work). Things I disagree with:
I generally disagree with the claim “alignment approaches don’t limit agentic capability.” This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think “aligning models” generally trades off bits of optimization pressure with “making models performance-competitive,” which makes building aligned models less training-competitive for a given degree of performance.
I generally disagree with the claim “corrigibility is not a useful, coherent concept.” I think there is a (narrow) attractor basin around “corrigibility” in cognition space. Happy to discuss more and possibly update.
I generally disagree with the claim “provably-controllable highly reliable agent design is impossible in principle.” I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is vanishingly hard in practice (which informs my sense of alignment difficulty only insomuch as I hope we don’t hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent.
I didn’t imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as part of a portfolio, I think they are indispensable.
Probably the degree to which a person (let’s say, you or I) buys into the importance of mechanism/higher-level-system design for AI alignment corresponds to where we land on the cognitivism-enactivism spectrum. If you are a hardcore cognitivist, you may think that just designing and training AI “in the right way” would be 100% sufficient. If you are a radical enactivist, you probably endorse the opposite view: that designing incentives is necessary and sufficient, while designing and training AI “in the right way” is futile if the correct incentives are not in place.
I’m personally very uncertain about the cognitivism-enactivism spectrum (as is, apparently, the scientific community: there are completely opposite positions held by respectable cognitive scientists), so from the “meta” rationality perspective, I should hedge my bets and make sure to neglect neither the cognitive representations and AI architecture nor the incentives and the environment.
I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
Pre-AGI, I think there won’t be enough collective willingness to upend the economy and institute the “right” structures anyway. The most realistic path (albeit still with a lot of failure modes) that I see is: “an alignment MVP tells us (or helps scientists to develop the necessary science) how society, markets, and governance should really be structured, along with the AI architectures and training procedures for the next-gen AGI, and this convinces scientists and decision-makers around the world”.
The difference is that after an alignment MVP, first, the sceptical voice saying “let’s keep things as they always have been, it’s all hype, there is no intelligence” should largely cease, at least among intelligent people. Second, an alignment MVP should show that ending scarcity is a real possibility, which should weaken the grip of status-quo economic and political incentives. Again, a lot could go wrong around this time, but it seems to me this is the path OpenAI, Conjecture, and probably other AGI labs are aiming for, because they perceive it as the least risky or the only feasible one.
I’m somewhere in the middle of the cognitivist/enactivist spectrum. I think that e.g. relaxed adversarial training is motivated by trying to make an AI robust to arbitrary inputs it will receive in the world before it leaves the box. I’m sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it.
I’m not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., better cooperation and commitment mechanisms) concurrent with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.