I think there should be many more seminars about the philosophy of science, methodology of science, and strategy of research in the program. Perhaps not at the expense of other seminar content, but by increasing the number of seminar hours. Talks about ethics, and in particular about ethical naturalism (such as Oliver Scott Curry’s “Morality as Cooperation”) or the interaction of ethics and science more generally, also seem essential.
MATS is currently focused on developing scholars as per the “T-model of research” with three main levers:
Mentorship (weekly meetings + mentor-specific curriculum);
Curriculum (seminars + workshops + Alignment 201 + topic study groups);
Scholar support (1-1s for unblocking + research strategy).
The “T-model of research” is:
Breadth-first search (literature reviews, building a “toolbox” of knowledge, noticing gaps);
Depth-first search (forming testable hypotheses, executing research, recursing appropriately, using checkpoints);
Originality (identifying threat models, backchaining to local search, applying builder/breaker methodology, babble and prune, “infinite-compute/time” style problem decompositions, etc.).
In the Winter 2022-23 Cohort, we ran several research strategy workshops (focusing on problem decomposition and strategy identification) and had dedicated scholar support staff who offered regular, airgapped 1-1 support for research strategy and researcher unblocking. We will publish some findings from our Scholar Support Post-Mortem soon. We plan to run further research strategy and “originality” workshops in the Summer 2023 Cohort and are currently reviewing our curriculum from the ground up.
Currently, we are not focused on “ethical naturalism” or similar as a curriculum element as it does not seem broadly useful for our cohort compared to project-specific, scholar-driven self-education. We are potentially open to hosting a seminar on this topic.
In terms of pointing program members to a list of scientific disciplines relevant to AI safety (corresponding to the “Agency and Alignment ‘Major’” section in the linked post), I’d propose also pointing to this list, which is much fuller and has many references to relevant recent literature.
I’m somewhat against maximising originality specifically in the context of alignment theories (as Adam Shimi has put it: “We need far more conceptual AI alignment research approaches than we have now if we want to increase our chances to solve the alignment problem.”). I argue that the optimal number of conceptual approaches to alignment developed by the industry, given the resource constraints and short timelines, is probably 5-7, or even fewer.
This doesn’t apply to the aspects of ensuring that the civilisational AI transition goes well that are other than alignment, such as anomaly detection, resilience, information security implications, control and curation of the memetic space, post-AGI governance (of everything), etc., as well as the theories of change that you cover in another comment. For these things, as much originality as possible seems good.
This is good to hear. Sounds to me like you were already doing most of what I’m suggesting, though this was not visible from the outside. Looking forward to reading the post-mortem.
I still think it would be very valuable if you invited scientists like Thomas Metzinger or Andy Clark to give talks on philosophy of cognitive science and alignment, prompting them with a question like “What do you think alignment/AI safety researchers should understand about philosophy of (cognitive) science?”
Thank you for recommending your study guide; it looks quite interesting.
MATS does not endorse “maximizing originality” in our curriculum. We believe that good original research in AI safety comes from a combination of broad interdisciplinary knowledge, deep technical knowledge, and strong epistemological investigation, which is why we emphasize all three. I’m a bit confused by your reference to Adam’s post. I interpret his post as advocating for more originality, not less, in terms of diverse alignment research agendas.
I think that some of the examples you gave of “non-alignment” research areas are potentially useful subproblems for what I term “impact alignment.” For example, strong anomaly detection (e.g., via mechanistic interpretability, OOD warning lights, or RAT-style acceptability verification) can help ensure “inner alignment” (e.g., through assessing corrigibility or myopia), and infosec/governance/meme-curation can help with “outer alignment,” “paying the alignment tax,” and mitigating “vulnerable world” situations involving stolen or open-sourced weaponizable models. I think this inner/outer distinction is a useful framing, though not the only way to carve reality at the joints, of course.
I think Metzinger or Clark could give interesting seminars, though I’m generally worried about encouraging the anthropomorphism of AGI. I like the “shoggoth” or “alien god” memetic framings of AGI, as these (while wrong) permit a superset of human-like behavior without restricting assumptions of model internals to (unlikely and optimistic, imo) human-like cognition. In this vein, I particularly like Steve Byrnes’ research as I feel it doesn’t overly anthropomorphize AGI and encourages the “competent psychopath with alien goals” memetic framing. I’m intrigued by this suggestion, however. How do you think Metzinger or Clark would specifically benefit our scholars?
(Note: I’ve tried to differentiate between MATS’ organizational position and mine by using “I” or “we” when appropriate.)
The quote by Shimi exemplifies the position with which I disagree. I believe Yudkowsky and Soares have also stated something along these lines previously, but I couldn’t find suitable quotes. I don’t know whether any of these three people still hold this position, though.
I heard Metzinger reflecting on the passage of paradigms in cognitive science (here), so I think he would generally endorse the idea that I expressed here (though I don’t claim that I understand his philosophy perfectly, let alone improve on it; if I understood his exact version of this philosophy and how exactly it differs from mine, I would likely change mine).
As for Clark, I’m not sure what he would actually answer to a prompt like “What do you think are the blind spots or thinking biases of young AI (alignment, safety) researchers? What would you wish them to understand?”. Maybe his answer would be “I think their views are reasonable, I don’t see particular blind spots.” But maybe not. This is an open-ended question that I cannot answer myself, because I am probably susceptible to exactly the blind spots or biases he might see from the “outside”.
Alignment research obviously needs to synthesise a deep understanding and steering of AIs with a deep understanding and steering (aka prompt engineering) of humans: the principles of human cognition and the nature of human values. So being exposed even to a rather “anthropocentric” philosophy of human cognition/values would be insightful and valuable for alignment researchers (as long as it is anthropocentric in a sound way, of course). Yet I think Clark’s philosophy is certainly more substrate- and mechanism-independent.
I don’t disagree with Shimi as strongly as you do. I think there’s some chance we need radically different paradigms for aligning AI than “build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans.”
While I do endorse some anthropocentric “value-loading”-based alignment strategies in my portfolio, such as Shard Theory and Steve Byrnes’ research, I worry about over-investing in anthropocentric AGI alignment strategies. I don’t necessarily think that RLHF shapes GPT-N in a manner similar to how natural selection and related processes shaped humans to be altruistic. I think it’s quite likely that the kind of cognition GPT-N learns in order to predict tokens is more akin to an “alien god” than to human cognition. I think that trying to value-load an alien god is pretty hard.
In general, I don’t highly endorse the framing of alignment as “making AIs more human.” I think this kind of approach fails in some worlds and might produce models that are not performance-competitive enough to outcompete the unaligned models others deploy. I’d rather produce corrigible models with superhuman cognition coupled with robust democratic institutions. Nevertheless, I endorse at least some research along this line, but this is not the majority of my portfolio.
Seems to me that you misunderstood my position.
Let’s assume that GPT-5 will hit the bar of AGI capable of recursive self-improvement, so that if it is bundled into an AutoGPT-style loop and given a bold task such as “solve a big challenge of humanity, such as global hunger” or “build a billion-dollar business”, it could meaningfully achieve a lot via instrumental convergence and actual agency, including self-improvement, at which point it is effectively out of control.
I think that “build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans” won’t be enough to align such an AGI “once and forever”, and that actually trying to give thus-“aligned” AGI a bold task won’t end well overall (even if it happens to solve the given task).
But I do think that this will be enough to produce [epistemology, alignment, governance, infosec, infrastructure design, etc.] science safely within AGI labs with a very high probability, > 90%. By “safely producing science” I mean:
(1) Not spinning out of control while doing (largely or exclusively) theoretical science, where any non-theoretical action, such as “let’s run such-and-such simulation with such-and-such programmed actors” or “let’s train such-and-such NN to test such-and-such theory of representation learning”, is supervised by human researchers;
(2) Not being able to sneak “backdoors”, or at least serious self-serving biases, into the produced epistemology, alignment, and governance science that human scientists won’t be able to spot and correct when they review the AI outputs. (Caveat: this probably doesn’t apply to ethics-as-philosophy, as I discussed here, so ethics should be practised as a science as well, hence my references to ethical naturalism.)
I also detailed my disagreement with Shimi here. For alignment, we will need to apply multiple frameworks/paradigms simultaneously, and “RLHF + language feedback + debate/critiques + scalable oversight” is just one of these paradigms, which I called “linguistic process theories of alignment” in the linked post. We will need to apply several more paradigms simultaneously and continuously, throughout the training, deployment, and operations phases of AIs, such as Bayesian-style “world model alignment”, LeCun’s “emotional drives”, game-theoretic and mechanism-design-based process approaches (designing the right incentives for AIs during their development and deployment), the (cooperative, inverse) RL paradigm, etc.
Specifically, I don’t think that if we apply all the above-mentioned approaches and a few more (which is, of course, not easy at all in itself, as there are serious theoretical and practical issues with most of them, such as computational tractability or too high an alignment tax; but I believe these problems are soluble, and the tax could be reduced by developing smart amortization techniques), we are still “doomed” because there is a glaring theoretical or strategic “hole” that must be closed with some totally new, unheard-of, or unexpected conceptual approach to alignment. As I described in the post, I think about this in terms of “the percent of the complexity of human values covered” by a portfolio of approaches: once the portfolio is bigger than 3-4 approaches, almost all the complexity of value is covered, and it’s not even that important which specific approaches are in the portfolio, as long as they are scientifically sound. That’s why I advocate marshalling most efforts behind existing approaches: it’s not that cooperative inverse RL, RLHF + language feedback, or Active Inference might turn out to be “totally useless”, such that working on maturing these approaches and turning them into practical process theories of alignment would be wasted effort. These approaches will be useful if they are applied, and naturally, approaches that already have some momentum behind them are more likely to be applied.
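To make the “coverage” intuition concrete, here is a toy sketch (my own illustration, not anything from the linked post): if each approach in the portfolio independently covered some fraction of the complexity of human values, combined coverage would saturate quickly after 3-4 approaches. The per-approach numbers are made up, and the independence assumption is doing a lot of work.

```python
# Toy model only: assumes each approach independently covers a fixed,
# made-up fraction of "the complexity of human values".
def combined_coverage(per_approach_coverage):
    """Fraction covered by at least one approach, under independence."""
    uncovered = 1.0
    for c in per_approach_coverage:
        uncovered *= (1.0 - c)
    return 1.0 - uncovered

portfolio = [0.5, 0.4, 0.4, 0.3]  # hypothetical coverages of four approaches
for n in range(1, len(portfolio) + 1):
    print(n, round(combined_coverage(portfolio[:n]), 3))
# -> 1: 0.5, 2: 0.7, 3: 0.82, 4: 0.874
```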
Dealing with unaligned competition
You mentioned one potential “strategic hole” in the picture that I outlined above: an unaligned AI may have a competitive advantage over an AI aligned with a “portfolio of approaches” (at least the presently existing approaches), so (I assume your argument goes) we need an HRAD-style invention of a “super capable, but also super corrigible” AI architecture. I’m personally doubtful that:
alignment approaches actually limit agentic capability,
corrigibility is a useful, coherent concept (as I mentioned in this comment), and
provably-controllable highly reliable agent design is possible in principle: the law of requisite variety could demand that AI be “complex and messy” in order to be effective in a “complex and messy” world.
But I’m not confident about any of these three propositions, so I should leave about a 10% chance that all three are false and hence that your strategic concern is valid.
However, I think we should address this strategic concern by rewiring the economic and action landscapes (which also interacts with the “game-theoretic, mechanism-design” alignment paradigm mentioned above). The current (internet) infrastructure and economic systems are not prepared for the emergence of powerful adversarial agents at all:
There are no systems of trust and authenticity verification at the root of internet communication (see https://trustoverip.org/); a minimal sketch of the kind of authenticity primitive involved appears after this list.
The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.).
Money carries no trace of its origin, so one may earn it in arbitrarily malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable places, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute and data governance, and human-based KYC procedures, can only go so far and could probably be social-engineered by a superhuman imposter or persuader AI.
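As a minimal illustration of what “authenticity verification at the root” could build on, here is a sketch of signing and verifying a message with an Ed25519 key pair using the Python pyca/cryptography library. The library choice and the payload are mine, purely for illustration; Trust over IP specifies much richer identity and governance layers on top of primitives like this.

```python
# Sketch only: digital signatures as a basic authenticity primitive.
# Library (pyca/cryptography) and payload are illustrative choices.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()

message = b"hypothetical payload: who published this model checkpoint"
signature = signing_key.sign(message)

verify_key.verify(signature, message)  # no exception: message is authentic
try:
    verify_key.verify(signature, message + b"!")  # tampered message
except InvalidSignature:
    print("tampered message rejected")
```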
In essence, we want to design civilisational cooperation systems such that being aligned is a competitive advantage. Cf. “The Gaia Attractor” by Rafael Kaufmann.
This is a very ambitious program to rewire the entirety of the internet, other infrastructure, and the economy, but I believe this must be done anyway; just expecting a “miracle” HRAD invention to be sufficient, without fixing the infrastructure and system-design layers, doesn’t sound like a good strategy. By the way, such rewiring of infrastructure and the economy is the real “pivotal act”.
I think we agree on a lot more than I realized! In particular, I don’t disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work). Things I disagree with:
I generally disagree with the claim “alignment approaches don’t limit agentic capability.” This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think “aligning models” generally trades off bits of optimization pressure against “making models performance-competitive,” which makes building aligned models less training-competitive for a given degree of performance (see the toy sketch after this list for one way to picture this trade-off).
I generally disagree with the claim “corrigibility is not a useful, coherent concept.” I think there is a (narrow) attractor basin around “corrigibility” in cognition space. Happy to discuss more and possibly update.
I generally disagree with the claim “provably-controllable highly reliable agent design is impossible in principle.” I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is vanishingly hard in practice (which informs my sense of alignment difficulty only insomuch as I hope we don’t hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
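A toy numeric sketch of the trade-off referenced above (my illustration only, not the unpublished analysis mentioned): treat “alignment” as a filter that keeps only a 2^-k fraction of candidate policies, i.e. costs k bits of optimization pressure, and assume the filter is independent of raw performance. The best performance among the surviving policies then shrinks as k grows.

```python
# Toy sketch: spending k bits of optimization pressure on an alignment
# filter (assumed independent of performance) lowers the best attainable
# performance among the surviving candidate policies.
import random

random.seed(0)

def best_surviving_score(n_candidates, k, trials=20):
    """Average best score among ~n_candidates * 2^-k random survivors."""
    total = 0.0
    for _ in range(trials):
        n_keep = max(1, n_candidates >> k)  # ~ n_candidates * 2^-k
        survivors = [random.gauss(0.0, 1.0) for _ in range(n_keep)]
        total += max(survivors)
    return total / trials

for k in [0, 4, 8, 12]:
    print(k, round(best_surviving_score(100_000, k), 2))
# Best score falls roughly like the expected max of a 2^-k-sized sample.
```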
I didn’t imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as part of a portfolio, I think they’re indispensable.
Probably the degree to which a person (let’s say, you or I) buys into the importance of mechanism/higher-level-system design for AI alignment corresponds to where we land on the cognitivism/enactivism spectrum. If you are a hardcore cognitivist, you may think that just designing and training AI “in the right way” would be 100% sufficient. If you are a radical enactivist, you probably endorse the opposite view: that designing incentives is necessary and sufficient, while designing and training AI “in the right way” is futile if the correct incentives are not in place.
I’m personally very uncertain about the cognitivism/enactivism spectrum (as is, apparently, the scientific community: there are completely opposite positions out there held by respectable cognitive scientists), so from the “meta” rationality perspective, I should hedge my bets and make sure to neglect neither the cognitive representations and AI architecture nor the incentives and the environment.
Pre-AGI, I think there won’t be enough collective willingness to upend the economy and institute the “right” structures anyway. The most realistic path (albeit still with a lot of failure modes) that I see is “the alignment MVP tells us (or helps scientists to develop the necessary science about) how society, markets, and governance should really be structured, along with the AI architectures and training procedures for the next-gen AGI, and this convinces scientists and decision-makers around the world”.
The difference is that, after the alignment MVP, first, the sceptical voice saying “Let’s keep things as they always used to be, it’s all hype, there is no intelligence” should cease completely, at least among intelligent people. Second, the alignment MVP should show that ending scarcity is a real possibility, and this should weaken the grip of status-quo economic and political incentives. Again, a lot of things could go wrong around this time, but it seems to me this is the path that OpenAI, Conjecture, and probably other AGI labs are aiming for, because they perceive it as the least risky or the only feasible one.
I’m somewhere in the middle of the cognitivist/enactivist spectrum. I think that e.g. relaxed adversarial training is motivated by trying to make an AI robust to arbitrary inputs it will receive in the world before it leaves the box. I’m sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it.
I’m not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., better cooperation and commitment mechanisms) concurrent with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.