[outdated] My current theory of change to mitigate existential risk from misaligned ASI
Epistemic status: I’m pretty confident about this; looking for feedback and red-teaming.

As of 2023-06-30, I notice multiple epistemological errors in this document. I do not endorse the reasoning in it, and while my object-level claims haven’t changed radically, I am in the process of improving them using better epistemological procedures. I might update this post afterwards.

This post describes my current model of the world and the alignment problem, and my current plan to mitigate existential risk from misaligned ASI given these beliefs, my skills, and my situation. It is written both to communicate my plan and to get feedback for improving it.
Timelines
Here are my current beliefs:
Recursive self-improvement (RSI) is a potent attractor in the space of capabilities of AI systems.
Once a system achieves human-in-the-loop RSI (that is, RSI assisted by a research team that provides compute and deploys the capabilities improvements the system comes up with), it can, within the span of weeks, achieve autonomous RSI (that is, RSI without a human in the loop).
RSI is not bound by feedback loops to the real world.
You do not need real world data to improve an AI system’s capabilities enough to reach RSI.[1]
The large AI labs (OpenAI, DeepMind, Anthropic) will create sequence-modelling AI systems within the next 2.5 years that will be capable of at least human-in-the-loop RSI.
The first autonomous RSI AI system decides the fate of humanity after it fooms (that is, achieves superintelligence status, where it is significantly smarter and more capable than the entirety of humanity combined).
Given the stated beliefs, my current probability distribution for the creation of an artificial superintelligence (ASI) over the next decade is roughly a normal distribution with a mean 2.5 years from now (that is, around 2025-01-01) and a standard deviation of 0.5 years.[2]
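To make that concrete, here is a minimal sketch (my own toy illustration, not a rigorous forecast) of what this distribution implies for cumulative probabilities at different horizons. The zero point of roughly 2022-07-01 is an assumption inferred from “2.5 years from now (that is, around 2025-01-01)”:

```python
# Toy illustration of the stated timeline, treated as
# ASI arrival ~ Normal(mean = 2.5 years from now, sd = 0.5 years).
from scipy.stats import norm

asi_arrival_years = norm(loc=2.5, scale=0.5)  # mean ≈ 2025-01-01, sd ≈ 6 months

for horizon in (1.5, 2.0, 2.5, 3.0, 3.5):
    p = asi_arrival_years.cdf(horizon)  # cumulative probability by that horizon
    print(f"P(ASI within {horizon:.1f} years) ≈ {p:.2f}")
```

Under this toy calculation, the distribution puts roughly 16% of its mass before mid-2024 and roughly 84% before mid-2025, which is the sense in which I consider my timeline short.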
There are two main ways one can delay the creation of an ASI (so we have more time to solve the alignment problem): unilateral pivotal acts, and government-backed governance efforts to slow down AI capabilities research and investment. Unilateral pivotal acts seem like they would burn a lot of trust and social capital, which would actually accelerate race dynamics and animosity between the parties competing to launch an ASI, which is why I believe they should be avoided. In contrast, I expect government-backed efforts to regulate AI capabilities research and investment would not have such side effects and would give alignment researchers about a decade or two before someone creates an ASI, which may even be enough time to solve the problem.
Since the only relevant country when it comes to AI capabilities regulation is the USA, I am not in a position to impact government-backed governance efforts as an Indian citizen. This is especially true since my technical skills are significantly better than my skills at diplomacy and persuasion. Ergo, I must focus on alignment research and the researchers actually attempting to solve the problem.
Alignment research
I am most interested in alignment agendas that focus on what seem to be key bottlenecks to aligning ASIs. Specifically, the areas I believe are worth gigantic researcher investment are:
ensuring ontological robustness of goals and concepts (without the use of formally specified goals)
creating formally specified outer aligned goals
preventing inner misalignment
understanding mesaoptimizers better by understanding agency
Here’s a list of existing alignment agendas I find high value enough to track, and if possible, contribute to:
Orthogonal’s QACI (a formally specified outer aligned goal)
DeepMind et al.’s Causal Incentives Group
John Wentworth’s Understanding Agency agenda
Note: I still do not know exactly what object-level research John Wentworth is doing, and I expect that to be deliberate. I expect that tracking and contributing to the cluster of researchers mentored by John would be more fruitful[3].
Here’s a list of existing alignment agendas I haven’t looked at yet and think may be high value enough to also make a significant impact on our ability to solve the problem:
Diffractor’s research focus, particularly his work on the Infrabayesianism agenda
Davidad’s Open Agency Architecture
Vanessa Kosoy’s Learning-Theoretic Agenda
I believe her work contributes to making progress on formal-goal alignment. That said, I am pessimistic that her work will make a huge impact in the time we have.
Scott Garrabrant’s research (he worked on Logical Induction and Embedded Agency, and both seem to be extremely useful research contributions)
Evan Hubinger’s research
Based on his past research contributions, I expect that what he is working on is valuable. Unfortunately I do not know his current agenda well enough to comment on it.
ARC’s theoretical research, although I do not recall seeing significant contributions after ELK. More generally, their methodology for doing theoretical research seems quite valuable.
While these are not necessarily agendas, here’s a list of researchers whose writings are worth reading (though I haven’t read them in full):
Mark Xu’s writings seem like extremely useful distillations of conceptual alignment research, and I am interested in tracking them
Paul Christiano’s work (especially his older stuff), even if I disagree with his model of the alignment problem
Andrew Critch’s writings, which have been recommended to me by an alignment researcher I respect a lot
I am also excited about agendas that intend to accelerate alignment research in the time we have.
Cyborgism, the cluster of projects intended to make LLMs more useful for alignment research, seems promising, especially since these projects explore the space of tools that can make LLMs significantly more productive in a research workflow. This also (surprisingly!) seems to be a neglected cause, which is why I am interested in following what the researchers here come up with.
Adam Shimi’s writings on epistemology for alignment research
While I find shard theory incredibly intellectually stimulating and fun to make progress on, I do not believe it is enough to solve the alignment problem, since all of shard theory involves ontologically fragile solutions. Humans do not have ontologically robust formulations of terminal goals, and a human whose intellectual capabilities are enhanced 1000x will necessarily be misaligned with their current self. At best, shard theory serves as a keystone for the most powerful non-ontologically-robust alignment strategy we will have, and that is not very useful given my model of the problem.
Mechanistic interpretability (defined as the ability to take a neural network and convert it into human-readable code that effectively does the same thing), at its limit, should lead to solving both inner misalignment and ontological robustness, but I am pessimistic that we will solve mechanistic interpretability at its limit given my timeline. Worse, any progress towards mechanistic interpretability (also known simply as interpretability research) simply accelerates AI capabilities without any improvement in our ability to align AI models at their limit.
Shard theory research seems, on the surface, less damaging than interpretability research, because shard theory research focuses purely on injecting goals and ensuring they continue to stay there, while interpretability actually improves our understanding of models and makes our architectures and training processes more efficient, reducing the compute bound required for us to get to RSI. However, shard theory research relies on interpretability findings to make progress in detecting shards (here’s an example), and I assume progress in interpretability research would generally drive progress in non-ontologically-robust alignment approaches such as shard theory. I currently believe the trade-off of accelerating capabilities by doing interpretability research is very likely not worth the progress it unlocks in non-ontologically-robust alignment agendas, since it does not help with aligning AI models at their limit.
Finally, while I expect certain alignment agendas (such as Steven Byrnes’s Brain-like AGI) to produce positive expected value contributions to solving the problem, they do not aim squarely enough at what I consider the core problems for me to personally track and potentially contribute to them.
Note: There are many (new and established) independent alignment researchers who haven’t published written research contributions (or explicitly defined their agendas), who nonetheless make a significant positive impact on the alignment research community and are worth talking to and working with. I have simply chosen not to list them here.
mesaoptimizer’s theory of change
My plan is simple: accelerate technical alignment research in the key research bottlenecks by whatever means I have at hand that make the biggest impact. This means, in order of potential impact:
Create direct research contributions and output
Distill existing research contributions for existing and new alignment researchers, and point them in the direction of these bottlenecks (which is what Nate Soares seems to have been doing since late 2022 with his LessWrong posts)
Red-team research contributions in this space, and propose improvements
Offer support and help to researchers working on these bottlenecks, when I am in a position to do so
The better my skills at direct technical conceptual research are, the bigger my impact. This means I should also focus on improving my ability to do conceptual research work, but that is implicit in my definition of making direct research contributions anyway.
While I am uncertain about my ability to make original direct research contributions right now, I am confident in my ability to distill and to red-team existing research contributions. These seem to be relatively neglected ‘cause areas’, and I expect my intervention to make a big difference there (although not as much as direct research contributions).
I wish I could say that support for high-value alignment researchers working on these bottlenecks is not neglected, but this is absolutely not the case. Funding and visas are the two key bottlenecks, and I believe the current state of the ecosystem is abysmal compared to how it should be if we were actually trying to solve the problem. Anyway, I believe I am agentic enough, and smart enough, to provide informal support (particularly in the form of logistics and ops work) to researchers who still have to worry about such ‘chores’ that get in the way of them doing actual research.
My personal bottlenecks to working on all of these things are visas and funding (with visas being a significantly more painful bottleneck than funding). In the worst-case scenario, I may end up in a position where I have close-to-zero ability to make net positive contributions in these four ways towards solving the problem. I assign a 5% probability to this scenario occurring within my timeline, given my status as an Indian citizen. To mitigate this, I shall look out for ways to extend my logistical runway so I can continue to be useful towards solving the problem. I prefer not to discuss specifics of my plans about this in this post, but feel free to message me to talk about it or offer advice.
Onward to utopia.
[1] The argument for why this is the case is outside the scope of this post, and probably a capability externalities infohazard, so I choose to not discuss it here.
[2] The normal distribution is a good default given my uncertainty regarding further details about what scenarios we shall see.
[3] Especially since his SERI MATS strategy seems to be to mainly develop independent alignment researchers who work with each other instead of with him or on his work.