[outdated] My current theory of change to mitigate existential risk by misaligned ASI

mesaoptimizer21 May 2023 13:46 UTC

32 points

Epistemic status: ~~I’m pretty confident about this; looking for feedback and red-teaming.~~ As of 2023-06-30, I notice multiple epistemological errors in this document. I do not endorse the reasoning in it, and while my object-level claims haven’t changed radically, I am in the process of improving them using better epistemological procedures. I might update this after.

This post describes my current model of the world and the alignment problem, and my current plan to mitigate existential risk by misaligned ASI given these beliefs, my skills and my situation. It is written to both communicate my plan and to get feedback on it for improvement.

Timelines

Here are my current beliefs:

Recursive self-improvement (RSI) is a potent attractor in the space of capabilities in an AI systems.
Once a system achieves human-in-the-loop RSI (that is, RSI with the help of a research team helping it by providing compute and deploying capabilities improvements it comes up with), it can, within the span of weeks, achieve autonomous RSI (that is, RSI without a human-in-the-loop).
RSI is not bound by feedback loops to the real world.
You do not need real world data to improve an AI system’s capabilities enough to reach RSI.^[1]
The large AI labs (OpenAI, Deepmind, Anthropic) will create sequence-modelling AI systems within the next 2.5 years that will be capable of at least human-in-the-loop RSI.
The first autonomous RSI AI system decides the fate of humanity after it fooms (that is, achieves superintelligence status, where it is significantly smarter and more capable than the entirely of humanity combined).

Given the stated beliefs, my current probability distribution for the creation of an artificial superintelligence (ASI) over the next decade is roughly a normal distribution with a mean at 2.5 years from now (that is, around 2025-01-01), and a standard deviation of 0.5.^[2]

There are two main ways one can delay the creation of a ASI (so we have more time to solve the alignment problem): unilateral pivotal acts, and government-backed governance efforts to slow down AI capabilities research and investment. Unilateral pivotal acts seem like they burn a lot of trust and social capital that would actually accelerate race dynamics and animosity between parties competing to launch an ASI, which is why I believe that this should be avoided. In contrast, I expect government-backed efforts to regulate AI capabilities research and investment would not have such side effects and would give alignment researchers about a decade or two before someone creates an ASI, which may even be enough time to solve the problem.

Since the only relevant country when it comes to AI capabilities regulation is the USA, I am not in a position to impact government-backed governance efforts as an Indian citizen. This is especially true since my technical skills are significantly better than my skills at diplomacy and persuasion. Ergo, I must focus on alignment research and the researchers actually attempting to solve the problem.

Alignment research

I am most interested in alignment agendas that focus on what seem to be key bottlenecks to aligning ASIs. Specifically, the areas I believe are worth gigantic researcher investment are:

ensuring ontological robustness of goals and concepts (without the use of formally specified goals)
creating formally specified outer aligned goals
preventing inner misalignment
understanding mesaoptimizers better by understanding agency

Here’s a list of existing alignment agendas I find high value enough to track, and if possible, contribute to:

Orthogonal’s QACI (a formally specified outer aligned goal)
Deepmind et al.’s Causal Incentives Group
John Wentworth’s Understanding Agency agenda
Note: I still do not know what exactly what object-level research John Wentworth is doing, and I expect that to be deliberate. I expect that tracking and contributing to the cluster of researchers mentored by John would be more fruitful^[3].

Here’s a list of existing alignment agendas I haven’t looked at yet and think may be high value enough to also make a significant impact on our ability to solve the problem:

Diffractor’s research focus, particularly his work on the Infrabayesianism agenda
Davidad’s Open Agency Architecture
Vannessa Kosoy’s Learning-Theoretic Agenda
I believe her work contributes to making progress on formal-goal alignment. More generally though, I am pessmistic that her work will make a huge impact in the time we have.
Scott Garrabrant’s research (he worked on Logical Induction and Embedded Agency, and both seem to be extremely useful research contributions0
Evan Hubinger’s research
Based on his past research contributions, I expect that what he is working on is valuable. Unfortunately I do not know his current agenda well enough to comment on it.
ARC’s theoretical research, although I do not recall seeing significant contributions after ELK. More generally their methodology for doing theoretical research seems quite valuable.

While these are not necessarily agendas, here’s a list of researchers whose writings are worth reading (but that I haven’t entirely read):

Mark Xu’s writings seem like extremely useful distillations of conceptual alignment research, and I am interested in tracking it
Paul Christiano’s work (especially his older stuff), even if I disagree with his model of the alignment problem
Andrew Critch’s writings, which has been recommended to me by an alignment researcher I respect a lot

I also am excited about agendas that intend to accelerate alignment research in the time we have.

Cyborgism, or the cluster of projects intended to make LLMs more useful for alignment research, seems useful, especially since these projects seem to be exploring the space of tools that make LLMs significantly more useful for a research workflow. This also (surprisingly!) seems to be a neglected cause, which is why I am interested in following what the researchers here come up with.
Adam Shimi’s writings on epistemology for alignment research

While I find shard theory incredibly intellectually stimulating and fun to make progress on, I do not believe it is enough to solve the alignment problem since all of shard theory involves ontologically fragile solutions. Humans do not have ontologically robust formulations of terminal goals and a human whose intellectual capabilities is enhanced 1000x will necessarily be misaligned to their current self. At best, shard theory serves as a keystone for the most powerful non-ontologically-robust alignment strategy we will have, and that is not very useful given my model of the problem.

Mechanistic interpretability (defined as the ability to take a neural network and convert it into human-readable code that effectively does the same thing), at its limit, should lead to solving both inner misalignment and ontological robustness, but I am pessimistic that we solve mechanistic interpretability at its limit given my timeline. Worse, any progress towards mechanistic interpretability (also known simply as interpetability research) simply accelerates AI capabilities without any improvement in our ability to align AI models at their limit.

Shard theory research seems, on the surface, less damaging than interpretability reseach because shard theory research focuses purely on injecting goals and ensuring they continue to stay there, while interpretability actually improves our understanding of models and makes our architectures and training processes more efficient, reducing the compute bound required for us to get to RSI. However, shard theory research relies on interpretability research findings to make progress in detecting shards (here’s an example), and I assume progress in interpretability research would generally drive progress in non-ontologically-robust alignment approaches such as shard theory. I currently believe the trade-off of accelerating capabilities by doing interpretability research is very likely not worth the progress in non-ontologically-robust alignment agendas it unlocks, since it does not help with aligning AI models at their limit.

Finally, while I expect certain alignment agendas (such as Steven Byrnes’s Brain-like AGI) to produce positive expected value contributions to solving the problem, they do not aim at what I consider the core problems enough for me to personally track and potentially contribute to.

Note: There are many (new and established) independent alignment researchers who haven’t published written research contributions (or have explicitly defined their agenda), that make a significant positive impact on the alignment research community, and are worth talking to and working with. I simply have chosen to not list them here.

mesaoptimizer’s theory of change

My plan is simple: accelerate technical alignment research in the key research bottlenecks by whatever means I have at hand that make the biggest impact. This means, in the order of potential impact:

Create direct research contributions and output
Distill existing research contributions for existing and new alignment researchers, and pointing them in the direction of these bottlenecks (which is what Nate Soares seems to have been doing since late 2022 with his LessWrong posts)
Red-team research contributions in this space, and proposing improvements
Offer support and help to researchers working on these bottlenecks, when I am in the position to do so

The better my skills at direct technical conceptual research is, the bigger my impact. This means that I also should focus on improving my ability to do conceptual research work—but that is implicit in my definition of making direct research contributions anyway.

While I am uncertain about my ability to make original direct research contributions right now, I am confident of my ability to distill existing research contributions and to red-team existing research contributions. These seem to be relatively neglected ‘cause areas’ and I expect my intervention to make a big difference there (although not as much as direct research contributions).

I wish I could say that support for high value alignment researchers working on these bottlenecks is not neglected, but this is absolutely not the case. Funding and visas are the two key bottlenecks, and I believe that the current state of the ecosystem is abyssmal compared to how it should be if we were actually trying to solve the problem. Anyway, I believe I am agentic enough, and smart enough, to provide informal support (particularly in the form of logistics and ops work) to researchers who are not yet in a position where they do not need to worry about such ‘chores’ that get in the way of them doing actual research.

My personal bottlenecks to work on all of these things is visas and funding (with visas being a significantly more painful bottleneck than funding). In the worst case scenario, I may end up in a position where I have close-to-zero ability to make net positive contributions in these four ways towards solving the problem. I assign a 5% probability for this situation to occur in my timeline given my status as an Indian citizen. To mitigate this, I shall lookout for ways to extend my logistical runway to continue to be useful towards solving the problem. I prefer to not discuss specifics of my plans about this in this post, but feel free to message me to talk about it or offer advice.

Onward to utopia.

↩︎
The argument for why this is the case is outside the scope of this post, and probably a capability externalities infohazard, so I choose to not discuss it here.
↩︎
The normal distribution is a good default given my uncertainty regarding further details about what scenarios we shall see.
↩︎
Especially since his SERI MATS strategy seems to be to mainly develop independent alignment researchers who work with each other instead of with him or on his work

mesaoptimizer21 May 2023 13:46 UTC

32 points

8 comments6 min readLW link

Shmi 21 May 2023 22:39 UTC
4 points
0
Did your model change in the last 6 months or so, since the GPTx takeover? If so, how? Or is it a new model? If so, can you mentally go back to pre-GP-3.5 and construct the model then? Basically, I wonder which of your beliefs changes since then.
- mesaoptimizer 22 May 2023 6:55 UTC
  5 points
  2
  Parent
  Your question seems to focus mainly on timeline model and not alignment model, so I shall focus on explaining how my model of the timeline has changed.
  
  My timeline shortened from about four years (mean probability) to my current timeline of about 2.5 years (mean probability) since the GPT-4 release. This was because of two reasons:
  - gut-level update on GPT-4′s capability increases: we seem quite close to human-in-the-loop RSI.
  - a more accurate model for bounds on RSI. I had previously thought that RSI would be more difficult than I think it is now.
  The latter is more load-bearing than the former, although my predictions for how soon AI labs will achieve human-in-the-loop RSI creates an upper bound on how much time we have (assuming no slowdown), which is quite useful when making your timeline.
Daniel Kokotajlo 21 May 2023 21:26 UTC
2 points
0
Nice post! You seem like you know what you are doing. I’d be curious to hear more about what you think about these priority areas, and why interpretability didn’t make the list:
- ensuring ontological robustness of goals and concepts (without the use of formally specified goals)
- creating formally specified outer aligned goals
- preventing inner misalignment
- understanding mesaoptimizers better by understanding agency
Thanks and good luck!
- mesaoptimizer 26 May 2023 15:40 UTC
  3 points
  0
  Parent
  Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
  
  Interpretability didn’t make the list because of the following beliefs of mine:
  - Interpretability—specifically interpretability-after-training—seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, but I doubt this strategy scales until ASI. I expect it to break almost immediately as someone begins a human-in-the-loop RSI, especially since I expect (at the very least) significant changes in the architecture of neural network models that would result in capability improvements. This is why I predict that investing in interpretability research is not the best idea.
  - A counterpoint is the notion that we can accelerate alignment with sufficiently capable aligned ‘oracle’ models—and this seems to be OpenAI’s current strategy: build ‘oracle’ models that are aligned enough to accelerate alignment research, and use better alignment techniques on the more capable models. Since one can both accelerate capabilities research and alignment research with capable enough oracle models, however, OpenAI would also choose to accelerate capabilities research alongside their attempt to accelerate alignment research. The question then is whether OpenAI is cautious enough as they balance out the two—and recent events have not made me optimistic about this being the case.
  - Interpretability research does help accelerate some of the alignment agendas I have listed by providing insights that may be broad enough to help; but I expect that such insights to probably be found through other approaches too, and the fact that interpretability research either involves not working on more robust alignment plans or leads to capability insights, both seem to make me averse to considering working on interpretability research.
  Here’s a few facets of interpretability research that I am enthusiastic about tracking, but not excited enough to want to work on, as of writing:
  - Interpretability-during-training probably would be really useful, and I am more optimistic about it than interpretability-after-training. I expect that at the limit, interpretability-during-training leads to progress towards ensuring ontological robustness of values.
  - Interpretability (both after-training and during-training) will help with detecting and making interventions when it comes to inner misalignment. That’s a great benefit, that I haven’t really thought about until I decided to reflect and answer your question.
  - Interpretability research seems very focused on ‘oracles’—sequence modellers and supervised learning systems—and interpretability research on RL models seems neglected. I would like to see more research done on such models, because RL-style systems seems more likely to lead to RSI and ASI, and insights we gain might help alignment research in general.
  I’m really glad you asked me this question! You’ve helped me elicit (and develop) a more nuanced view on interpretability research.
Vladimir_Nesov 21 May 2023 14:45 UTC
2 points
0

unilateral pivotal acts, and government-backed governance efforts to slow down AI capabilities research and investment

There is nothing inherently unilateral about pivotal acts. The problem with an international moratorium is that with enforcement tools that are readily available it’s unlikely to last as long as it needs to for human-level alignment theory to catch up. Being government-backed is not part of the problem. Pivotal AI-enabled trajectories of development can help with that, by providing the tools for a more reliable international moratorium and for getting to a place where the field of alignment is actually ready for tackling more capability.
- mesaoptimizer 22 May 2023 8:34 UTC
  5 points
  −2
  Parent
  When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines^[1]. Perhaps I should start using a more precise term to describe this from now on.
  
  It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three letter agencies) to initiate a moratorium, much less consensus in the US government or between US and EU (especially given the rather interesting strategy EU is trying with their AI Act).
  
  I continue to consider all superhuman-enforcement-tool-based pivotal acts as unilateral given this belief. My use of the world “unilateral” points to the fact that the organizations and people who currently have a non-trivial influence over the state of the world and its future will almost entirely be blindsided by the pivotal act, and that will result in destruction of trust and chaos and an increase in conflict. And I currently believe that this is actually more likely to increase P(doom) or existential risk for humanity, even if it extends the foom timeline.
  ↩︎
  Although not preventing ASI creation entirely. The destruction of humanity’s potential is also an existential risk, and the inability for us to create a utopia is too painful to bear.
- Daniel Kokotajlo 21 May 2023 21:24 UTC
  2 points
  0
  Parent
  How long do you think such a moratorium would last?
  - Vladimir_Nesov 21 May 2023 22:33 UTC
    2 points
    0
    Parent
    There is nothing physically impossible about it lasting however long it needs to, that’s only implausible for the same political and epistemic reasons that any global moratorium at all is implausible. GPUs don’t grow on trees.
    
    My point in the above comment is that pivotal acts don’t by their nature stay apart, a conventional moratorium that actually helps is also a pivotal act. Pivotal act AIs are something like task AIs that can plausibly be made to achieve a strategically relevant effect relatively safely, well in advance of actually having an understanding necessary to align a general agentic superintelligence, using alignment techniques designed around lack of such an understanding. Advances made by humans with use of task AIs could then increase robustness of a moratorium’s enforcement (better cybersecurity and compute governance), reduce the downsides of the moratorium’s presence (tool AIs allowed to make biotech advancements), and ultimately move towards being predictably ready for a superintelligent AI, which might initially look like developing alignment techniques that work for making more and more powerful task AIs safely. Scalable molecular manufacturing of compute is an obvious landmark, and can’t end well without robust compute governance already in place. Human uploading is another tool that can plausibly be used to improve global security without having a better understanding of AI alignment.
    
    (I don’t see what we currently know justifying Hanson’s concern of never making enough progress to lift a value drift moratorium. If theoretical progress can get feedback from gradually improving task AIs, there is a long way to go before concluding that the process would peter out before superintelligence, so that taking any sort of plunge is remotely sane for the world. We haven’t been at it for even a million years yet.)