state of my alignment research, and what needs work
in this post i give an overview of the state of my AI alignment research, as well as what i think needs to be worked on, particularly for people who might want to join my efforts.
at this point, my general feeling is that i’m not very confused about how to save the world, at least from a top-level view. a lot of people spend a lot of time feeling confused about how to deal with current ML models, what those models would optimize for, and how to wrangle them; i don’t. i ignore these questions and skip straight to how to build some powerful aligned agentic thing to save the world, and my model of how to do that doesn’t feel very confused from a top-level perspective. it’s just that it’s gonna take a lot of work, and many implementation details need to be filled in.
threat model
i still think the intelligence explosion caused by recursive self-improvement (RSI) is the most likely way we die — unfortunately, my in-depth thoughts about this seem potentially capability exfohazardous.
this has implications both for what the problem is and for what the solution is: for the problem, it implies we don’t particularly see things coming, and that we die quickly in a scenario akin to the kind yudkowsky would predict. for the solution, it implies that we might be able to use RSI to our advantage.
theory of change
it seems to me like coordination is too hard and decisive-strategic-advantage-enabling capabilities are close at hand. for these reasons, the way that i see the world being saved is one organization on its own building an aligned, singleton AI which robustly saves the world forever.
the one way to do this in which i’d have any confidence of the system staying continuously aligned / not being subject to the sharp left turn is implementing what i call formal alignment: a formal-goal-maximizing AI, given a formal goal whose maximization actually leads to good worlds, such that more capabilities applied to maximizing it only improve our odds.
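to illustrate what i mean (a toy rendering in notation i’m making up here, not a definitive formalization): a formal goal is a mathematical scoring function \(G\) over worlds, and the AI is anything that approximates

\[ a^\star \;\in\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \mathbb{E}\big[\, G(\text{world resulting from action } a) \,\big] \]

all of the alignment difficulty lives in picking \(G\) such that throwing more optimization power at this argmax makes outcomes better rather than worse.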
i believe we can build an RSI system which bootstraps such a scheme, and this can save us the very difficult work of building an accurate and robust-to-capabilities model of the world in the AI, ensuring it shares our concepts, and pointing to those; i explain this perspective in clarifying formal alignment implementation.
my current best shot for an aligned formal goal is QACI (see also a story of how it could work and a tentative sketch at formalizing it), which implements something like coherent extrapolated volition by extending a “past user”’s reflection to be simulated/considered arbitrarily many times, until alignment is solved.
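as a rough picture of that mechanism (my loose paraphrase, not the tentative formalization linked above; the symbols here are made up for illustration): let \(r_0\) be a mathematical snapshot of the past user’s deliberation, and let \(\mathrm{step}\) be an operator that gives that snapshot more time to reflect and lets it either declare an answer or ask to be considered again. the formal goal then defers to whatever the iterated reflection eventually settles on:

\[ r_{n+1} = \mathrm{step}(r_n), \qquad G = \mathrm{answer}(r_N) \;\text{ for the first } N \text{ at which } r_N \text{ declares itself done} \]

the point is that \(N\) isn’t fixed in advance: the reflection can be considered arbitrarily many times, until alignment is actually solved.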
things that need work
i’ve got something like a plan, and more importantly i’ve got — i think — a model of formal alignment that lets me do some exploration of the space of similar plans and update as i find better options. obtaining such a model seems important for anyone who’d want to join this general alignment agenda.
for someone to help, it’d probably be good for them to grok this model. other than that, work that needs to be done includes:
exploring the space of formal alignment, both around the current guessed local optimum by climbing along the hill of improvements, and by looking for entirely different plans such as this
figuring out some important pieces of math, such as locating patterns of bits in solomonoff hypotheses for worlds, running counterfactuals of them, and drawing up causation between them (see the sketch after this list)
finding other potentially useful true names, in case they change the ease of implementation of various formal alignment schemes
working on formal “inner alignment”: what does it take to build a powerful formal-goal-maximizing AI which actually maximizes its goal, instead of doing something else like being overtaken by demons / mesa-optimizers
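to gesture at the math item above (standard solomonoff-induction notation, not anything specific to my own formalization attempts): a hypothesis for a world is a program \(p\) run on a universal (monotone) machine \(U\), weighted by one standard form of the solomonoff prior

\[ M(x) \;=\; \sum_{p \,:\, U(p) \text{ starts with } x} 2^{-|p|} \]

the open problems are then to define what it means for a given pattern of bits (say, a snapshot of the user) to be located inside the execution of \(U(p)\), what it means to rerun that execution with the pattern counterfactually replaced, and how to read causal structure off the difference between the two runs.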
if you (yes, you!) are interested in helping with my agenda, but are worried that you might not be qualified, see so you think you’re not qualified to do technical alignment research?.
i’m increasingly investigating options to create an alignment organization and/or do some mentorship in a formal manner, but i’m also potentially open to doing some informal mentorship right now.
for any of these purposes, or anything else, you (yes, you!) are very welcome to get in touch with me (alignment discord, twitter, lesswrong, email visible at the top of my blog).