Draft: A cartoonishly simplified overview of some technical AI alignment efforts
Epistemic status: excessive lossy compression applied
How are people actually trying to make friendly AI? Here are a few simplified examples.
LessWrong has some great technical and critical overviews of alignment agendas, but for many readers they take too long to read.
But first, what is the problem we are trying to solve? I would say that we want to program AIs that have our values (or better ones). And we want those values to persist, even when the AI gets smarter, even after long spans of subjective time, even when encountering situations humans have never encountered before.
This is hard because we don’t know how to do any of those things! So how do people propose to solve it?
Here’s my attempt at cartoonishly simplified explanations of technical alignment efforts:
Let’s make AI solve it
Just make a friendly dumb AI that will help make a smarter friendly AI, and so on—Iterated Amplification (see the toy sketch after this list)
Just make AIs debate each other, so that the truth comes out when we look at both sides of the argument—Debate
Just have the AI follow a constitution—Constitutional AI
Just make a friendly automatic alignment researcher—Superalignment
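A minimal toy sketch of the amplification idea in Python. Everything here is made up for illustration: weak_answer stands in for a weak but trusted model, the decomposition is hard-coded arithmetic rather than learned behaviour, and the distillation step of real Iterated Amplification is only mentioned in a comment.

```python
# Toy sketch of the amplification step in Iterated Amplification.
# All names are illustrative; weak_answer stands in for a weak but trusted model.

def weak_answer(question: str) -> str:
    """A weak model that can only answer single-step questions."""
    toy_knowledge = {"2+2": "4", "4+3": "7"}
    return toy_knowledge.get(question, "I don't know")

def amplify(question: str) -> str:
    """The 'amplified' system: decompose a hard question into subquestions
    the weak model can handle, then combine the answers."""
    if question.count("+") > 1:
        first, rest = question.split("+", 1)
        head, tail = rest.split("+", 1)
        partial = amplify(f"{first}+{head}")   # answer an easier subquestion first
        return amplify(f"{partial}+{tail}")    # fold the result into the rest
    return weak_answer(question)

# Distillation (not shown) would train a faster model to imitate amplify(),
# and that distilled model becomes the weak model of the next round.
print(amplify("2+2+3"))  # -> "7"
```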
Let’s have lots of AIs interacting:
Just have multiple focused AIs competing in various roles, kind of like a corporation—Drexler’s The Open Agency Model
Just use dumb AIs to build an understandable world simulation, then train a smart AI in that simulation so that we can verify that it’s aligned—DavidAD’s plan
Just have the AI learn and respect other beings’ boundaries—Boundaries/Membranes
Let’s make sure the goal is good
Just design the AI to want to help humans but be maximally uncertain about how to do that, so it constantly seeks guidance—CIRL (see the sketches after this list)
Just make a lazy AI that is satisfied after doing “enough”—Mild Optimization (see the sketches after this list)
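To give the CIRL item some flavour, here is a toy value-of-information calculation, not the actual CIRL game: an agent that is uncertain which reward the human has will prefer to ask rather than act when guessing wrong is costly. All numbers and names are made up.

```python
# Toy sketch of the intuition behind CIRL: under reward uncertainty,
# asking the human can beat acting on a best guess. All numbers are made up.

def expected_value_of_acting(p_reward_a, payoff_right=1.0, payoff_wrong=-10.0):
    """Expected payoff of acting on the most likely of two candidate rewards."""
    best_guess = max(p_reward_a, 1 - p_reward_a)
    return best_guess * payoff_right + (1 - best_guess) * payoff_wrong

def choose(p_reward_a, cost_of_asking=0.5, payoff_right=1.0):
    """Ask when the expected value of asking exceeds that of acting."""
    value_if_asking = payoff_right - cost_of_asking  # human reveals the true reward
    if value_if_asking > expected_value_of_acting(p_reward_a):
        return "ask the human"
    return "act on best guess"

print(choose(0.55))  # "ask the human": too uncertain, a wrong guess is very costly
print(choose(0.99))  # "act on best guess": nearly sure what the human wants
```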
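And for mild optimization, a minimal sketch of a quantilizer, one concrete formalization of the idea: instead of taking the single highest-utility action, sample from a typical base distribution and pick randomly among the top fraction. The action space and utility function below are made up.

```python
# Minimal quantilizer sketch (one formalization of mild optimization).
# The action space, utility function, and base distribution are all toy stand-ins.
import random

def quantilize(utility, base_sample, q=0.1, n=1000):
    """Sample n actions from a safe/typical base distribution, then choose
    uniformly at random from the top q fraction by utility, rather than the argmax."""
    samples = [base_sample() for _ in range(n)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]
    return random.choice(top)

actions = list(range(100))                     # toy action space
utility = lambda a: -(a - 80) ** 2             # toy objective, peaked at action 80
base_sample = lambda: random.choice(actions)   # stand-in for "typical" behaviour

print(quantilize(utility, base_sample))  # usually near 80, but not relentlessly the extreme argmax
```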
Let’s build tools that will let us control smarter AIs
Just read their minds—ELK
Just edit their minds—Activation Steering (see the sketch after this list)
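A minimal sketch of the activation steering idea using plain numpy and random stand-in activations; no real model or library API is involved. In practice the vectors would come from a transformer’s residual stream on contrastive prompts.

```python
# Minimal activation-steering sketch using random stand-in activations.
# No real model is involved; the vectors are placeholders for recorded hidden states.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Pretend these were recorded on contrastive prompts (e.g. honest vs. deceptive completions).
h_positive = rng.normal(size=d_model)
h_negative = rng.normal(size=d_model)
steering_vector = h_positive - h_negative      # direction we want to push the model towards

def steer(hidden_state, alpha=2.0):
    """Edit the 'mind' by adding the steering vector to a layer's activations."""
    return hidden_state + alpha * steering_vector

h = rng.normal(size=d_model)                   # activation at some layer during a forward pass
h_steered = steer(h)
print(float(np.dot(h_steered - h, steering_vector)) > 0)  # True: moved along the steering direction
```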
Let’s understand more
What it means to be an agent—agent foundations
Some historical proposals sounded promising but seem to have been abandoned for now; I include them to show how hard the problem is:
Just keep the AI in a sandbox environment where it can’t cause any real-world harm while we continue to align and train it—AI Boxing
Just make an oracle AI that only talks, not acts
Just simulate a human thinking for a very long time about the alignment problem, and use whatever solution they write down—a proposal by Christiano (2012)
Just do what we would wish for if we knew more, thought faster, were more the people we wished we were—coherent extrapolated volition (CEV)
I’ve left out the many debates over the proposals. I’m afraid that you need to dig much deeper to judge which methods will work. If you want to know more, just follow the links below.
If you dislike this, please help me make it better by contributing better summaries, and I’ll be pleased to include them.
If you would like to know more, I recommend these overviews:
2023 - Shallow review of live agendas in alignment & safety—I’ve drawn heavily from this post, which has one-sentence summaries as well as much, much more
2022 - A newcomer’s guide to the technical AI safety field
2023 - A Brief Overview of AI Safety/Alignment Orgs, Fields, Researchers, and Resources for ML Researchers
2023 - The Genie in the Bottle: An Introduction to AI Alignment and Risk
2022 - On how various plans miss the hard bits of the alignment challenge
2022 - (My understanding of) What Everyone in Technical Alignment is Doing and Why
If anyone finds this useful, please let me know. I’ve abandoned it because none of my test audience found it interesting or useful. That’s OK, it just means it’s better to focus on other things.
In particular, I’d be keen to know what @Stag and @technicalities think, as this was in large part inspired by the desire to further simplify and categorise the “one sentence summaries” from their excellent Shallow review of live agendas in alignment & safety