wassname comments on wassname’s Shortform

wassname 11 Oct 2024 5:31 UTC
5 points
0
Draft: A cartoonishly simplified overview of some technical AI alignment efforts

Epistemic status: excessive lossy compression applied

How are people actually trying to make friendly AI? Here are a few simplified examples.

LessWrong has some great technical and critical overviews of alignment agendas, but for many readers they take too long to read.

Here’s my attempt at cartoonishly simplified explanations of technical alignment efforts:
- Let’s make AI solve it
  - Just make a friendly dumb AI and that will make a smarter friendly AI, and so on—Iterated Amplification
  - Just make AIs debate each other, so that the truth comes out when we look at both sides of the argument
  - Just have the AI follow a constitution
  - Just make a friendly automatic alignment researcher—Superalignment
- Let’s have lots of AIs interacting
  - Just have multiple focused AIs competing in various roles, kind of like a corporation—Drexler’s The Open Agency Model
  - Just use dumb AIs to build an understandable world simulation, then train a smart AI in that simulation so that we can verify that it’s aligned—DavidAD’s plan
  - Just have the AI learn and respect other beings’ boundaries—Boundaries/Membranes
- Let’s make sure the goal is good
  - Just design the AI to want to help humans but be maximally uncertain about how to do that, so it constantly seeks guidance—CIRL
  - Just make a lazy AI that is satisfied after doing “enough”—Mild Optimization
- Let’s build tools that will let us control smarter AIs
  - Just read their minds—ELK
  - Just edit their minds—Activation Steering
- Let’s understand more
  - What it means to be an agent—agent foundations
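
To make the "lazy AI" idea a little more concrete, here is a toy sketch of mild optimization in Python. Everything in it is my own illustration rather than anyone's actual proposal: the function names, the effort-level actions, and the utility function are all made up, and the `quantilizer` is only loosely in the spirit of quantilization (it assumes a uniform base distribution over actions).

```python
import random

def maximizer(actions, utility):
    """Pick the single highest-utility action, with no restraint."""
    return max(actions, key=utility)

def satisficer(actions, utility, threshold):
    """Return the first action that is 'good enough', or None if none is."""
    for action in actions:
        if utility(action) >= threshold:
            return action
    return None

def quantilizer(actions, utility, top_fraction, rng):
    """Sample uniformly from the top fraction of actions, ranked by utility.

    Assumption: a uniform base distribution over actions (a simplification).
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:k])

# Hypothetical toy setup: actions are "effort levels" 0..10,
# and utility simply equals the effort level.
actions = list(range(11))
utility = lambda effort: effort

print(maximizer(actions, utility))                # 10: pushes as hard as possible
print(satisficer(actions, utility, threshold=3))  # 3: stops at "enough"
print(quantilizer(actions, utility, 0.3, random.Random(0)))  # one of 8, 9, 10
```

The maximizer always goes to the extreme, while the other two stop short of it. Whether that restraint survives in a capable real-world system is exactly the kind of debate I've left out.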
Some historic proposals sounded promising but seem to have been abandoned for now. I include them to show how hard the problem is:
- Just keep the AI in a sandbox environment where it can’t cause any real-world harm while we continue to align and train it—AI Boxing
- Just make an oracle AI that only talks, not acts
- Just simulate a human thinking for a very long time about the alignment problem, and use whatever solution they write down—TODO: another proposal by Christiano (2012)
- Just do what we would wish for if we knew more, thought faster, and were more the people we wished we were—Coherent Extrapolated Volition (CEV)
I’ve left out the many debates over the proposals. I’m afraid that you need to dig much deeper to judge which methods will work. If you want to know more, just follow the links below.

If you dislike this: please help me make it better by contributing better summaries, and I’ll be pleased to include them.

If you would like to know more, I recommend these overviews:
- 2023 - Shallow review of live agendas in alignment & safety—I’ve borrowed heavily from this post, which has one-sentence summaries as well as much, much more
- 2022 - A newcomer’s guide to the technical AI safety field
- 2023 - A Brief Overview of AI Safety/Alignment Orgs, Fields, Researchers, and Resources for ML Researchers
- 2023 - The Genie in the Bottle: An Introduction to AI Alignment and Risk
- 2022 - On how various plans miss the hard bits of the alignment challenge
- 2022 - (My understanding of) What Everyone in Technical Alignment is Doing and Why
- wassname 11 Oct 2024 5:31 UTC
  2 points
  −1
  Parent
  If anyone finds this useful, please let me know. I’ve abandoned it because none of my test audience found it interesting or useful. That’s OK, it just means it’s better to focus on other things.
  - wassname 11 Oct 2024 6:03 UTC
    1 point
    0
    Parent
    In particular, I’d be keen to know what @Stag and @technicalities think, as this was in large part inspired by the desire to further simplify and categorise the “one sentence summaries” from their excellent Shallow review of live agendas in alignment & safety.
