Rob Bensinger comments on “AI Alignment” is a Dangerously Overloaded Term

Rob Bensinger 16 Dec 2023 23:01 UTC
14 points
2
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like “maximize diamonds in an aligned way”, why not give it a bunch of small, grounded ones?
1. “Plan the factory layout of the diamond synthesis plant with these requirements”.
2. “Order the equipment needed, here’s the payment credentials”.
3. “Supervise construction this workday comparing to original plans”.
4. “Given this step of the plan, do it”.
5. (Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn’t to go build the thing; it’s that solving it is an indication that we’ve become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
That seems acceptable: industrial equipment causes accidents all the time, and the main thing is to limit the damage. Fences to limit the robot’s operating area, timers that shut down control after a timeout, etc.
If everyone in the world chooses to permanently use very weak systems because they’re scared of AI killing them, then yes, the impact of any given system failing will stay low. But that’s not what’s going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. ‘maybe humans don’t deserve to live’, ‘if I don’t do it someone else will anyway’, ‘if it’s that easy to destroy the world then we’re fucked anyway so I should just do the Modest thing of assuming nothing I do is that important’...).
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”. I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where “aligning” includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it’s powerful.)
- faul_sname 21 Dec 2023 4:33 UTC
  2 points
  0
  Thanks for the reply.
  
  That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
  
  This sounds like a good and reasonable approach, and also not at all like the sort of thing where you’re trying to instill any values at all into an ML system. I would call this “usable and robust tool construction”, not “AI alignment”. I expect standard business practice will look something like this: even when using LLMs in a production setting, you generally want to feed them the minimum context needed to get the results you want, and to have them produce outputs in some strict and usable format.
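
  A minimal sketch of that last point, assuming a hypothetical setup where the model is asked to confirm a single machine step. All the names here (`MachineStep`, `ALLOWED_ACTIONS`, `build_prompt`, `parse_strict_reply`) are made up for illustration, and no real LLM API is called; the idea is only that the prompt carries nothing beyond what the step needs, and that any reply outside the expected format is rejected rather than interpreted.

```python
import json
from dataclasses import dataclass

# Hypothetical whitelist: the only actions this narrow tool may report.
ALLOWED_ACTIONS = {"remove_output", "clean"}


@dataclass(frozen=True)
class MachineStep:
    machine_id: str
    action: str


def build_prompt(machine_id: str, action: str) -> str:
    """Build a prompt containing only what this one step needs."""
    return (
        "Reply with a single JSON object with exactly the keys "
        '"machine_id" and "action".\n'
        f"machine_id: {machine_id}\n"
        f"action: {action}"
    )


def parse_strict_reply(raw: str) -> MachineStep:
    """Parse a model reply, rejecting anything outside the expected format."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on non-JSON text.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"machine_id", "action"}:
        raise ValueError("reply must be an object with exactly the expected keys")
    if not isinstance(data["machine_id"], str) or not isinstance(data["action"], str):
        raise ValueError("both fields must be strings")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']!r}")
    return MachineStep(machine_id=data["machine_id"], action=data["action"])
```

  Here `parse_strict_reply('{"machine_id": "A53", "action": "clean"}')` returns a `MachineStep`, while a chatty reply or one with extra keys raises `ValueError`, so downstream code never has to guess what the model meant.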
  
  The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”.
  
  “How can I build a system powerful enough to stop everyone else from doing stuff I don’t like” sounds like more of a capabilities problem than an alignment problem.
  
  I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems
  
  Yeah, this sounds right to me. I expect that there’s a lot of danger inherent in biological gain-of-function research, but I don’t think the solution to that is to create a virus that will infect people and cause symptoms that include “being less likely to research dangerous pathogens”. Similarly, I don’t think “do research on how to make systems that can do their own research even faster” is a promising approach to solve the “some research results can be misused or dangerous” problem.