So as an engineer I have trouble engaging with this as a problem.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like “maximize diamonds in an aligned way”, why not give it a bunch of small, grounded ones?
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday, comparing it to the original plans”.
“Given this step of the plan, do it”.
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
And so on. Any goal the model doesn’t have empirical confidence in (confidence it only has for goals that are in distribution for its training environment) should be blocked by an outer framework, so the unqualified model never attempts it.
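To make that concrete, here is a rough sketch of the kind of outer gating framework I have in mind (the task kinds, measured success rates, threshold, and the `model.execute` call are all invented for illustration; this isn’t anyone’s actual design):

```python
# Hypothetical outer framework: the model may only attempt task kinds it has
# been empirically validated on (i.e. tasks in distribution for its training
# environment). Everything else is blocked and punted back to a human.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    kind: str      # e.g. "plan_layout", "order_equipment", "clean_machine_output"
    payload: dict  # concrete, bounded parameters for this one task

# Task kinds the model has been validated on, with success rates measured on
# held-out evaluation runs (numbers are purely illustrative).
VALIDATED_TASKS = {
    "plan_layout": 0.98,
    "order_equipment": 0.99,
    "clean_machine_output": 0.97,
}

CONFIDENCE_THRESHOLD = 0.95

def dispatch(task: TaskSpec, model, human_queue) -> None:
    """Let the model attempt a task only if that task kind is validated in-distribution."""
    measured = VALIDATED_TASKS.get(task.kind)
    if measured is None or measured < CONFIDENCE_THRESHOLD:
        # Not something the model has empirical confidence in:
        # block the model and hand the task to a human operator instead.
        human_queue.append(task)
        return
    model.execute(task)  # hypothetical narrow executor for this validated task kind
```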
I think the problem MIRI has with this is that the myopic model is not aware of context, and so it will sometimes do bad things. Maybe the diamonds are being cut into IC wafers and used in missiles to commit genocide.
Is that what it is? Or is the fear that one of these tasks could go badly wrong? That seems acceptable: industrial equipment causes accidents all the time, and the main thing is to limit the damage. Fences to limit the robots’ operating area, timers that shut down control after a timeout, etc.
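To give a toy picture of the damage-limiting measures I mean, something like a geofence plus a dead-man timer on the robot controller (the bounds, lease time, and class here are invented for illustration):

```python
import time

WORKCELL_BOUNDS = ((0.0, 5.0), (0.0, 5.0))  # x and y limits of the fenced area, in metres
LEASE_SECONDS = 2.0                         # control must be renewed at least this often

class SafetyInterlock:
    """Deny actuation if the control lease has expired or the target leaves the fence."""

    def __init__(self) -> None:
        self._last_renewal = time.monotonic()

    def renew(self) -> None:
        # Called periodically by the supervising process to keep control alive;
        # if it stops calling, moves are refused after LEASE_SECONDS.
        self._last_renewal = time.monotonic()

    def allow_move(self, x: float, y: float) -> bool:
        lease_ok = (time.monotonic() - self._last_renewal) < LEASE_SECONDS
        (x_lo, x_hi), (y_lo, y_hi) = WORKCELL_BOUNDS
        in_fence = x_lo <= x <= x_hi and y_lo <= y <= y_hi
        return lease_ok and in_fence
```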
I think the MIRI objection to that type of human-in-the-loop system is that it isn’t optimal: sometimes such a system will have to punt back to the human, and that’s slow, so the first effective system without a human in the loop will be vastly more effective and thus able to take over the world. Hence the old “that’s safe, but it doesn’t prevent someone else from destroying the world”:
We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.
So my impression is that the MIRI viewpoint is that if humanity is to survive, someone needs to solve the “disempower anyone who could destroy the world” problem, that they have to get it right on the first try, and that this is the hard part of the “alignment” problem. But I’m not super confident that interpretation is correct, and I’m quite confident that I find different parts of it salient than people in the MIRI idea space do.
Anyone who largely agrees with the MIRI viewpoint want to weigh in here?
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like “maximize diamonds in an aligned way”, why not give it a bunch of small, grounded ones?
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday, comparing it to the original plans”.
“Given this step of the plan, do it”.
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn’t to go build the thing; it’s that solving it is an indication that we’ve become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
That seems acceptable: industrial equipment causes accidents all the time, and the main thing is to limit the damage. Fences to limit the robots’ operating area, timers that shut down control after a timeout, etc.
If everyone in the world chooses to permanently use very weak systems because they’re scared of AI killing them, then yes, the impact of any given system failing will stay low. But that’s not what’s going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. ‘maybe humans don’t deserve to live’, ‘if I don’t do it someone else will anyway’, ‘if it’s that easy to destroy the world then we’re fucked anyway so I should just do the Modest thing of assuming nothing I do is that important’...).
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”. I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where “aligning” includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it’s powerful.)
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Thanks for the reply. This sounds like a good and reasonable approach, and also not at all like the sort of thing where you’re trying to instill any values at all into an ML system. I would call this “usable and robust tool construction”, not “AI alignment”. I expect standard business practice will look something like this: even when using an LLM in a production setting, you generally want to feed it the minimum context needed to get the results you want, and to have it produce outputs in some strict and usable format.
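For example, a minimal sketch of that pattern (the prompt, field names, and `call_llm` are placeholders, not any particular vendor’s API):

```python
import json

# Only the fields this one task needs go into the prompt; nothing else.
PROMPT_TEMPLATE = (
    "Given the equipment item below, return ONLY a JSON object with keys "
    '"vendor" (string) and "quantity" (integer).\n'
    "Item: {item}"
)

def parse_order(raw: str) -> dict:
    """Accept the model's output only if it is exactly the shape we asked for."""
    data = json.loads(raw)  # raises if the output isn't valid JSON
    if not isinstance(data, dict) or set(data) != {"vendor", "quantity"}:
        raise ValueError("unexpected structure in model output")
    if not isinstance(data["vendor"], str) or not isinstance(data["quantity"], int):
        raise ValueError("wrong field types in model output")
    return data

def order_equipment(item: str, call_llm) -> dict:
    # call_llm is a stand-in for whatever completion API is actually in use.
    raw = call_llm(PROMPT_TEMPLATE.format(item=item))
    return parse_order(raw)
```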
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”.
“How can I build a system powerful enough to stop everyone else from doing stuff I don’t like” sounds like more of a capabilities problem than an alignment problem.
I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems
Yeah, this sounds right to me. I expect that there’s a lot of danger inherent in biological gain-of-function research, but I don’t think the solution to that is to create a virus that will infect people and cause symptoms that include “being less likely to research dangerous pathogens”. Similarly, I don’t think “do research on how to make systems that can do their own research even faster” is a promising approach to solve the “some research results can be misused or dangerous” problem.