Or in less metaphorical language, the worry is mostly that it’s hard to give the AI the specific goal you want to give it, not so much that it’s hard to make it have any goal at all.
At least some people are worried about the latter, for a very particular meaning of the word “goal”. From that post:
Finally, I’ll note that the diamond maximization problem is not in fact the problem “build an AI that makes a little diamond”, nor even “build an AI that probably makes a decent amount of diamond, while also spending lots of other resources on lots of other stuff” (although the latter is more progress than the former). The diamond maximization problem (as originally posed by MIRI folk) is a challenge of building an AI that definitely optimizes for a particular simple thing, on the theory that if we knew how to do that (in unrealistically simplified models, allowing for implausible amounts of (hyper)computation) then we would have learned something significant about how to point cognition at targets in general.
I think to some extent this is a matter of “yes, I see that you’ve solved the problem in practical terms, and yes, every time we try to implement the theoretically optimal solution it fails due to Goodharting, but we really want the theoretically optimal solution”, which is… not universally agreed, to say the least. But it is a concern some people have.
Hm, I think that paragraph is talking about the problem of getting an AI to care about a specific particular thing of your choosing (here diamond-maximising), not any arbitrary particular thing at all with no control over what it is. The MIRI-esque view thinks the former is hard and the latter happens inevitably.
I don’t think we have any way of getting an AI to “care about” any arbitrary particular thing at all, by the “attempt to maximize that thing, self-correct towards maximizing that thing if the current strategies are not working” definition of “care about”. Even if we relax the “and we pick the thing it tries to maximize” constraint.
“We don’t currently have any way of getting any system to learn to robustly optimize for any specific goal once it enters an environment very different from the one it learned in” is my own view, not Nate’s.
Like I think the MIRI folks are concerned with “how do you get an AGI to robustly maximize any specific static utility function that you choose”.
I am aware that the MIRI people think that the latter is inevitable. However, as far as I know, we don’t have even a single demonstration of “some real-world system that robustly maximizes any specific static utility function, even if that utility function was not chosen by anyone in particular”, nor do we have any particular reason to believe that such a system is practical.
And I think Nate’s comment makes it pretty clear that “robustly maximize some particular thing” is what he cares about.
To be clear: The diamond maximizer problem is about getting specific intended content into the AI’s goals (“diamonds” as opposed to some random physical structure it’s maximizing), not just about building a stable maximizer.
If you relax the “specific intended content” constraint, and allow for maximizing any random physical structure, as long as it’s always the same physical structure in the real world and not just some internal metric that has historically correlated with the amount of that structure that existed in the real world, does that make the problem any easier / is there a known solution? My vague impression was that the answer was still “no, that’s also not a thing we know how to do”.
So as an engineer I have trouble engaging with this as a problem.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like “maximize diamonds in an aligned way”, why not give it a bunch of small, grounded ones?
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday comparing to original plans”
“Given this step of the plan, do it”
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
And so on. And if a goal isn’t something the model has empirical confidence in (because it isn’t in distribution for the training environment), an outer framework should block the unqualified model from attempting it.
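A minimal sketch of what such an outer gating framework might look like, assuming a hypothetical `model` object with an out-of-distribution confidence estimate (none of these names are an existing API):

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str          # e.g. "plan_factory_layout", "order_equipment"
    description: str

# Task kinds the model has been empirically validated on (in distribution).
VALIDATED_TASK_KINDS = {
    "plan_factory_layout",
    "order_equipment",
    "supervise_construction",
    "execute_plan_step",
    "unload_and_clean_machine",
}

def run_with_gate(model, task: Task, confidence_threshold: float = 0.95):
    """Refuse any goal the model is not qualified to attempt."""
    if task.kind not in VALIDATED_TASK_KINDS:
        raise PermissionError(f"Task kind {task.kind!r} is not validated; refusing.")
    # `estimate_in_distribution` is a stand-in for whatever OOD detector you trust.
    if model.estimate_in_distribution(task.description) < confidence_threshold:
        raise PermissionError("Task looks out-of-distribution for this model; refusing.")
    return model.run(task.description)
```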
I think the problem MIRI has is that this myopic model is not aware of context, and so it will sometimes do bad things. Maybe the diamonds are being cut into IC wafers and used in missiles to commit genocide.
Is that what it is? Or maybe the fear is that one of these tasks could go badly wrong? That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage: fences to limit the robots’ operating area, timers that shut down control after a timeout, etc.
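A toy sketch of those damage-limiting measures, assuming an invented `robot` interface and made-up operating-area bounds:

```python
import time

OPERATING_AREA_M = {"x": (0.0, 50.0), "y": (0.0, 30.0)}  # fenced area, metres
WATCHDOG_TIMEOUT_S = 5.0

def inside_fence(x: float, y: float) -> bool:
    (x_lo, x_hi), (y_lo, y_hi) = OPERATING_AREA_M["x"], OPERATING_AREA_M["y"]
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

class Watchdog:
    """Cuts control if the supervising process stops refreshing the timer."""
    def __init__(self, timeout_s: float = WATCHDOG_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()

    def heartbeat(self) -> None:
        self.last_heartbeat = time.monotonic()

    def expired(self) -> bool:
        return time.monotonic() - self.last_heartbeat > self.timeout_s

def safe_move(robot, x: float, y: float, watchdog: Watchdog) -> None:
    """Forward a move command only if control is live and the target is inside the fence."""
    if watchdog.expired():
        robot.emergency_stop()
    elif inside_fence(x, y):
        robot.move_to(x, y)
    # Out-of-bounds commands are simply dropped.
```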
I think the MIRI objection to that type of human-in-the-loop system is that it’s not optimal because sometimes such a system will have to punt back to the human, and that’s slow, and so the first effective system without a human in the loop will be vastly more effective and thus able to take over the world, hence the old “that’s safe but it doesn’t prevent someone else from destroying the world”.
We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.
So my impression is that the MIRI viewpoint is that if humanity is to survive, someone needs to solve the “disempower anyone who could destroy the world” problem, and that they have to get that right on the first try, and that’s the hard part of the “alignment” problem. But I’m not super confident that that interpretation is correct and I’m quite confident that I find different parts of that salient than people in the MIRI idea space.
Anyone who largely agrees with the MIRI viewpoint want to weigh in here?
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like “maximize diamonds in an aligned way”, why not give it a bunch of small, grounded ones?
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday comparing to original plans”
“Given this step of the plan, do it”
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn’t to go build the thing; it’s that solving it is an indication that we’ve become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage: fences to limit the robots’ operating area, timers that shut down control after a timeout, etc.
If everyone in the world chooses to permanently use very weak systems because they’re scared of AI killing them, then yes, the impact of any given system failing will stay low. But that’s not what’s going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. ‘maybe humans don’t deserve to live’, ‘if I don’t do it someone else will anyway’, ‘if it’s that easy to destroy the world then we’re fucked anyway so I should just do the Modest thing of assuming nothing I do is that important’...).
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”. I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where “aligning” includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it’s powerful.)
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
This sounds like a good and reasonable approach, and also not at all like the sort of thing where you’re trying to instill any values at all into an ML system. I would call this “usable and robust tool construction”, not “AI alignment”. I expect standard business practice will look something like this: even when using an LLM in a production setting, you generally want to feed it the minimum context needed to get the results you want, and to have it produce outputs in some strict and usable format.
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”.
“How can I build a system powerful enough to stop everyone else from doing stuff I don’t like” sounds like more of a capabilities problem than an alignment problem.
I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems
Yeah, this sounds right to me. I expect that there’s a lot of danger inherent in biological gain-of-function research, but I don’t think the solution to that is to create a virus that will infect people and cause symptoms that include “being less likely to research dangerous pathogens”. Similarly, I don’t think “do research on how to make systems that can do their own research even faster” is a promising approach to solve the “some research results can be misused or dangerous” problem.
This is rather off-topic here, but for any AI that has an LLM as a component of it, I don’t believe diamond-maximization is a hard problem, apart from Inner Alignment problems. The LLM knows the meaning of the word ‘diamond’ (GPT-4 defined it as “Diamond is a solid form of the element carbon with its atoms arranged in a crystal structure called diamond cubic. It has the highest hardness and thermal conductivity of any natural material, properties that are utilized in major industrial applications such as cutting and polishing tools. Diamond also has high optical dispersion, making it useful in jewelry as a gemstone that can scatter light in a spectrum of colors.”). The LLM also knows its physical and optical properties, its social, industrial and financial value, its crystal structure (with images and angles and coordinates), what carbon is, its chemical properties, how many electrons, protons and neutrons a carbon atom can have, its terrestrial isotopic ratios, the half-life of carbon-14, what quarks a neutron is made of, etc. etc. etc. — where it fits in a vast network of facts about the world. Even if the AI also had some other very different internal world model and ontology, there’s only going to be one “Rosetta Stone” optimal-fit mapping between the human ontology that the LLM has a vast amount of information about and any other arbitrary ontology, so there’s more than enough information in that network of relationships to uniquely locate the concepts in that other ontology corresponding to ‘diamond’. This is still true even if the other ontology is larger and more sophisticated: for example, locating Newtonian physics in relativistic quantum field theory and mapping a setup from the former to the latter isn’t hard: its structure is very clearly just the large-scale low-speed limiting approximation.
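A toy illustration of that “optimal-fit mapping” idea: score candidate mappings between two made-up miniature ontologies by how many relations they preserve, and keep the best. (These tiny graphs can still have ties; the argument above is that the enormous relational structure of a real ontology is what pins the mapping down to essentially one best fit.)

```python
from itertools import permutations

# Two made-up miniature ontologies: concept -> set of related concepts.
human_ontology = {
    "diamond": {"carbon", "crystal"},
    "carbon":  {"diamond", "atom"},
    "crystal": {"diamond"},
    "atom":    {"carbon"},
}
alien_ontology = {
    "K1": {"K2", "K3"},
    "K2": {"K1", "K4"},
    "K3": {"K1"},
    "K4": {"K2"},
}

def preserved_relations(mapping, src, dst):
    """Count src relations that still hold after translating both ends."""
    return sum(
        1
        for concept, neighbours in src.items()
        for n in neighbours
        if mapping[n] in dst.get(mapping[concept], set())
    )

def best_mapping(src, dst):
    src_keys, dst_keys = list(src), list(dst)
    candidates = (dict(zip(src_keys, p)) for p in permutations(dst_keys, len(src_keys)))
    return max(candidates, key=lambda m: preserved_relations(m, src, dst))

print(best_mapping(human_ontology, alien_ontology))
# -> {'diamond': 'K1', 'carbon': 'K2', 'crystal': 'K3', 'atom': 'K4'}
# (one of two tied fits in this tiny example; a richer relation network breaks such ties)
```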
The point where this gets a little more challenging is Outer Alignment, where you want to write a mathematical or pseudocode reward function for training a diamond optimizer using Reinforcement Learning (assuming our AI doesn’t just have a terminal goal utility function slot that we can directly connect this function to, like AIXI): then you need to also locate the concepts in the other ontology for each element in something along the lines of “pessimizingly estimate the total number of moles of diamond (having at a millimeter-scale-average any isotopic ratio of C-12 to C-13 but no more than N1 times the average terrestrial proportion of C-14, discounting any carbon atoms within N2 C-C bonds of a crystal-structure boundary, or within N3 bonds of a crystal-structure dislocation, or within N4 bonds of a lattice substitution or vacancy, etc. …) at the present timeslice in your current rest frame inside the region of space within the future-directed causal lightcone of your creation, and subtract the answer for the same calculation in a counterfactual alternative world-history where you had permanently shut down immediately upon being created, but the world-history was otherwise unchanged apart from future causal consequences of that divergence”. [Obviously this is a specification design problem, and the example specification above may still have bugs and/or omissions, but there will only be a finite number of these, and debugging this is an achievable goal, especially if you have a crystallographer, a geologist, and a jeweler helping you, and if a non-diamond-maximizing AI also helps by asking you probing questions. There are people whose jobs involve writing specifications like this, including in situations with opposing optimization pressure.]
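A pseudocode sketch of that specification, with hypothetical placeholders throughout (the `estimate_diamond_moles` estimator is exactly where the hard ontology-identification work would live; N1 through N4 are the unspecified thresholds from the text):

```python
def estimate_diamond_moles(world_state, **criteria):
    """Placeholder: a pessimistic estimator of moles of qualifying diamond in a world state."""
    raise NotImplementedError

def diamond_reward(world_state, counterfactual_world_state,
                   n1_c14_ratio, n2_boundary_bonds, n3_dislocation_bonds, n4_defect_bonds):
    def countable_moles(state):
        # Moles of diamond at the present timeslice, in the AI's rest frame,
        # inside the future-directed causal lightcone of its creation,
        # discounting atoms near boundaries, dislocations and point defects,
        # and capping the allowed C-14 fraction.
        return estimate_diamond_moles(
            state,
            max_c14_ratio=n1_c14_ratio,
            exclude_within_bonds_of_boundary=n2_boundary_bonds,
            exclude_within_bonds_of_dislocation=n3_dislocation_bonds,
            exclude_within_bonds_of_defect=n4_defect_bonds,
        )

    # Subtract the counterfactual world-history in which the AI shut down at creation,
    # so only diamond the AI itself caused to exist is rewarded.
    return countable_moles(world_state) - countable_moles(counterfactual_world_state)
```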
As mentioned above, I fully acknowledge that this still leaves the usual Inner Alignment problems unsolved: if we apply Reinforcement Learning (or something similar such as Direct Preference Optimization) with this reward function to our AI, how do we ensure that it actually becomes a diamond maximizer, rather than a biased estimator of diamond? I suspect we might want to look at some form of GAN, where the reward-estimating circuitry is not part of the Reinforcement Learning process, but is being trained in some other way. That still leaves the Inner Alignment problem of training a diamond maximizer instead of a hacker of reward model estimators.
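A sketch of the structural separation being suggested, with every interface here a hypothetical stand-in: the reward estimator is fitted only to its own audited data stream, and the policy’s RL update never touches the reward estimator’s parameters.

```python
def train_step(policy, reward_model, environment, audited_batch):
    # 1. Reward-model update: fit independently audited diamond measurements only.
    reward_model.fit(audited_batch)

    # 2. Policy update: RL against the current reward model's scores.
    trajectory = environment.rollout(policy)
    rewards = [reward_model.score(state) for state in trajectory]  # no gradients flow back
    policy.update(trajectory, rewards)  # this step adjusts policy parameters only
```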
I think if we’re fine with building an “increaser of diamonds in familiar contexts”, that’s pretty easy, and yeah I think “wrap an LLM or similar” is a promising approach. If we want “maximize diamonds, even in unfamiliar contexts”, I think that’s a harder problem, and my impression is that the MIRI folks think the latter one is the important one to solve.
What in my diamond maximization proposal above only works in familiar contexts? Most of it is (unsurprisingly) about crystallography and isotopic ratios, plus a standard causal wrapper. (If you look carefully, I even allowed for the possibility of FTL.)
The obvious “brute force” solution to aimability is a practical, approximately Bayesian, GOFAI equivalent of AIXI that is capable of tool use and contains an LLM as a tool. This is extremely aimable — it has an explicit slot to plug a utility function in. Which makes it extremely easy to build a diamond maximizer, or a paperclip maximizer, or any other such x-risk. Then we need to instead plug in something that hopefully isn’t an x-risk, like value learning or CEV or “solve goalcraft” as the terminal goal: figure out what we want, then optimize that, while appropriately pessimizing that optimization over remaining uncertainties in “what we want”.
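A toy sketch of what “an explicit slot to plug a utility function in” means structurally; the world-model and action representations here are invented for illustration:

```python
from itertools import product

def plan(world_models, actions, utility_fn, horizon=3):
    """Pick the next action by expected utility over a posterior of world models.

    world_models: list of (probability, simulate) pairs, where simulate maps an
                  action sequence to a predicted world state.
    utility_fn:   the pluggable terminal goal, mapping a predicted state to a number.
    """
    def expected_utility(action_sequence):
        return sum(p * utility_fn(simulate(action_sequence)) for p, simulate in world_models)

    best_sequence = max(product(actions, repeat=horizon), key=expected_utility)
    return best_sequence[0]  # execute the first action of the best plan
```

Swapping `utility_fn` from a diamond counter to a paperclip counter, or to a value-learning objective, is the entire difference between those systems, which is the sense in which this design is extremely aimable (and extremely easy to mis-aim).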
I don’t think we have any way of getting an AI to “care about” any arbitrary particular thing at all, by the “attempt to maximize that thing, self-correct towards maximizing that thing if the current strategies are not working” definition of “care about”. Even if we relax the “and we pick the thing it tries to maximize” constraint.
I don’t think that that’s the view of whoever wrote the paragraph you’re quoting, but at this point we’re doing exegesis
To be clear: The diamond maximizer problem is about getting specific intended content into the AI’s goals (“diamonds” as opposed to some random physical structure it’s maximizing), not just about building a stable maximizer.
Thanks for the clarification!
If you relax the “specific intended content” constraint, and allow for maximizing any random physical structure, as long as it’s always the same physical structure in the real world and not just some internal metric that has historically correlated with the amount of that structure that existed in the real world, does that make the problem any easier / is there a known solution? My vague impression was that the answer was still “no, that’s also not a thing we know how to do”.
I expect it makes it easier, but I don’t think it’s solved.