The value of AI aimability may be overblown. If an AI is not aimable, its goals will perform an eternal random walk, so the AI will pose only short-term risk, with no risk of world takeover. (Some may object that after the random walk it will get stuck in some Waluigi state forever; but if that actually works as a way of getting a fixed goal system, why don’t we research such strange attractors in the space of AI goals?)
AI will become globally catastrophically dangerous only after aimability is solved. Research into aimability only brings that moment closer.
The wording “AI alignment” prevents us from seeing this risk, as it lumps together aimability and giving the AI nice goals.
Yes, this is a good point. Aimability Research increases the kurtosis of the AI outcome distribution, making both the right tail (paradise) and the left tail (total annihilation) heavier, and reducing the so-so outcomes in the center.
Only Goalcrafting Research can change the relative weights.
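As a toy illustration of this picture (purely a sketch, with made-up numbers): model the outcome distribution as a mixture in which aimability research raises the probability that the AI coherently optimizes anything at all, fattening both tails at the expense of the middling outcomes, while goalcrafting research changes only the chance that the coherent goal is a good one. The code below measures tail mass directly rather than kurtosis, and every name and parameter in it is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def outcomes(p_coherent, q_good, n=100_000):
    """Toy mixture model of AI outcomes (illustrative only).

    p_coherent: chance the AI ends up coherently optimizing *something*
                (what aimability research increases, in this sketch).
    q_good:     chance that, given coherence, the goal is good
                (what goalcrafting research changes, in this sketch).
    """
    coherent = rng.random(n) < p_coherent
    good = rng.random(n) < q_good
    middling = rng.normal(0.0, 1.0, n)          # muddled "so-so" outcomes
    return np.where(coherent, np.where(good, 10.0, -10.0), middling)

def tail_weights(x, cut=5.0):
    return dict(paradise=(x > cut).mean(),
                annihilation=(x < -cut).mean(),
                middle=(abs(x) <= cut).mean())

print(tail_weights(outcomes(0.1, 0.5)))  # more aimability ->
print(tail_weights(outcomes(0.6, 0.5)))  # both tails heavier, middle thinner
print(tail_weights(outcomes(0.6, 0.9)))  # goalcrafting shifts only the tail balance
```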
The aspect of aimability where an AI becomes able to consistently want something in particular improves capabilities, and improved capabilities make AI matter a lot more. This might happen without the ability to aim an AI where you want it aimed, which is the other key aspect. Without that latter aspect, aimability is not “solved”, yet AIs become dangerous.
Yes, good point. We might have something like “Self Aimability” for AI before we have the ability to set the point of aim.
A gun which is not easily aimable doesn’t shoot bullets on random walks.
Or in less metaphorical language, the worry is mostly that it’s hard to give the AI the specific goal you want to give it, not so much that it’s hard to make it have any goal at all. I think people generally expect that naively training an AGI without thinking about alignment will get you a goal-directed system; it just might not have the goal you want it to have.
The practical effect of very inaccurate guns in the past was that guns mattered less and battles were often won by bayonet charges or morale. So I think it’s fair to conclude that Aimability just makes AI matter a lot more.
I think that’s a reasonable point (but fairly orthogonal to the previous commenter’s one)
At least some people are worried about the latter, for a very particular meaning of the word “goal”. From that post:
Finally, I’ll note that the diamond maximization problem is not in fact the problem “build an AI that makes a little diamond”, nor even “build an AI that probably makes a decent amount of diamond, while also spending lots of other resources on lots of other stuff” (although the latter is more progress than the former). The diamond maximization problem (as originally posed by MIRI folk) is a challenge of building an AI that definitely optimizes for a particular simple thing, on the theory that if we knew how to do that (in unrealistically simplified models, allowing for implausible amounts of (hyper)computation) then we would have learned something significant about how to point cognition at targets in general.
I think to some extent this is a matter of “yes, I see that you’ve solved the problem in practical terms, and yes, every time we try to implement the theoretically optimal solution it fails due to Goodharting, but we really want the theoretically optimal solution”, which is… not universally agreed, to say the least. But it is a concern some people have.
Hm, I think that paragraph is talking about the problem of getting an AI to care about a specific particular thing of your choosing (here diamond-maximising), not any arbitrary particular thing at all with no control over what it is. The MIRI-esque view thinks the former is hard and the latter happens inevitably.
I don’t think we have any way of getting an AI to “care about” any arbitrary particular thing at all, by the “attempt to maximize that thing, self-correct towards maximizing that thing if the current strategies are not working” definition of “care about”. Even if we relax the “and we pick the thing it tries to maximize” constraint.
I don’t think that that’s the view of whoever wrote the paragraph you’re quoting, but at this point we’re doing exegesis
“We don’t currently have any way of getting any system to learn to robustly optimize for any specific goal once it enters an environment very different from the one it learned in” is my own view, not Nate’s.
Like I think the MIRI folks are concerned with “how do you get an AGI to robustly maximize any specific static utility function that you choose”.
I am aware that the MIRI people think that the latter is inevitable. However, as far as I know, we don’t have even a single demonstration of “some real-world system that robustly maximizes any specific static utility function, even if that utility function was not chosen by anyone in particular”, nor do we have any particular reason to believe that such a system is practical.
And I think Nate’s comment makes it pretty clear that “robustly maximize some particular thing” is what he cares about.
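To make the “robustly optimize once the environment shifts” worry concrete, here is a toy sketch of a learned proxy that matches the intended objective perfectly on the training distribution and comes apart under optimization outside it; the machine setting, break point, and sensor are all invented for illustration.

```python
import numpy as np

# Intended objective: diamond actually produced at machine setting x.
# The machine breaks down when pushed past x = 1, so output collapses.
def true_value(x):
    return np.where(x <= 1.0, x, np.maximum(0.0, 2.0 - x))

# Learned proxy: a sensor reading that was a perfect fit on training data,
# where the machine was only ever run at settings x in [0, 1].
def proxy(x):
    return x

rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 1.0, 1000)
print(np.corrcoef(true_value(train_x), proxy(train_x))[0, 1])   # ~1.0 in-distribution

# Optimize the proxy over a wider deployment range of settings:
deploy_x = np.linspace(0.0, 5.0, 501)
best = deploy_x[np.argmax(proxy(deploy_x))]
print(best, proxy(best), true_value(best))   # proxy says 5.0 is great; real output is 0
```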
To be clear: The diamond maximizer problem is about getting specific intended content into the AI’s goals (“diamonds” as opposed to some random physical structure it’s maximizing), not just about building a stable maximizer.
Thanks for the clarification!
If you relax the “specific intended content” constraint, and allow for maximizing any random physical structure, as long as it’s always the same physical structure in the real world and not just some internal metric that has historically correlated with the amount of that structure that existed in the real world, does that make the problem any easier / is there a known solution? My vague impression was that the answer was still “no, that’s also not a thing we know how to do”.
I expect it makes it easier, but I don’t think it’s solved.
So as an engineer I have trouble engaging with this as a problem.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal “maximize diamonds in an aligned way”, why not a bunch of small grounded ones.
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday, comparing progress against the original plans”.
“Given this step of the plan, do it”
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
And so on. And any goal that isn’t something the model has empirical confidence in (i.e., confidence because it’s in distribution for the training environment) should be blocked by an outer framework, so that the unqualified model never attempts it.
I think the problem MIRI has is that this myopic model is not aware of context, and so it will do bad things sometimes. Maybe the diamonds are being cut into IC wafers and used in missiles to commit genocide.
Is that what it is? Or maybe the fear is that one of these tasks could go badly wrong? That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage: fences to limit the robots’ operating area, timers that shut down control after a timeout, etc.
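As a sketch of the “outer framework” gating idea above (not anyone’s actual proposal): assume we have some calibrated score of how in-distribution a requested task is, and refuse anything below a threshold. `Task`, `task_confidence`, and the threshold are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    max_runtime_s: float          # timer that shuts the controller down
    workspace: str                # fenced-off area the robot may act in

def gated_execute(task: Task,
                  task_confidence: Callable[[Task], float],
                  execute: Callable[[Task], None],
                  threshold: float = 0.95) -> bool:
    """Only run tasks the model has empirical confidence in.

    `task_confidence` is assumed to return the model's calibrated success
    probability on tasks like this one, estimated from the training
    distribution; anything below `threshold` is punted back to a human.
    """
    if task_confidence(task) < threshold:
        print(f"Refusing out-of-distribution task: {task.description!r}")
        return False                      # escalate to a human operator instead
    execute(task)                         # bounded by workspace fences + timer
    return True

# Hypothetical usage with stub implementations:
ok = gated_execute(
    Task("remove the output from diamond synthesis machine A53 and clean it",
         max_runtime_s=600, workspace="cell_A53"),
    task_confidence=lambda t: 0.99,       # stub: pretend this is in-distribution
    execute=lambda t: print("executing:", t.description),
)
```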
I think the MIRI objection to that type of human-in-the-loop system is that it’s not optimal because sometimes such a system will have to punt back to the human, and that’s slow, and so the first effective system without a human in the loop will be vastly more effective and thus able to take over the world, hence the old “that’s safe but it doesn’t prevent someone else from destroying the world”:
We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.
So my impression is that the MIRI viewpoint is that if humanity is to survive, someone needs to solve the “disempower anyone who could destroy the world” problem, and that they have to get that right on the first try, and that’s the hard part of the “alignment” problem. But I’m not super confident that that interpretation is correct and I’m quite confident that I find different parts of that salient than people in the MIRI idea space.
Anyone who largely agrees with the MIRI viewpoint want to weigh in here?
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn’t to go build the thing; it’s that solving it is an indication that we’ve become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
If everyone in the world chooses to permanently use very weak systems because they’re scared of AI killing them, then yes, the impact of any given system failing will stay low. But that’s not what’s going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. ‘maybe humans don’t deserve to live’, ‘if I don’t do it someone else will anyway’, ‘if it’s that easy to destroy the world then we’re fucked anyway so I should just do the Modest thing of assuming nothing I do is that important’...).
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”. I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where “aligning” includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it’s powerful.)
Thanks for the reply.
This sounds like a good and reasonable approach, and also not at all like the sort of thing where you’re trying to instill any values at all into an ML system. I would call this “usable and robust tool construction” not “AI alignment”. I expect standard business practice will look something like this: even when using LLMs in a production setting, you generally want to feed it the minimum context to get the results you want, and to have it produce outputs in some strict and usable format.
“How can I build a system powerful enough to stop everyone else from doing stuff I don’t like” sounds like more of a capabilities problem than an alignment problem.
Yeah, this sounds right to me. I expect that there’s a lot of danger inherent in biological gain-of-function research, but I don’t think the solution to that is to create a virus that will infect people and cause symptoms that include “being less likely to research dangerous pathogens”. Similarly, I don’t think “do research on how to make systems that can do their own research even faster” is a promising approach to solve the “some research results can be misused or dangerous” problem.
This is rather off-topic here, but for any AI that has an LLM as a component of it, I don’t believe diamond-maximization is a hard problem, apart from Inner Alignment problems. The LLM knows the meaning of the word ‘diamond’ (GPT-4 defined it as “Diamond is a solid form of the element carbon with its atoms arranged in a crystal structure called diamond cubic. It has the highest hardness and thermal conductivity of any natural material, properties that are utilized in major industrial applications such as cutting and polishing tools. Diamond also has high optical dispersion, making it useful in jewelry as a gemstone that can scatter light in a spectrum of colors.”). The LLM also knows its physical and optical properties, its social, industrial and financial value, its crystal structure (with images and angles and coordinates), what carbon is, its chemical properties, how many electrons, protons and neutrons a carbon atom can have, its terrestrial isotopic ratios, the half-life of carbon-14, what quarks a neutron is made of, etc. etc. etc. — where it fits in a vast network of facts about the world. Even if the AI also had some other very different internal world model and ontology, there’s only going to be one “Rosetta Stone” optimal-fit mapping between the human ontology that the LLM has a vast amount of information about and any other arbitrary ontology, so there’s more than enough information in that network of relationships to uniquely locate the concepts in that other ontology corresponding to ‘diamond’. This is still true even if the other ontology is larger and more sophisticated: for example, locating Newtonian physics in relativistic quantum field theory and mapping a setup from the former to the latter isn’t hard: its structure is very clearly just the large-scale low-speed limiting approximation.
The point where this gets a little more challenging is Outer Alignment, where you want to write a mathematical or pseudocode reward function for training a diamond optimizer using Reinforcement Learning (assuming our AI doesn’t just have a terminal goal utility function slot that we can directly connect this function to, like AIXI): then you need to also locate the concepts in the other ontology for each element in something along the lines of “pessimizingly estimate the total number of moles of diamond (having at a millimeter-scale average any isotopic ratio of C-12 to C-13 but no more than N1 times the average terrestrial proportion of C-14, discounting any carbon atoms within N2 C-C bonds of a crystal-structure boundary, or within N3 bonds of a crystal-structure dislocation, or within N4 bonds of a lattice substitution or vacancy, etc. …) at the present timeslice in your current rest frame inside the region of space within the future-directed causal lightcone of your creation, and subtract the answer for the same calculation in a counterfactual alternative world-history where you had permanently shut down immediately upon being created, but the world-history was otherwise unchanged apart from future causal consequences of that divergence”. [Obviously this is a specification design problem, and the example specification above may still have bugs and/or omissions, but there will only be a finite number of these, and debugging this is an achievable goal, especially if you have a crystallographer, a geologist, and a jeweler helping you, and if a non-diamond-maximizing AI also helps by asking you probing questions. There are people whose jobs involve writing specifications like this, including in situations with opposing optimization pressure.]
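Rendered as pseudocode, the shape of that specification might look like the sketch below. `WorldModel` and `count_diamond_moles` are hypothetical interfaces standing in for the ontology-translation step described above, not runnable physics.

```python
from typing import Protocol

class WorldModel(Protocol):
    """Hypothetical interface onto the AI's own ontology, after the
    'Rosetta Stone' mapping from human concepts has been located."""
    def count_diamond_moles(self, counterfactual_shutdown_at_creation: bool) -> float:
        """Pessimistic (lower-bound) estimate of moles of diamond at the present
        timeslice, in the agent's rest frame, within the future-directed causal
        lightcone of its creation; excludes carbon atoms near crystal-structure
        boundaries, dislocations, substitutions, or vacancies, and caps C-14
        content, per the tolerances N1..N4 in the prose specification."""
        ...

def diamond_reward(world: WorldModel) -> float:
    """Reward = pessimistically estimated diamond attributable to the agent:
    actual diamond minus diamond in the counterfactual world-history where the
    agent permanently shut down immediately upon being created."""
    actual = world.count_diamond_moles(counterfactual_shutdown_at_creation=False)
    baseline = world.count_diamond_moles(counterfactual_shutdown_at_creation=True)
    return actual - baseline
```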
As mentioned above, I fully acknowledge that this still leaves the usual Inner Alignment problems unsolved: if we apply Reinforcement Learning (or something similar such as Direct Preference Optimization) with this reward function to our AI, how do we ensure that it actually becomes a diamond maximizer, rather than a biased estimator of diamond? I suspect we might want to look at some form of GAN, where the reward-estimating circuitry is not part of the Reinforcement Learning process, but is being trained in some other way. That still leaves the Inner Alignment problem of training a diamond maximizer instead of a hacker of reward-model estimators.
I think if we’re fine with building an “increaser of diamonds in familiar contexts”, that’s pretty easy, and yeah I think “wrap an LLM or similar” is a promising approach. If we want “maximize diamonds, even in unfamiliar contexts”, I think that’s a harder problem, and my impression is that the MIRI folks think the latter one is the important one to solve.
What in my diamond maximization proposal above only works in familiar contexts? Most of it is (unsurprisingly) about crystallography and isotopic ratios, plus a standard causal wrapper. (If you look carefully, I even allowed for the possibility of FTL.)
The obvious “brute force” solution to aimability is a practical, approximately Bayesian, GOFAI equivalent of AIXI that is capable of tool use and contains an LLM as a tool. This is extremely aimable: it has an explicit slot to plug a utility function into. Which makes it extremely easy to build a diamond maximizer, or a paperclip maximizer, or any other such x-risk. Then we need to instead plug in something that hopefully isn’t an x-risk, like value learning or CEV or “solve goalcraft” as the terminal goal: figure out what we want, then optimize that, while appropriately pessimizing that optimization over the remaining uncertainties in “what we want”.
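To make the “explicit slot for a utility function” concrete, here is a toy expected-utility planner with the utility function passed in as a parameter. The action set and world model are stand-ins, and nothing below is a practical approximation of AIXI; it only illustrates why such an architecture is trivially re-aimable at diamonds, paperclips, or anything else.

```python
import random
from typing import Callable, Sequence

State = dict   # toy world state; a real system would use a learned world model

def plan(actions: Sequence[str],
         sample_outcome: Callable[[str], State],
         utility: Callable[[State], float],      # <- the "slot" a goal plugs into
         n_samples: int = 1000) -> str:
    """Pick the action with the highest Monte-Carlo expected utility."""
    def expected_utility(a: str) -> float:
        return sum(utility(sample_outcome(a)) for _ in range(n_samples)) / n_samples
    return max(actions, key=expected_utility)

# Toy usage: the same planner becomes a diamond maximizer or a paperclip
# maximizer purely by swapping the utility function that gets plugged in.
rng = random.Random(0)
def sample_outcome(action: str) -> State:
    return {"diamonds": rng.gauss(10, 2) if action == "mine" else 0.0,
            "paperclips": rng.gauss(10, 2) if action == "bend_wire" else 0.0}

print(plan(["mine", "bend_wire"], sample_outcome, utility=lambda s: s["diamonds"]))
print(plan(["mine", "bend_wire"], sample_outcome, utility=lambda s: s["paperclips"]))
```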
If we find that an AI can stop its random walk on a goal X, we can use this as an aimability instrument, and find a way to manipulate the position of X.
I don’t think “random” AI goals is a thing that will ever happen.
I think it’s much more likely that, if there are Aimability failures, they will be highly nonrandom and push AI towards various attractors (like how the behavior of dictators is surprisingly consistent across time, space and ideology)
Perhaps, or perhaps not? I might be able to design a gun which shoots bullets in random directions (not on random walks), without being able to choose the direction.
Maybe we can back up a bit, and you could give some intuition for why you expect goals to go on random walks at all?
My default picture is that goals walk around during training and perhaps during a reflective process, and then stabilise somewhere.
My intuition: imagine an LLM-based agent. It has a fixed prompt and some context text, and it uses these iteratively. The context part can change, and as it changes it affects the interpretation of the fixed part of the prompt. Examples are the Waluigi effect and other attacks. This causes goal drift.
This may have bad consequences, as when a robot suddenly turns into a Waluigi and starts randomly killing everyone around. But long-term planning and deceptive alignment require a very fixed goal system.
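A skeletal version of that agent loop, to show where the drift enters: the fixed prompt never changes, but the model only ever sees it concatenated with an ever-growing context, so adversarial or merely unlucky context (Waluigi-style) can shift how the fixed part is interpreted in later steps. `call_llm` is a placeholder, not a real API.

```python
FIXED_PROMPT = "You are a warehouse robot. Only move pallets; never harm anyone."

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call."""
    raise NotImplementedError

def agent_loop(observations):
    context = []                          # mutable part: grows every step
    for obs in observations:
        context.append(f"Observation: {obs}")
        # The model conditions on FIXED_PROMPT *and* the accumulated context;
        # the effective goal is whatever this whole string induces, so injected
        # or degenerate context can reinterpret the fixed instructions.
        action = call_llm(FIXED_PROMPT + "\n" + "\n".join(context))
        context.append(f"Action: {action}")
        yield action
```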
Right, makes complete sense in the case of LLM-based agents, I guess I was just thinking about much more directly goal-trained agents.
This just isn’t true. AGI is a “gun that can aim itself”. The user not being able to aim it doesn’t mean it won’t aim and achieve something, quite effectively.
Less metaphorically: if the AGI performs a semi-random walk through goal space, or just misses your intended goal by enough, it may settle (even temporarily) on a coherent goal that’s incompatible with yours. It may then eliminate humanity as a competitor to its reaching that goal.