It is possible to define the alignment problem without using such fuzzy concepts as “happiness” or “value”.
For example, there are two agents: R and H. The agent R can perform some set of actions.
The agent H prefers some of R’s actions over others. For example, H prefers the action make_pie to the action kill_all_humans.
Some of the preferences are unknown even to H itself (e.g. whether it prefers pierogi to borscht).
Among other things, the set of R’s actions includes:
ask_h_which_of_the_actions_is_preferable
infer_preferences_from_the_behavior_of_h
explain_consequences_of_the_action_to_h
switch_itself_off
In any given situation, the perfect agent R always chooses the most preferable action (according to H). The goal is to create an agent that is as close to the perfect R as possible.
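In code, that sketch amounts to something like the following (a minimal illustration in Python; h_prefers is a placeholder for whatever access R has to H’s preference relation, which is exactly the part the sketch leaves open):
# A minimal sketch of the formalism above; all names are illustrative.
from typing import Callable, List

Action = str

ACTIONS: List[Action] = [
    "make_pie",
    "kill_all_humans",
    "ask_h_which_of_the_actions_is_preferable",
    "infer_preferences_from_the_behavior_of_h",
    "explain_consequences_of_the_action_to_h",
    "switch_itself_off",
]

# h_prefers(a, b) is True if H prefers action a over action b in the current
# situation. How R obtains this relation (asking, inferring, ...) is left open.
def perfect_r(actions: List[Action], h_prefers: Callable[[Action, Action], bool]) -> Action:
    best = actions[0]
    for candidate in actions[1:]:
        if h_prefers(candidate, best):
            best = candidate
    return best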
Of course, this formalism is incomplete. But I think it demonstrates that the alignment problem can be framed as a technical problem without delving into metaphysics.
If you replace “value” with “preference” in what I wrote, I believe that it all still applies.
If you both “ask H about the preferable action” and “infer H’s preferences from the behavior of H”, then what do you do when the two yield different answers? That’s not a technical question; you could technically choose either one or even try to “average” them somehow. And it will happen.
The same applies if you have to deal with two humans, H1 and H2; they are sometimes going to disagree. How do you choose then?
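To make the arbitrariness concrete, here is a minimal sketch (the names are hypothetical, not part of the original formalism). Whatever you write in each branch below is a choice the formalism does not make for you:
# Purely illustrative. Whatever each branch does is a prior commitment
# by the designer, not something derived from the formalism.
def resolve(stated: str, inferred: str, policy: str = "trust_stated") -> str:
    if stated == inferred:
        return stated
    if policy == "trust_stated":
        return stated
    if policy == "trust_inferred":
        return inferred
    if policy == "ask_again":
        return "ask_h_which_of_the_actions_is_preferable"
    raise ValueError(f"unknown policy: {policy}")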
There are also technical problems with both of those, and they’re the kind of technical problems I was talking about that feed back on the philosophical choices. You might start with one philosophical position, then want to change when you saw the technical results.
For the first:
1. It assumes that H’s “real” preferences comport with what H says. That isn’t a given, because “preference” is just as hard to define as “value”. Choosing to ask H really amounts to defining preference to mean “stated preference”.
2. It also assumes that H will always be able to state a preference, will be able to do so in a way that you can correctly understand, and will not be unduly ambivalent about it.
3. You’d probably also prefer that H (or somebody else...) not regret that preference if it gets enacted. You’d probably like to have some ability to predict that H is going to get unintended consequences, and at least give H more information before going ahead. That’s an extra feature not implied by a technical specification based on just doing whatever H says.
4. Related to (3), it assumes that H can usefully state preferences about courses of action more complicated than H could plan, when the consequences themselves may be more complicated than H can understand. And you yourself may have very complicated forms of uncertainty about those consequences, which makes it all the harder to explain the whole thing to H.
All of that is pretty unlikely.
The second is worse:
It assumes that H’s actions always reflect H’s preferences, which amounts to adopting a different definition of “preference”, probably even further from the common meaning.
H’s preferences aren’t required to be any simpler or more regular than a list of every possible individual situation, with a preferred course of action for each one independent of all others. For that matter, the list is allowed to change, or be dependent on when some particular circumstances occur, or include “never do the same thing twice in the same circumstances”. Even if H’s behavior is assumed to reflect H’s preferences, there’s still nothing that says H has to have an inferrable set of preferences.
To make inferences about H’s preferences, you first have to make a leap of faith and assume that they’re simple enough, compact enough, and consistent enough to be “closely enough” approximated by any set of rules you can infer. That is a non-technical leap of faith. And there’s a very good chance that it would be the wrong leap to make.
It assumes that the rules you can infer from H’s behavior are reasonably prescriptive about the choices you might have to make. Your action space may be far beyond anything H could do, and the choices you have to make may be far beyond anything H could understand.
So you end up taking a bunch of at best approximate inferences about H’s existing preferences, and trying to use them to figure out “What would H do if H were not a human, but in fact some kind of superhuman AGI totally unlike a human, but were somehow still H?”. That’s probably not a reasonable question to ask.
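One way to see the leap of faith in code (the types here are hypothetical, not part of the sketch above): nothing stops H’s “preferences” from being a bare lookup over full histories, and any compact model you fit to a finite sample of such a table is an assumption, not an inference.
from typing import Dict, Tuple

History = Tuple[str, ...]  # everything that has happened so far
# The worst case: a raw table from (situation, history) to a preferred action,
# with no structure for an inference procedure to latch onto.
PreferenceTable = Dict[Tuple[str, History], str]

def preferred_action(table: PreferenceTable, situation: str, history: History) -> str:
    # No compression, no generalization: if this exact pair isn't in the table,
    # the preference simply isn't defined, and no inferred rule predicts it.
    return table[(situation, history)]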
Oh, one more thing I should probably add: it gets even more interesting when you ask whether the AGI might act to change the human’s values (or preferences; there’s really no difference and both are equally “fuzzy” concepts). Any action that affects the human at all is likely to have some effect on those values, and some actions could be targeted to have very large effects.
I agree; you’ve listed some very valid concerns about my half-baked formalism.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
The formalism doesn’t have to be perfect. If our theoretical R makes its decisions according to the best possible approximate inferences about H’s existing preferences, then R is much better than rogue AGI, even if it sometimes makes deadly mistakes. Any improvement over rogue AGI is a good improvement.
Compare: the Tesla AI sometimes causes deadly crashes. Yet the Tesla AI is much better than the status quo, as its net effect is thousands of saved lives.
And after we have a decent formalism, we can build a better formalism from it, and then repeat and repeat.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
Nobody’s even gotten close to metaphysics. Ethics or even epistemology, OK. Metaphysics, no. The reason I’m getting pedantic about the technical meaning of the word is that “metaphysics”, when used non-technically, is often a tag word used for “all that complicated, badly-understood stuff that might interfere with bulling ahead”.
My narrow point is that alignment isn’t a technical problem until you already have an adequate final formalism. Creating the formalism itself isn’t an entirely technical process.
If you’re talking about inferring, learning, being instructed about, or actually carrying out human preferences, values, or paths to a “good outcome”, then as far as I know nobody has an approximately adequate formalism, and nobody has a formalism with any clear path to be extended to adequacy, or even any clear hope of it. I’ve seen proposals, but none of them have stood up to 15 minutes of thought. I don’t follow it all the time; maybe I’ve missed something.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use. There’s no clear statement of what that would mean.
My broader concern is that I’m unsure an adequate list of meta-criteria can be established, and that I’m even less sure that the base formalism can exist at all. Demanding a formal system that can’t be achieved can lead to all kinds of bad outcomes, many of them related to erroneously pretending that a formalism you have usefully approximates the formalism you need.
It would be very easy to decide that, for the sake of “avoiding metaphysics”, it was important to adopt, agree upon, and stay within a certain framework—one that did not meet meta-criteria like “allows you to express constraints that assure that everybody doesn’t end up worse than dead”, let alone “allows you to express what it means to achieve the maximum benefit from AGI”, or “must provide prescriptions implementable in actual software”.
Oh, people would keep tweaking any given framework to cover more edge cases, and squeeze more and more looseness out of some definitions, and play around with more and more elegant statements of the whole thing… but that could just be a nice distraction from the fundamental lack of any “no fate worse than death” guarantee anywhere in it.
A useful formalism does have to be perfect in achieving no fates worse than death, or no widespread fates worse than death. It has to define fates worse than death in a meaningful way that doesn’t sacrifice the motivation for having the constraint in the first place. It has to achieve that over all possible fates worse than death, including ones nobody has thought of yet. It has to let you at least approximately exclude at least the widespread occurrence of anything that almost anybody would think was a fate worse than death. Ideally while also enabling you to actually get positive benefits from your AGI.
And formal frameworks are often brittle; a formalism that doesn’t guarantee perfection does not necessarily even avert catastrophe. If you make a small mistake in defining “fate worse than death”, that may lead to a very large prevalence of the case you missed.
It’s not even true that “the best possible inferences” are necessarily better than nothing, let alone adequate in any absolute sense. In fact, a truly rogue AGI that doesn’t care about you at all seems more likely to just kill you quickly, whereas who knows what a buggy AGI that was interested in your fate might choose to do...
The very adoption of the word “alignment” seems to be a symptom of a desire to at least appear to move toward formalizing, without the change actually tending to improve the chances of a good outcome. I think people were trying to tighten up from “good outcome” when they adopted “alignment”, but actually I don’t think it helped. The connotations of the word “alignment” tend to concentrate attention on approaches that rely on humans to know what they want, or at least to have coherent desires, which isn’t necessarily a good idea at all. On the other hand, the switch doesn’t actually seem to make it any easier to design formal structures or technical approaches that will actually lead to good software behavior. It’s still vague in all the ways that matter, and it doesn’t seem to be improving at all.
We could use the Tesla AI as a model.
To create a perfect AI for self-driving, one first must resolve all that complicated, badly-understood stuff that might interfere with bulling ahead. For example, whether the car should prefer the driver’s life over the pedestrian’s life.
But while we contemplate such questions, we lose tens of thousands of lives in car crashes per year.
The people of Tesla made the rational decision of bulling ahead instead. As their AI is not perfect, it sometimes makes decisions with deadly consequences. But in total, it saves lives.
Their AI has an imperfect but good enough formalism. AFAIK, it’s something that could be described in English as “drive to the destination without breaking the driving regulations, while minimizing the number of crashes”, or something like this.
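Rendered as a toy objective (this is a paraphrase of that English description, not Tesla’s actual planner; every name here is hypothetical):
# Hypothetical sketch: hard constraints for legality and reaching the goal,
# soft objective for safety. Not Tesla's actual system.
def route_cost(route, violates_regulations, reaches_destination, expected_crashes) -> float:
    if violates_regulations(route) or not reaches_destination(route):
        return float("inf")         # hard constraints
    return expected_crashes(route)  # minimize expected crashes

def choose_route(routes, violates_regulations, reaches_destination, expected_crashes):
    return min(routes, key=lambda r: route_cost(
        r, violates_regulations, reaches_destination, expected_crashes))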
As their AI is saving lives on net, their formalism is indeed good enough. They have successfully reduced a complex ethical/societal problem to a purely technical problem.
Rogue AGI is very likely to kill all humans. Any better-than-rogue AGI is an improvement, even if it doesn’t fully understand the complicated and ever-changing human preferences, and even if some people will suffer as a result.
Even my half-baked sketch of a formalism, if implemented, would produce an AI that is better than rogue AGI, in spite of the many problems you listed. Thus, working on it is better than waiting for certain death.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use
A formalism that saves more lives is better than one that saves fewer lives. That’s good enough for a start.
If you’re trying to solve a hard problem, start with something simple and then iteratively improve on it. This includes meta-criteria.
fate worse than death
I strongly believe that there is no such thing. I explained it in detail here.
I agree with your sketch of the alignment problem.
But once you move past the sketch stage, the solutions depend heavily on the structure of A, which is why I questioned Rob’s dismissal of the now-dominant non-MIRI safety approaches (which are naturally more connectionist/DL friendly).