I agree, you’ve listed some very valid concerns about my half-baked formalism.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
The formalism doesn’t have to be perfect. If our theoretical R makes its decisions according to the best possible approximate inferences about H’s existing preferences, then R is much better than a rogue AGI, even if it sometimes makes deadly mistakes. Any improvement over rogue AGI is a good improvement.
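To make this concrete, here is a minimal toy sketch of what I mean by “acting on approximate inferences about H’s preferences”. It is only an illustration: the candidate preference functions, the Boltzmann-rationality assumption, and all the numbers are things I made up for the example, not part of the formalism itself.

```python
import math

ACTIONS = ["a", "b", "c"]

# Hypothetical candidate preference functions H might have (utility per action).
CANDIDATE_PREFS = {
    "likes_a": {"a": 1.0, "b": 0.2, "c": 0.0},
    "likes_b": {"a": 0.1, "b": 1.0, "c": 0.3},
    "likes_c": {"a": 0.0, "b": 0.2, "c": 1.0},
}

def update_posterior(prior, observed_choice, beta=2.0):
    # Bayesian update, assuming H chooses Boltzmann-rationally w.r.t. its true preferences.
    posterior = {}
    for name, prefs in CANDIDATE_PREFS.items():
        z = sum(math.exp(beta * u) for u in prefs.values())
        likelihood = math.exp(beta * prefs[observed_choice]) / z
        posterior[name] = prior[name] * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def best_action(posterior):
    # R's decision rule: maximize expected utility under its beliefs about H's preferences.
    def expected_utility(action):
        return sum(p * CANDIDATE_PREFS[name][action] for name, p in posterior.items())
    return max(ACTIONS, key=expected_utility)

belief = {name: 1.0 / len(CANDIDATE_PREFS) for name in CANDIDATE_PREFS}
for choice in ["b", "b", "a"]:   # H's observed choices (made up)
    belief = update_posterior(belief, choice)

print(belief)
print("R acts:", best_action(belief))
```

The point is only the shape of the loop: infer, then act on the inference. The real difficulty is hidden in where the candidate preferences come from.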
Compare: the Tesla AI sometimes causes deadly crashes. Yet the Tesla AI is much better than the status quo, as its net effect is thousands of saved lives.
And after we have a decent formalism, we can build a better formalism from it, and then repeat and repeat.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
Nobody’s even gotten close to metaphysics. Ethics or even epistemology, OK. Metaphysics, no. The reason I’m getting pedantic about the technical meaning of the word is that “metaphysics”, when used non-technically, is often a tag word used for “all that complicated, badly-understood stuff that might interfere with bulling ahead”.
My narrow point is that alignment isn’t a technical problem until you already have an adequate final formalism. Creating the formalism itself isn’t an entirely technical process.
If you’re talking about inferring, learning, being instructed about, or actually carrying out human preferences, values, or paths to a “good outcome”, then as far as I know nobody has an approximately adequate formalism, and nobody has a formalism with any clear path to be extended to adequacy, or even any clear hope of it. I’ve seen proposals, but none of them have stood up to 15 minutes of thought. I don’t follow the field all the time, though; maybe I’ve missed something.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use. There’s no clear statement of what that would mean.
My broader concern is that I’m unsure an adequate list of meta-criteria can be established, and that I’m even less sure that the base formalism can exist at all. Demanding a formal system that can’t be achieved can lead to all kinds of bad outcomes, many of them related to erroneously pretending that a formalism you have usefully approximates the formalism you need.
It would be very easy to decide that, for the sake of “avoiding metaphysics”, it was important to adopt, agree upon, and stay within a certain framework—one that did not meet meta-criteria like “allows you to express constraints that assure that everybody doesn’t end up worse than dead”, let alone “allows you to express what it means to achieve the maximum benefit from AGI”, or “must provide prescriptions implementable in actual software”.
Oh, people would keep tweaking any given framework to cover more edge cases, and squeeze more and more looseness out of some definitions, and play around with more and more elegant statements of the whole thing… but that could just be a nice distraction from the fundamental lack of any “no fate worse than death” guarantee anywhere in it.
A useful formalism does have to be perfect in achieving no fates worse than death, or at least no widespread fates worse than death. It has to define “fate worse than death” in a meaningful way that doesn’t sacrifice the motivation for having the constraint in the first place. It has to achieve that over all possible fates worse than death, including ones nobody has thought of yet. It has to let you at least approximately exclude the widespread occurrence of anything that almost anybody would think was a fate worse than death, ideally while also enabling you to actually get positive benefits from your AGI.
And formal frameworks are often brittle; a formalism that doesn’t guarantee perfection does not necessarily even avert catastrophe. If you make a small mistake in defining “fate worse than death”, that may lead to a very large prevalence of the case you missed.
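To make the brittleness point concrete, here is a cartoon in code. It is my own toy example, not a model of anyone’s actual proposal: the constraint predicate checks one attribute of an outcome and misses another, and the optimizer lands exactly on the case the definition left out.

```python
# Hypothetical outcomes described by made-up attributes (benefit, pain, autonomy).
outcomes = [
    {"name": "modest good outcome", "benefit": 5,  "pain": 0,  "autonomy": 1.0},
    {"name": "tortured",            "benefit": 9,  "pain": 10, "autonomy": 0.0},
    {"name": "blissful puppet",     "benefit": 10, "pain": 0,  "autonomy": 0.0},
]

def worse_than_death(outcome):
    # The "small mistake": only severe pain counts; total loss of autonomy is missed.
    return outcome["pain"] > 5

def choose(outcomes, constraint):
    # Maximize benefit subject to the (too narrow) constraint.
    allowed = [o for o in outcomes if not constraint(o)]
    return max(allowed, key=lambda o: o["benefit"])

print(choose(outcomes, worse_than_death)["name"])
# -> "blissful puppet": the optimizer concentrates on exactly the case the definition missed.
```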
It’s not even true that “the best possible inferences” are necessarily better than nothing, let alone adequate in any absolute sense. In fact, a truly rogue AGI that doesn’t care about you at all seems more likely to just kill you quickly, whereas who knows what a buggy AGI that was interested in your fate might choose to do...
The very adoption of the word “alignment” seems to be a symptom of a desire to at least appear to move toward formalizing, without the change actually tending to improve the chances of a good outcome. I think people were trying to tighten up from “good outcome” when they adopted “alignment”, but actually I don’t think it helped. The connotations of the word “alignment” tend to concentrate attention on approaches that rely on humans to know what they want, or at least to have coherent desires, which isn’t necessarily a good idea at all. On the other hand, the switch doesn’t actually seem to make it any easier to design formal structures or technical approaches that will actually lead to good software behavior. It’s still vague in all the ways that matter, and it doesn’t seem to be improving at all.
We could use the Tesla AI as a model.

To create a perfect AI for self-driving, one would first have to resolve all that complicated, badly-understood stuff that might interfere with bulling ahead. For example, whether the car should prefer the driver’s life over a pedestrian’s life.
But while we contemplate such questions, we lose tens of thousands of lives in car crashes per year.
The people of Tesla made the rational decision of bulling ahead instead. Their AI is not perfect, so sometimes it makes decisions with deadly consequences. But in total, it saves lives.
Their AI has an imperfect but good-enough formalism. AFAIK, it could be described in English roughly as “drive to the destination without breaking the driving regulations, while minimizing the number of crashes”.
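In pseudo-objective form, my reading of that formalism is something like the sketch below. The fields, weights, and numbers are invented for illustration; this is not Tesla’s actual code.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    reaches_destination: bool
    breaks_regulations: bool
    expected_crashes: float   # expected crashes per trip under this plan
    travel_time: float        # hours

def plan_score(plan, crash_weight=1e6, time_weight=1.0):
    # Higher is better: regulations are a hard constraint, crashes dominate travel time.
    if plan.breaks_regulations or not plan.reaches_destination:
        return float("-inf")
    return -(crash_weight * plan.expected_crashes + time_weight * plan.travel_time)

candidate_plans = [
    Plan(True, False, expected_crashes=1e-6, travel_time=1.0),
    Plan(True, False, expected_crashes=1e-7, travel_time=1.2),
    Plan(True, True,  expected_crashes=1e-8, travel_time=0.8),  # illegal shortcut
]

print(max(candidate_plans, key=plan_score))
```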
Since their AI saves lives on net, its formalism is indeed good enough. They have successfully reduced a complex ethical/societal problem to a purely technical problem.
A rogue AGI is very likely to kill all humans. Anything better than a rogue AGI is an improvement, even if it doesn’t fully understand the complicated and ever-changing human preferences, and even if some people will suffer as a result.
Even my half-baked sketch of a formalism, if implemented, would produce an AI that is better than a rogue AGI, in spite of the many problems you listed. Thus, working on it is better than waiting for certain death.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use
A formalism that saves more lives is better than one that saves fewer lives. That’s good enough for a start.
If you’re trying to solve a hard problem, start with something simple and then iteratively improve on it. This includes meta-criteria.
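As a sketch of the loop I have in mind: score candidate formalisms by estimated lives saved, keep whichever scores best, refine it, and repeat. The scoring and refinement functions here are toy stand-ins; in reality they are where almost all of the work lives.

```python
import random

random.seed(0)

def estimate_lives_saved(formalism):
    # Toy stand-in for the hard part: simulation, analysis, red-teaming, etc.
    return -(formalism["caution"] - 0.7) ** 2 + formalism["coverage"]

def propose_refinements(formalism, n=5, step=0.1):
    # Toy stand-in for "build a better formalism from the current one".
    for _ in range(n):
        yield {k: v + random.uniform(-step, step) for k, v in formalism.items()}

def iterate(initial, rounds=20):
    best, best_score = initial, estimate_lives_saved(initial)
    for _ in range(rounds):
        for candidate in propose_refinements(best):
            score = estimate_lives_saved(candidate)
            if score > best_score:   # the meta-criterion: more lives saved wins
                best, best_score = candidate, score
    return best, best_score

print(iterate({"caution": 0.5, "coverage": 0.5}))
```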
fate worse than death
I strongly believe that there is no such thing. I explained it in detail here.