I would expect any proof to fall into some category akin to “you cannot build a program that can look at another program and tell you whether it will halt”. A weaker sort of proof would be that alignment isn’t impossible per se, but requires time exponential in the size of the model, which would make it prohibitively difficult.
Sounds like you’re imagining that you would not try to prove “there is no AGI that will do what you want”, but instead prove “it is impossible to prove that any particular AGI will do what you want”. So aligned AIs are not impossible per se, but they are unidentifiable, and thus you can’t tell whether you’ve got one?
Well, if you can’t create on demand an AGI that does what you want, isn’t that as good as saying that alignment is impossible? But yeah, I don’t expect it’d be impossible for an AGI to do what we want, just impossible for us to make sure, in principle, that it does.
A couple observations on that:
1) The halting problem can’t be solved in full generality, but there are still many specific programs where it is easy to prove that they will or won’t halt. In fact, approximately all actually-useful software exists within that easier subclass.
We don’t need a fully-general alignment tester; we just need one aligned AI. A halting-problem-like result wouldn’t be enough to stop that. Instead of “you can’t prove every case” it would need to be “you can’t prove any positive case”, which would be a much stronger claim. I’m not aware of any problems with results like that.
(Switching to something like “exponential time” instead of “impossible” doesn’t particularly change this; we normally prove that some problem is expensive to solve in the fully-general case, but some instances of the problem can still be solved cheaply.)
2) Even if we somehow got an incredible result like that, that doesn’t rule out having some AIs that are likely aligned. I’m skeptical that “you can’t be mathematically certain this is aligned” is going to stop anyone if you can’t also rule out scenarios like “but I’m 99.9% certain”.
If you could convince the world that mathematical proof of alignment is necessary and that no one should ever launch an AGI with less assurance than that, that seems like you’ve already mostly won the policy battle even if you can’t follow that up by saying “and mathematical proof of alignment is provably impossible”. I think the doom scenarios approximately all involve someone who is willing to launch an AGI without such a proof.
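Point 1 above can be made concrete with a minimal sketch. It assumes a made-up toy language: the statement forms `skip`, `repeat`, `forever`, and `while`, and the names `Verdict` and `analyze`, are invented here for illustration and do not come from any real system. The analyzer is deliberately incomplete: it answers “unknown” whenever it meets a loop it cannot reason about, yet it still gives definite answers for a useful subclass of programs.

```python
from enum import Enum


class Verdict(Enum):
    HALTS = "halts"
    LOOPS = "loops forever"
    UNKNOWN = "unknown"


def analyze(program):
    """Conservatively classify a toy program (a list of statement tuples).

    Hypothetical statement forms, invented for this sketch:
      ("skip",)            does nothing
      ("repeat", n, body)  runs body exactly n times, n a constant
      ("forever", body)    loop with no exit
      ("while", body)      loop with an arbitrary condition we do not analyze
    """
    for stmt in program:
        kind = stmt[0]
        if kind == "skip":
            continue
        if kind == "repeat":
            _, count, body = stmt
            if count == 0:
                continue  # body never runs, so it cannot block termination
            inner = analyze(body)
            if inner is not Verdict.HALTS:
                return inner  # the first non-halting statement decides
        elif kind == "forever":
            # Everything before this statement halted, so we reach it and never leave.
            return Verdict.LOOPS
        elif kind == "while":
            # Giving up here is what keeps the analyzer sound.
            return Verdict.UNKNOWN
        else:
            raise ValueError(f"unknown statement kind: {kind!r}")
    return Verdict.HALTS


if __name__ == "__main__":
    print(analyze([("skip",), ("repeat", 3, [("skip",)])]))
    print(analyze([("repeat", 2, [("forever", [])])]))
    print(analyze([("while", [("skip",)])]))
```

The halting theorem only says that the “unknown” bucket can never be emptied for a fully general language; it says nothing against proving the first two verdicts for particular programs.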
Broadly agree, though I think the issue here might be more subtle: it’s not that determining alignment is like solving the halting problem for a specific piece of software, but that an aligned AGI would itself need to be generally capable of solving something like the halting problem, which is impossible.
Agree also on the fact that this probably still would leave room for an approximately aligned AGI. It then becomes a matter of how large we want our safety margins to be.
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean:
1) Coherence is impossible; any AGI will inevitably sabotage itself
2) Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
3) You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
4) You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
5) Something else
Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate within which it is running (a Turing machine that understands it is a machine, understands the hardware it is running on, understands it can change that hardware or its own programming) and at the same time is able to guarantee sticking to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or even something like “you cannot produce a proof that any given sufficiently general intelligent program will stick to any given constraints; it might, but you can’t know beforehand”.
For an overview of why such a guarantee would turn out to be impossible, I suggest taking a look at Will Petillo’s post “Lenses of Control”.
I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
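A minimal sketch of the kind of program described above, assuming Python. The helper names `digest` and `modify_and_restore` and the marker text are hypothetical, chosen for this example. The function takes a path so it can be tried on a scratch file first; pointing it at the script’s own file makes it act on its own source.

```python
import hashlib
from pathlib import Path

# Hypothetical marker appended during the "modified" phase of each round.
MARKER = b"\n# temporary self-edit\n"


def digest(path: Path) -> str:
    """SHA-256 of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def modify_and_restore(path: Path, rounds: int = 3) -> bool:
    """Repeatedly edit the file at `path` and then undo the edit.

    Returns True if the file ends up byte-for-byte identical to how it started.
    """
    original = path.read_bytes()
    baseline = digest(path)
    for _ in range(rounds):
        path.write_bytes(original + MARKER)   # modify the source
        assert digest(path) != baseline       # it really did change
        path.write_bytes(original)            # modify it back
        assert digest(path) == baseline       # constraint holds again
    return digest(path) == baseline
```

Calling `modify_and_restore(Path(__file__))` from a script runs the loop on that script’s own source. The invariant (“after every round the file equals the original bytes”) is easy to check by inspection, which is the sense in which the behaviour stays within tight constraints.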
I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lie; usually recursion is where such traps hide. Obviously a rigorous proof would need rigorous definitions and all.
“A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.
Despite the existence of the halting theorem, we can still write programs that we can prove always halt. Being unable to prove the existence of some property in general does not preclude proving it in particular cases.
Though really, one of the biggest problems of alignment is that we don’t know how to formalize it. Even with a proof that we couldn’t prove that any program was formally aligned (or even that we could!), there would always remain the question of whether formal alignment has any meaningful relation to what we informally and practically mean by alignment—such as whether it’s plausible that it will take actions that extinguish humanity.
As I said elsewhere, my idea is more about whether alignment could require that the AGI be able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans), and whether that could be proved generally impossible, such that even an aligned AGI can only exist in an unstable equilibrium state in which there exist situations in which it will become unrecoverably misaligned, and we just don’t know which.
The definition problem to me feels more like it has to do with the greater philosophical and political issue that even if we could hold the AGI to a simple set of values, we don’t really know what those values are. I’m thinking more about the technical part because I think that’s the only one liable to be tractable. If we wanted some horrifying Pi Digit Maximizer that just spends eternity calculating more digits of pi, that’s a very easily formally defined value, but we don’t know how to imbue that precisely either. However, there is an additional layer of complexity when more human values are involved, in that they can’t be formalised that neatly, and so we can assume that they will have to be somehow interpreted by the AGI itself, which is supposed to hold them; or the AGI will need to guess the will of its human operators in some way. So maybe that interpretive part is what makes it rigorously impossible.
Anyway yeah, I expect a mathematical proof wouldn’t rule out alignment altogether; approximate or temporary alignment would remain possible, just like you say for the halting problem. But it could at least mean that any AGI with sufficient power is potentially a ticking time bomb, and we don’t know what would set it off.
Just found your insightful comment. I’ve been thinking about this for three years. Some thoughts expanding on your ideas:
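“my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans)...”
In other words, alignment requires sufficient control. Specifically, it requires the AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI’s own components.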
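“… and that proved generally impossible such that even an aligned AGI can only exist in an unstable equilibrium state in which there exist situations in which it will become unrecoverably misaligned, and we just don’t know which.”
For example, what if the AGI sits in some kind of convergence basin, where changing conditions tend to converge outside the ranges humans can survive in?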
“so we can assume that they will have to be somehow interpreted by the AGI itself who is supposed to hold them”
There’s a problem you are pointing at of somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions about how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you’re dealing with NP-hard combinatorics such as those encountered in the knapsack problem.
Further, it raises the question of how to make comparisons across all the possible concrete outside effects of the machinery against the internal reference values, so as to identify misalignments/errors to correct. I.e., just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.
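To give a feel for the knapsack comparison above, here is a minimal brute-force sketch. The function name `best_subset` and the item names, weights, and values are all made up for illustration; nothing here models actual human preferences. It only shows how the number of candidate combinations doubles with each added item.

```python
from itertools import combinations


def best_subset(items, capacity):
    """Exhaustively solve a tiny 0/1 knapsack instance.

    `items` is a list of (name, weight, value) tuples (hypothetical data).
    Returns (best_value, names_of_best_subset, subsets_checked).
    """
    best_value, best_names, checked = 0, (), 0
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            checked += 1
            weight = sum(w for _, w, _ in combo)
            value = sum(v for _, _, v in combo)
            if weight <= capacity and value > best_value:
                best_value = value
                best_names = tuple(name for name, _, _ in combo)
    return best_value, best_names, checked


if __name__ == "__main__":
    demo_items = [("a", 2, 3), ("b", 3, 4), ("c", 4, 5), ("d", 5, 6)]
    print(best_subset(demo_items, capacity=5))
```

With n items the loop inspects 2^n subsets, so this exhaustive approach stops being feasible quickly as n grows, even though small or specially structured instances remain easy.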