Sounds like you’re imagining that you would not try to prove “there is no AGI that will do what you want”, but instead prove “it is impossible to prove that any particular AGI will do what you want”. So aligned AIs are not impossible per se, but they are unidentifiable, and thus you can’t tell whether you’ve got one?
Well, if you can’t create on demand an AGI that does what you want, isn’t that as good as saying that alignment is impossible? But yeah, I don’t expect it’d be impossible for an AGI to do what we want—just for us to make sure it does on principle.
1) The halting problem can’t be solved in full generality, but there are still many specific programs where it is easy to prove that they will or won’t halt. In fact, approximately all actually-useful software exists within that easier subclass.
We don’t need a fully-general alignment tester; we just need one aligned AI. A halting-problem-like result wouldn’t be enough to stop that. Instead of “you can’t prove every case” it would need to be “you can’t prove any positive case”, which would be a much stronger claim. I’m not aware of any problems with results like that.
(Switching to something like “exponential time” instead of “possible” doesn’t particularly change this; we normally prove that some problem is expensive to solve in the fully-general case, but some instances of the problem can still be solved cheaply.)
2) Even if we somehow got an incredible result like that, that doesn’t rule out having some AIs that are likely aligned. I’m skeptical that “you can’t be mathematically certain this is aligned” is going to stop anyone if you can’t also rule out scenarios like “but I’m 99.9% certain”.
If you could convince the world that mathematical proof of alignment is necessary and that no one should ever launch an AGI with less assurance than that, that seems like you’ve already mostly won the policy battle even if you can’t follow that up by saying “and mathematical proof of alignment is provably impossible”. I think the doom scenarios approximately all involve someone who is willing to launch an AGI without such a proof.
Broadly agree, though I think that here the issue might be more subtle, and that it’s not that determining alignment is like solving the halting problem for a specific software—but that aligned AGI itself would need to be something generally capable of solving something like the halting problem, which is impossible.
Agree also on the fact that this probably still would leave room for an approximately aligned AGI. It then becomes a matter of how large we want our safety margins to be.
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean
Coherence is impossible; any AGI will inevitably sabotage itself
Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate within which it is running (a Turing machine that understands it is a machine, understands the hardware it is running on, understands it can change that hardware or its own programming) and at the same time is able to guarantee sticking to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or even something instead like “you can not produce a proof that any given generally intelligent enough program will stick to any given constraints; it might, but you can’t know beforehand”.
I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lay; usually recursivity is where such traps hide. Obviously a rigorous proof would need rigorous definitions and all.
“A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.
Sounds like you’re imagining that you would not try to prove “there is no AGI that will do what you want”, but instead prove “it is impossible to prove that any particular AGI will do what you want”. So aligned AIs are not impossible per se, but they are unidentifiable, and thus you can’t tell whether you’ve got one?
Well, if you can’t create on demand an AGI that does what you want, isn’t that as good as saying that alignment is impossible? But yeah, I don’t expect it’d be impossible for an AGI to do what we want—just for us to make sure it does on principle.
A couple observations on that:
1) The halting problem can’t be solved in full generality, but there are still many specific programs where it is easy to prove that they will or won’t halt. In fact, approximately all actually-useful software exists within that easier subclass.
We don’t need a fully-general alignment tester; we just need one aligned AI. A halting-problem-like result wouldn’t be enough to stop that. Instead of “you can’t prove every case” it would need to be “you can’t prove any positive case”, which would be a much stronger claim. I’m not aware of any problems with results like that.
(Switching to something like “exponential time” instead of “possible” doesn’t particularly change this; we normally prove that some problem is expensive to solve in the fully-general case, but some instances of the problem can still be solved cheaply.)
2) Even if we somehow got an incredible result like that, that doesn’t rule out having some AIs that are likely aligned. I’m skeptical that “you can’t be mathematically certain this is aligned” is going to stop anyone if you can’t also rule out scenarios like “but I’m 99.9% certain”.
If you could convince the world that mathematical proof of alignment is necessary and that no one should ever launch an AGI with less assurance than that, that seems like you’ve already mostly won the policy battle even if you can’t follow that up by saying “and mathematical proof of alignment is provably impossible”. I think the doom scenarios approximately all involve someone who is willing to launch an AGI without such a proof.
Broadly agree, though I think that here the issue might be more subtle, and that it’s not that determining alignment is like solving the halting problem for a specific software—but that aligned AGI itself would need to be something generally capable of solving something like the halting problem, which is impossible.
Agree also on the fact that this probably still would leave room for an approximately aligned AGI. It then becomes a matter of how large we want our safety margins to be.
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean
Coherence is impossible; any AGI will inevitably sabotage itself
Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
Something else
Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate within which it is running (a Turing machine that understands it is a machine, understands the hardware it is running on, understands it can change that hardware or its own programming) and at the same time is able to guarantee sticking to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or even something instead like “you can not produce a proof that any given generally intelligent enough program will stick to any given constraints; it might, but you can’t know beforehand”.
For an overview of why such a guarantee would turn out impossible, suggest taking a look at Will Petillo’s post Lenses of Control.
I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lay; usually recursivity is where such traps hide. Obviously a rigorous proof would need rigorous definitions and all.
“A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.