Broadly agree, though I think the issue here might be more subtle: it’s not that determining alignment is like solving the halting problem for a specific piece of software, but that an aligned AGI itself would need to be something generally capable of solving something like the halting problem, which is impossible.
I also agree that this would probably still leave room for an approximately aligned AGI. It then becomes a matter of how large we want our safety margins to be.
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean:
1. Coherence is impossible; any AGI will inevitably sabotage itself
2. Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
3. You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
4. You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
5. Something else
Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate it is running on (a Turing machine that understands it is a machine, understands the hardware it is running on, and understands that it can change that hardware or its own programming) and at the same time able to guarantee that it will stick to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3 as well, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or instead something like: “you cannot produce a proof that any given sufficiently generally intelligent program will stick to any given constraints; it might, but you can’t know beforehand”.
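To gesture at where I’d expect the trap to hide, here is the usual halting-style diagonalization transplanted from “does it halt” to “does it respect a constraint” (a toy sketch of my own, with a hypothetical respects checker; an intuition pump rather than a proof about AGI):

```python
# Toy diagonalization in the spirit of the Halting Theorem. Assume we are
# handed a claimed universal checker respects(prog) that returns True iff
# calling prog() never performs the forbidden action.

def forbidden_action():
    print("constraint violated")

def make_diagonal(respects):
    """Given a claimed universal safety checker, build a program it misjudges."""
    def diagonal():
        # Ask the checker about ourselves, then do the opposite of its verdict.
        if respects(diagonal):
            forbidden_action()   # checker said "never violates", so violate
        # else: checker said "will violate", so do nothing and stay safe forever
    return diagonal

# No candidate checker can get its own diagonal program right. For example:
def optimistic_checker(prog):
    return True                  # claims every program is safe

diag = make_diagonal(optimistic_checker)
diag()                           # the checker said "safe", yet this violates
```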
For an overview of why such a guarantee would turn out to be impossible, I suggest taking a look at Will Petillo’s post Lenses of Control.
I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
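Concretely, something like this minimal sketch (assuming the script is run directly from its own source file and has write access to it):

```python
import sys

def main():
    path = sys.argv[0]                       # this script's own source file
    with open(path) as f:
        original = f.read()

    for _ in range(3):                       # the trivial loop
        # Act on the substrate: overwrite our own source with a modified copy.
        with open(path, "w") as f:
            f.write(original + "\n# temporarily modified\n")

        # Restore it exactly, keeping the tight constraint "the source file
        # ends every pass identical to how it began".
        with open(path, "w") as f:
            f.write(original)

        with open(path) as f:
            assert f.read() == original

if __name__ == "__main__":
    main()
```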
I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lie; recursion is usually where such traps hide. Obviously, a rigorous proof would need rigorous definitions and all.
“A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.
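For concreteness, here is a toy decider for one such very specific class: single loops of the form “while n > 0: n -= d” with fixed integers n and d (my own illustration; it says nothing about halting in general):

```python
def member_program(n: int, d: int) -> None:
    # A representative of the restricted class being decided.
    while n > 0:
        n -= d

def halts(n: int, d: int) -> bool:
    # The loop terminates iff it is never entered (n <= 0) or every iteration
    # moves n strictly toward zero (d > 0).
    return n <= 0 or d > 0

assert halts(10, 3)          # a few iterations, then stops
assert not halts(10, 0)      # n never changes: loops forever
assert not halts(10, -1)     # n grows without bound: loops forever
assert halts(-5, 0)          # the loop body never runs
```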