As I said elsewhere, my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans), and whether that might prove generally impossible, such that even an aligned AGI could only exist in an unstable equilibrium state: there would be situations in which it becomes unrecoverably misaligned, and we just don’t know which.
The definition problem to me feels more like it has to do with the greater philosophical and political issue that even if we could bind the AGI to a simple set of values, we don’t really know what those values are. I’m thinking more about the technical part because I think that’s the only one likely to be tractable. If we wanted some horrifying Pi Digit Maximizer that just spends eternity calculating ever more digits of pi, that’s a value which is very easy to define formally, but we don’t know how to imbue an AGI with even that precisely. However, there is an additional layer of complexity when more human values are involved, in that they can’t be formalised that neatly, and so we can assume that they will have to be somehow interpreted by the AGI itself, which is supposed to hold them; or the AGI will need to guess the will of its human operators in some way. So maybe that part is what makes it rigorously impossible.
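To illustrate what I mean by “very easy to define formally” (a toy sketch with made-up names, not a claim about how such a value would actually be implemented in an agent): the Pi Digit Maximizer’s value can be written down as a crisp scoring function in a few lines, which is exactly what we can’t do for human values.

```python
# Toy sketch: the Pi Digit Maximizer's value as a crisp scoring function.
# The hardcoded reference digits stand in for an arbitrary-precision computation.
PI_DIGITS = "141592653589793238462643383279502884197169399375105820974944"

def utility(output: str) -> int:
    """Fully formal objective: how many leading digits of the output match pi."""
    score = 0
    for produced, true in zip(output, PI_DIGITS):
        if produced != true:
            break
        score += 1
    return score

print(utility("14159265358979"))  # 14: every digit correct
print(utility("14159265358970"))  # 13: the last digit is wrong
```

The hard part is not writing this down; it’s getting an agent to actually optimise for it and nothing else, and for human values we can’t even take this first step.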
Anyway yeah, I expect any mathematical proof wouldn’t rule out alignment altogether (approximate or temporary alignment might still be possible), just like you say for the halting problem. But it could at least mean that any AGI with sufficient power is potentially a ticking time bomb, and we don’t know what would set it off.
Just found your insightful comment. I’ve been thinking about this for three years. Some thoughts expanding on your ideas:
my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans)...
In other words, alignment requires sufficient control. Specifically, it requires the AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct the outside effects propagated by the AGI’s own components.
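Here is a minimal sketch of what one pass of that detect-model-simulate-evaluate-correct loop would have to look like; all the names and the simplistic “impact” model are my own assumptions, not any real architecture.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Effect:
    """One concrete outside effect propagated by a component of the system."""
    component: str
    magnitude: float

def control_step(
    detect: Callable[[], List[Effect]],     # enumerate current outside effects
    simulate: Callable[[Effect], float],    # model/simulate downstream impact
    reference: float,                       # internal reference value (tolerated impact)
    correct: Callable[[Effect], None],      # counteract a misaligned effect
) -> List[Effect]:
    """One pass of a detect-model-simulate-evaluate-correct loop.
    Returns the effects that were flagged and corrected."""
    flagged = []
    for effect in detect():
        predicted_impact = simulate(effect)       # model + simulate
        if abs(predicted_impact) > reference:     # evaluate against the reference value
            correct(effect)                       # correct
            flagged.append(effect)
    return flagged

# Toy usage: one effect exceeds the tolerated impact and gets corrected.
effects = [Effect("planner", 0.2), Effect("actuator", 3.5)]
control_step(
    detect=lambda: effects,
    simulate=lambda e: e.magnitude,
    reference=1.0,
    correct=lambda e: print(f"correcting {e.component}"),
)
```

The open question is whether any such loop can have enough capacity to keep up with all the effects a sufficiently capable system actually propagates.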
… and whether that might prove generally impossible, such that even an aligned AGI could only exist in an unstable equilibrium state: there would be situations in which it becomes unrecoverably misaligned, and we just don’t know which.
For example, what if the AGI is in some kind of convergence basin where the changing situations/conditions tend to converge to states outside the ranges humans can survive in?
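A toy way to picture that, as a purely illustrative one-dimensional dynamical system of my own invention: two basins of attraction separated by an unstable equilibrium, with only one attractor inside the survivable range. Which basin the starting conditions fall into decides everything, and near the boundary it is hard to tell which side you are on.

```python
SURVIVABLE = (0.0, 4.0)   # hypothetical range of world-states humans can survive in

def drift(x: float) -> float:
    """Toy dynamics with two attractors, x=2 (inside the survivable range) and x=8
    (outside it), separated by an unstable equilibrium at x=5."""
    return -(x - 2.0) * (x - 5.0) * (x - 8.0)

def attractor(x0: float, dt: float = 0.01, steps: int = 20_000) -> float:
    """Follow the dynamics from x0 and report where the state ends up."""
    x = x0
    for _ in range(steps):
        x += dt * drift(x)
    return x

for x0 in (3.0, 4.9, 5.1):
    x_final = attractor(x0)
    survivable = SURVIVABLE[0] <= x_final <= SURVIVABLE[1]
    print(f"start at {x0}: settles near {x_final:.2f}, survivable={survivable}")
```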
so we can assume that they will have to be somehow interpreted by the AGI itself, which is supposed to hold them
There’s a problem you are pointing at: somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions about how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you’re dealing with NP-hard combinatorics of the kind encountered in the knapsack problem.
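To make the combinatorics concrete, here is a tiny hypothetical sketch of the kind of problem hiding in that mapping, phrased knapsack-style: choose which expressed preferences to satisfy under a shared resource budget, where the weights given to people and contexts are exactly the normative assumptions in question. Exact solutions by brute force are exponential in the number of preferences; this knapsack-like structure is what makes the exact problem NP-hard.

```python
from itertools import combinations
from typing import List, Tuple

# Each expressed preference: (description, resource cost, value under some chosen weighting).
# The weighting already encodes normative assumptions about whose context counts how much.
Preference = Tuple[str, float, float]

def best_subset(prefs: List[Preference], budget: float) -> Tuple[float, List[str]]:
    """Exhaustively pick the subset of preferences with the highest weighted value that
    fits within the budget. Exponential in len(prefs): fine as a toy, hopeless at scale."""
    best_value: float = 0.0
    best_choice: List[str] = []
    for r in range(len(prefs) + 1):
        for subset in combinations(prefs, r):
            cost = sum(p[1] for p in subset)
            value = sum(p[2] for p in subset)
            if cost <= budget and value > best_value:
                best_value, best_choice = value, [p[0] for p in subset]
    return best_value, best_choice

prefs: List[Preference] = [
    ("clean air in region A", 4.0, 7.0),
    ("cheap energy in region A", 5.0, 6.0),
    ("preserve forest B", 3.0, 5.0),
    ("expand farmland over forest B", 3.0, 4.0),
]
print(best_subset(prefs, budget=8.0))
# -> (12.0, ['clean air in region A', 'preserve forest B'])
```

And even this toy version quietly assumes that costs and values are commensurable, and that directly contradictory preferences (the two uses of forest B) are handled by the budget alone – exactly the kind of normative assumption that can’t be read off the raw expressions.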
Further, it raises the question of how to make comparisons between all the possible concrete outside effects of the machinery and the internal reference values, so as to identify misalignments/errors to correct. I.e. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.