Just found your insightful comment. I’ve been thinking about this for three years. Some thoughts expanding on your ideas:
My idea is more about whether alignment could require the AGI to be able to predict its own results and effects on the world (or those of other AGIs like it, as well as of humans)...
In other words, alignment requires sufficient control. Specifically, it requires AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI’s own components.
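A minimal sketch of what such a detect–model–evaluate–correct cycle might look like. Everything here is an illustrative assumption (the function names, the toy scalar "effect", the gain and tolerance values), not a claim about how a real AGI control system would be built:

```python
# Hypothetical closed-loop control sketch: monitor the outside effects of the
# system's own components and correct them against a reference value.
# All names, numbers, and thresholds are illustrative assumptions.

def control_step(detect, model, evaluate, correct, reference, tolerance=0.1):
    """One detect -> model/simulate -> evaluate -> correct cycle."""
    observed = detect()                      # measure outside effects
    predicted = model(observed)              # simulate how they propagate
    error = evaluate(predicted, reference)   # compare against the reference value
    if abs(error) > tolerance:
        correct(error)                       # apply a corrective intervention
    return error

# Toy usage: a scalar "effect" that should sit at the reference value 0.0.
state = {"effect": 0.5}
err = control_step(
    detect=lambda: state["effect"],
    model=lambda x: x * 1.1,  # assume effects amplify slightly as they propagate
    evaluate=lambda pred, ref: pred - ref,
    correct=lambda e: state.update(effect=state["effect"] - 0.5 * e),
    reference=0.0,
)
```

The point of the sketch is only that each stage named above (detect, model/simulate, evaluate, correct) has to exist and have enough capacity; whether such a loop can ever be closed over all of an AGI's outside effects is exactly what is in question.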
… and that this proved generally impossible, such that even an aligned AGI could only exist in an unstable equilibrium: there would be situations in which it becomes unrecoverably misaligned, and we just don’t know which ones.
For example, what if the AGI is in some kind of convergence basin, where the changing situations/conditions tend to converge outside the ranges humans can survive under?
So we can assume that these values will have to be somehow interpreted by the AGI itself, which is supposed to hold them.
There’s a problem you are pointing at: somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions about how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you’re dealing with NP-hard combinatorics, as encountered in the knapsack problem.
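As a toy illustration of those knapsack-style combinatorics (the numbers and the framing of "preferences with satisfaction values and implementation costs" are invented for the example, not a proposed mechanism): even deciding which subset of expressed preferences to honour under a fixed capacity budget is already a 0/1 knapsack problem, which is NP-hard in general.

```python
# Toy 0/1 knapsack: pick a subset of expressed preferences, each with a
# "satisfaction value" and an "implementation cost", under a capacity budget.
# Dynamic programming solves small integer instances exactly, but the general
# problem is NP-hard. All figures below are illustrative.

def best_subset_value(prefs, capacity):
    """prefs: list of (value, cost) pairs; returns max total value within capacity."""
    dp = [0] * (capacity + 1)
    for value, cost in prefs:
        # iterate costs downward so each preference is used at most once
        for c in range(capacity, cost - 1, -1):
            dp[c] = max(dp[c], dp[c - cost] + value)
    return dp[capacity]

# Three conflicting preferences; only some combinations fit the budget.
prefs = [(60, 10), (100, 20), (120, 30)]
print(best_subset_value(prefs, 50))  # prints 220: the 100 + 120 combination wins
```

And this toy version assumes the values are already scalar and commensurable; the normative mapping problem above is about how you would even get to such numbers in the first place.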
Further, it raises the question of how to make comparisons across all the possible concrete outside effects of the machinery against the internal reference values, so as to identify misalignments/errors to correct. I.e. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.
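To make that comparison step concrete with a hedged sketch (the effect names, reference values, and tolerance are all made up for illustration): flagging concrete outside effects that deviate from internal reference values might look like the following, which only works once effects have already been measured and made commensurable with the references, i.e. it presupposes the hard part.

```python
# Toy misalignment detector: compare concrete observed effects against
# internal reference values and report those outside tolerance.
# Effect names, references, and tolerances are illustrative assumptions.

def find_misalignments(observed, references, tolerance=0.05):
    """Return {effect: (observed, reference)} for effects deviating beyond tolerance."""
    return {
        name: (value, references[name])
        for name, value in observed.items()
        if name in references and abs(value - references[name]) > tolerance
    }

observed = {"co2_ppm_delta": 0.20, "land_use_delta": 0.01}
references = {"co2_ppm_delta": 0.00, "land_use_delta": 0.00}
print(find_misalignments(observed, references))  # only co2_ppm_delta is flagged
```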