So, to summarize, I think the key upside of this dialogue is a rough preliminary sketch of a bridge between the formalism of ethicophysics and how one might hope to use it in the context of AI existential safety.
As a result, it should be easier for readers to evaluate the overall approach.
At the same time, I think the main open problem for anyone interested in this (or in any other) approach to AI existential safety is how well it holds up with respect to recursive self-improvement.
Both powerful AIs and ecosystems of powerful AIs have an inherently very high potential for recursive self-improvement. That potential might not be unlimited; it might encounter various thresholds at which it saturates, at least for some periods of time. Nevertheless, it is likely to produce a period of rapid change, during which not only the capabilities but also the nature of the AI systems in question (their architecture, their algorithms, and, unfortunately, their values) might change dramatically.
So, any approach to AI existential safety (this approach, and any other possible approach) eventually needs to be evaluated with respect to this likely rapid self-improvement and the various forms of self-modification it entails.
Basically: is the coming self-improvement trajectory completely unpredictable, or can we hope for some invariants to be preserved? And, specifically, can we find invariants which are both feasible to preserve during rapid self-modification and likely to result in outcomes we would consider reasonable?
E.g., if the resulting AIs are mostly “supermoral”, can we simply rely on them to make sure that their successors and creations are “supermoral” as well, or are extra efforts on our part required to make this more likely? We would probably want to look closely at the “details of the ethicophysical dynamics” in connection with this, rather than just relying on high-level “statements of hope”...
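To make the invariant question slightly more concrete, here is a deliberately toy sketch (all names, such as `Agent`, `is_supermoral`, and `propose_successor`, are hypothetical illustrations and not part of the ethicophysics formalism): a self-modification loop that only accepts a successor if a stated invariant still holds. The real open question, of course, is whether any such invariant could be both meaningful and actually checkable for systems modifying themselves at this scale.

```python
# Toy sketch only: a guarded self-modification loop with a hypothetical
# "supermorality" invariant. Not a real safety mechanism.

from dataclasses import dataclass
import random


@dataclass
class Agent:
    capability: float   # grows with each self-improvement step
    value_drift: float  # accumulated deviation from the original values


def is_supermoral(agent: Agent, tolerance: float = 0.1) -> bool:
    """Hypothetical invariant: value drift stays within a fixed tolerance."""
    return agent.value_drift <= tolerance


def propose_successor(agent: Agent) -> Agent:
    """Each modification boosts capability but may also drift values a bit."""
    return Agent(
        capability=agent.capability * 1.5,
        value_drift=agent.value_drift + random.uniform(0.0, 0.05),
    )


def self_improve(agent: Agent, steps: int) -> Agent:
    """Accept a successor only if the invariant still holds; otherwise stop."""
    for _ in range(steps):
        candidate = propose_successor(agent)
        if not is_supermoral(candidate):
            break  # the invariant would be violated; reject the modification
        agent = candidate
    return agent


if __name__ == "__main__":
    final = self_improve(Agent(capability=1.0, value_drift=0.0), steps=20)
    print(f"capability={final.capability:.1f}, drift={final.value_drift:.3f}, "
          f"supermoral={is_supermoral(final)}")
```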