Formal alignment is eventually possible; more sparsity means more verifiability. Formally reasoning through implications is possible today, and will only become more tractable in the future. Once that is practical, sparse models can verify each other and draw on the benefits of open-source game theory to create provably honest treaties and agreements, and by verifying that the margin to behaviors that cause unwanted effects is large all the way through the network, they will be able to form cooperation groups that resist adversarial examples quite well. However, messy models will still exist for a long time, and humans will need to be upgraded with better tools in order to keep up.
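To make "verifying that the margin is large all the way through the network" a bit more concrete, here is a toy sketch using interval bound propagation (one standard certification technique, chosen here as an illustration rather than anything the comment itself specifies). The network, perturbation radius, and "unwanted behavior" threshold are all invented stand-ins; real certification tools use much tighter relaxations.

```python
import numpy as np

def ibp_margin(weights, biases, x, eps, threshold):
    """Propagate the input box [x - eps, x + eps] through a small ReLU net
    and return the worst-case margin of the scalar output below `threshold`.
    A positive result certifies that no input in the box can push the output
    past the threshold, i.e. the margin holds all the way through the net."""
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(zip(weights, biases)):
        # Interval arithmetic for an affine layer: split W into positive
        # and negative parts so the bounds stay sound.
        W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
        new_lo = W_pos @ lo + W_neg @ hi + b
        new_hi = W_pos @ hi + W_neg @ lo + b
        lo, hi = new_lo, new_hi
        if i < len(weights) - 1:            # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return float(threshold - hi[0])         # worst case over the whole box

# Hypothetical 2-3-1 network and an arbitrary "unwanted behavior" threshold.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
margin = ibp_margin(weights, biases, x=np.array([0.5, -0.2]),
                    eps=0.1, threshold=5.0)
print("certified margin:", margin)          # > 0 means the box is certified
```

The point of the sketch is only the shape of the argument: a certificate is computed layer by layer, so a large certified margin is a property of the whole network, not of any single sampled input.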
Performance is critical; the alignment property must be trustable with relatively little verification, because exponential verification algorithms destroy any possibility of performance. Getting formal verification below exponential cost requires building representations which deinterfere well, and similarly requires keeping the environment from entering chaotic states (eg, fires) in places that are not built to be invariant to them (eg, via fire protection).
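A cartoon of why deinterfering representations matter for keeping verification sub-exponential: if a safety property has to be checked over the joint behavior of n interacting boolean modules, brute force costs 2^n checks, but if the modules provably do not interfere, the property decomposes and can be checked module by module in about 2n checks. The modules and the property below are invented purely to show the counting argument.

```python
from itertools import product

N_MODULES = 16

def module_ok(i, state):
    """Hypothetical per-module safety check (True = safe): the module stays
    within its allowed state range. With deinterfering representations,
    module i's verdict depends only on its own state."""
    return state in (0, 1) and i < N_MODULES

# Monolithic verification: an entangled representation could couple the
# modules, so every joint configuration must be enumerated.
joint_checks = 0
joint_safe = True
for joint_state in product([0, 1], repeat=N_MODULES):
    joint_checks += 1
    joint_safe &= all(module_ok(i, s) for i, s in enumerate(joint_state))

# Compositional verification: the modules deinterfere, so check each alone.
comp_checks = 0
comp_safe = True
for i in range(N_MODULES):
    for s in (0, 1):
        comp_checks += 1
        comp_safe &= module_ok(i, s)

print(f"joint: {joint_checks} checks, compositional: {comp_checks} checks")
# joint: 65536 checks, compositional: 32 checks -- exponential vs. linear
```

Real systems do not decompose this cleanly; the point is just that the exponent comes from interaction, and representations (and environments) engineered to limit interaction are what make the verification cost additive rather than multiplicative.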
In the meantime, the key question is how to preserve the life and agency of information structures (eg, people) as best we can. I expect that an AI eventually able to create perfect, fully-formally-verified intention-coprotection agreements will want us to have done our best at the approximate versions as early as we could; I doubt it would waste energy torturing our descendants if we fall short, but certainly the more information that can be retained to produce a society of intention-coprotecting agents with the full variety of life, the better.
[Comment edited using gpt3, text-curie-001]
Hmm, I think it’s possible that sparser models will be much easier to verify, but it hardly seems inevitable. Certainly not clear if sparser models will be so much more interpretable that alignment becomes asymptotically cheap.