The Case for Theorems
Why do we want theorems for AI Safety research? Is it a misguided reach for elegance and mathematical beauty? A refusal to confront the inherently messy and complicated nature of these systems? I’ll argue not.
Desiderata for Existential Safety
When dealing with powerful AI systems, we want arguments for their existential safety that satisfy the following desiderata:
- Robust to scale
  - Especially robustness to “scaling up”/capability amplification
  - Cf. “The Sharp Left Turn”
- Generalise far out of distribution, to test/deployment environments that are unlike our training environments
- We have very high “all things considered” confidence in them
  - Failure might imply existential catastrophe, so we may have a small margin of error
  - We want arguments that not only tell us the system is existentially safe with high probability, but in which we also have high confidence that if the argument says X obtains given Y, then X actually obtains given Y with very high likelihood (see the sketch after this list)
- Translate to successor/derivative systems
  - Ideally, we wouldn’t want to have to independently verify safety properties for any successor systems our system might create (or, more likely, derivative systems)
  - If parent systems are robustly safe and sufficiently capable, we may be able to offload the work of aligning child systems to their parents
- Robust to adversarial optimisation?
  - I am not actually sure to what extent the safety properties of AI systems need to be adversarially robust to be existentially safe. I think imagining that the system is actively trying to break its safety properties is the wrong framing (it conditions on having designed a system that is not safe[1]), and I don’t know to what extent strategic interactions in multipolar scenarios would exert adversarial pressure on the systems.
  - But I am not very confident in this; it does not seem too implausible to me that adversarial robustness could be a necessary property for existential safety.
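To make the “all things considered” confidence desideratum more concrete, here is one way to decompose it (an illustrative sketch of my own, not anything load-bearing). Write $S$ for “the safety argument is sound”, and suppose the argument, if sound, guarantees $X$ given $Y$ with probability at least $1 - \varepsilon$. Then

$$P(X \mid Y) \;\ge\; P(X \mid Y, S)\,P(S \mid Y) \;\ge\; (1 - \varepsilon)\,P(S \mid Y).$$

All-things-considered confidence is thus bottlenecked both by the strength of the argument’s conclusion ($\varepsilon$) and by our confidence that the argument itself is sound ($P(S \mid Y)$); the latter is the meta-level uncertainty discussed below.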
Provided that the preconditions of our theorems actually describe the real world and real-world systems well (a non-trivial assumption), theorems and similar formal arguments can satisfy all of the above desiderata. Furthermore, it may well be the case that only (semi-)formal/rigorous arguments satisfy all of them.
Indeed, under some plausible assumptions, non-rigorous arguments may fail to satisfy any of these desiderata.
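Schematically (my own rendering, not a reference to any particular theorem), a safety theorem and its modelling assumption combine as

$$\underbrace{\forall s.\ \big(\mathrm{Pre}(s) \Rightarrow \mathrm{Safe}(s)\big)}_{\text{theorem}} \;\wedge\; \underbrace{\mathrm{Pre}(s_{\text{deployed}})}_{\text{modelling assumption}} \;\Longrightarrow\; \mathrm{Safe}(s_{\text{deployed}}).$$

The theorem itself holds with certainty, so the residual uncertainty is concentrated in the modelling assumption: $P(\mathrm{Safe}(s_{\text{deployed}})) \ge P(\mathrm{Pre}(s_{\text{deployed}}))$. This is what makes “the preconditions describe real-world systems well” the load-bearing step.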
When Are Theorems Not Needed?
Rigorous arguments for safety are less compelling in worlds where iterative alignment strategies are feasible.
For example, if takeoff is slow and continuous, we may be able to gather a wealth of empirical data from progressively more powerful systems, and to competently execute on governance strategies.
Civilisation does not often sleepwalk into disaster.
In such cases, where empirical data is abundant and iterative cycles complete quickly enough (i.e. we develop “good enough” alignment techniques for the next generation of AI systems before such systems are widely deployed), I would be more sympathetic to scepticism of formalism. If empirical data abound, arguments for safety grounded in said data (though all observation is theory-laden) would have less meta-level uncertainty than arguments for safety rooted in theory[2].
Closing Remarks
However, in worlds where iterative design fails (e.g. takeoff is fast or discontinuous), we are unlikely to have an abundance of empirical data, and rigorous arguments may be the only viable approach to verifying the safety of powerful AI systems.
Moreover, most of our existential risk is concentrated in worlds with fast/discontinuous takeoff (even if the technical problem is intractable, in slow/continuous takeoff worlds governance approaches have a lot more surface area to execute on to alleviate risk [see my earlier point that civilisation does not often sleepwalk into disaster]). As such, technical attempts to reduce risk might have the largest impact by focusing on worlds with fast/discontinuous takeoff.
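One rough way to write down this prioritisation (purely illustrative; the probabilities and risk reductions are placeholders, not estimates) is

$$\mathbb{E}[\text{risk reduced}] \;=\; P(\text{fast/discontinuous takeoff})\,\Delta_{\text{fast}} \;+\; P(\text{slow/continuous takeoff})\,\Delta_{\text{slow}}.$$

If existential risk is concentrated in fast-takeoff worlds, and governance already covers much of the slow-takeoff term, then $\Delta_{\text{slow}}$ for marginal technical work is small and the first term dominates.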
For that purpose, it seems formal arguments are the best tools we have[3].
[1] If the system is trying/wants to break its safety properties, then it’s not safe. A system that is only safe because it’s not powerful enough is not robust to scaling up/capability amplification.
[2] It’s very easy to build theoretical constructs that grow disconnected from reality, and don’t quite carve it at the joints. We may be led astray by arguments that don’t quite describe real world systems all that well. When done right, theory is a powerful tool, but it’s very easy to do theory wrong; true names are not often found.
[3] Theorems are the worst tools for presenting AI safety arguments — except all others that have been tried.