Buck’s claim is that safety engineering is mostly focused on problems where there are a huge number of parts that we can understand and test (e.g. airplanes) and the main question is about ensuring failure rates are sufficiently low under realistic operating conditions.
In the case of AIs, we might have systems that look more like a single (likely black-box) AI with access to various tools and the safety failures we’re worried about are concentrated in the AI system itself.
This seems much more analogous to insider threat reduction work than the places where safety engineering is typically applied.
I agree we might end up in a world like that, where it proves impossible to make a decent safety case. I just think of the ~whole goal of alignment research as figuring out how to avoid that world, i.e. of figuring out how to mitigate/estimate the risk as much/precisely as needed to make TAI worth building.
Currently, AI risk estimates are mostly just verbal statements like “I don’t know man, probably some double digit chance of extinction.” This is exceedingly unlike the sort of predictably tolerable risk humanity normally expects from its engineering projects, and which e.g. allows for decent safety cases. So I think it’s quite important to notice how far we currently are from being able to make them, since that suggests the scope and nature of the problem.
I agree we might end up in a world like that, where it proves impossible to make a decent safety case.
I don’t think that thing I said is consistent with “impossible to make a safety case good enough to make TAI worth building”? I think you can probably make a safety case which gets to around 1-5% risk while having AIs that are basically black boxs and which are very powerful, but not arbitrarily powerful. (Such a safety case might require decently expensive measures.) (1-5% risk can be consistent with this being worth it—e.g. if there is an impending hot war. That said, it is still a horrifying level of risk that demands vastly more investment.)
See e.g. control for what one part of this safety case could look like. I think that control can go quite far.
Other parts could look like:
A huge array of model organisms experiments which convince us that P(serious misalignment) is low (or low given XYZ adjustments to training).
coup probes or other simple runtime detection/monitoring techiques.
ELK/honesty techniques which seem to work in a wide variety of cases.
Ensuring AIs closely imitate humans and generalize similarly to humans in the cases we checked.
Ensuring that AIs are very unlikely to be schemers via making sure most of their reasoning happens in CoT and their forward passes are quite weak. (I often think about this from the perspective of control, but it can also be considered separately.)
It’s possible you don’t think any of this stuff gets off the ground. Fair enough if so.
Buck’s claim is that safety engineering is mostly focused on problems where there are a huge number of parts that we can understand and test (e.g. airplanes) and the main question is about ensuring failure rates are sufficiently low under realistic operating conditions.
In the case of AIs, we might have systems that look more like a single (likely black-box) AI with access to various tools and the safety failures we’re worried about are concentrated in the AI system itself.
This seems much more analogous to insider threat reduction work than the places where safety engineering is typically applied.
I agree we might end up in a world like that, where it proves impossible to make a decent safety case. I just think of the ~whole goal of alignment research as figuring out how to avoid that world, i.e. of figuring out how to mitigate/estimate the risk as much/precisely as needed to make TAI worth building.
Currently, AI risk estimates are mostly just verbal statements like “I don’t know man, probably some double digit chance of extinction.” This is exceedingly unlike the sort of predictably tolerable risk humanity normally expects from its engineering projects, and which e.g. allows for decent safety cases. So I think it’s quite important to notice how far we currently are from being able to make them, since that suggests the scope and nature of the problem.
I don’t think that thing I said is consistent with “impossible to make a safety case good enough to make TAI worth building”? I think you can probably make a safety case which gets to around 1-5% risk while having AIs that are basically black boxs and which are very powerful, but not arbitrarily powerful. (Such a safety case might require decently expensive measures.) (1-5% risk can be consistent with this being worth it—e.g. if there is an impending hot war. That said, it is still a horrifying level of risk that demands vastly more investment.)
See e.g. control for what one part of this safety case could look like. I think that control can go quite far.
Other parts could look like:
A huge array of model organisms experiments which convince us that P(serious misalignment) is low (or low given XYZ adjustments to training).
coup probes or other simple runtime detection/monitoring techiques.
ELK/honesty techniques which seem to work in a wide variety of cases.
Ensuring AIs closely imitate humans and generalize similarly to humans in the cases we checked.
Ensuring that AIs are very unlikely to be schemers via making sure most of their reasoning happens in CoT and their forward passes are quite weak. (I often think about this from the perspective of control, but it can also be considered separately.)
It’s possible you don’t think any of this stuff gets off the ground. Fair enough if so.