Thanks for writing this up. I really liked this framing when I first read about it, but reading this post has helped me reflect more deeply on it.
I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
I wouldn’t call it correct or incorrect, only useful in some ways and not others. Whether it’s net positive may depend on whether people use it in cases where it is appropriate and useful.
As an educational resource and communication tool, I think this framing is useful. It’s often helpful to collapse complex topics onto a few axes and construct idealised patterns, in this case a difficulty distribution on which we place techniques according to the kinds of scenarios where they provide marginal safety. This could help people initially orient to existing ideas in the field or in governance, or possibly inform funding decisions.
However, as a tool for reducing fundamental confusion about AI systems, I don’t think it’s very useful. The issue is that many of our current ideas in AI alignment rest significantly on pre-formal conjecture that is not grounded in observations of real-world systems (see The Alignment Problem from a Deep Learning Perspective). Until we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, the scale seems to attempt to describe reality via the set of solutions that would produce some outcome in it, which strikes me as an abstraction unlikely to be useful.
In other words, I think it’s possible that this framing leads to confusion between the map and the territory, where the map is making predictions about which tools will be useful in territory we have yet to observe.
To illustrate how such an axis may be unhelpful if you were trying to think more clearly, consider the equivalent for medicine. Diseases can be divided into classes by how difficult they are to cure, with corresponding research being useful for curing them: cuts and scrapes are self-mending, infections require the appropriate antibiotics or antivirals, and immune disorders and cancers are diverse and therefore span various levels of difficulty across their instantiations. It’s not clear to me that biologists or doctors would find much use in conjecture about exactly how hard each disease is to cure versus how likely it is to occur, especially in worlds where you lack a fundamental understanding of the underlying phenomena. Possibly a closer analogy would be trying to troubleshoot the ways evolution can generate highly dangerous species like humans.
I think my attitude here leads into further takes about good and bad ways to discuss which research we should prioritise, but I’m not sure how to convey those concisely. Hopefully this is useful.
You’re right that I think this is more useful as an unscientific way for (probably less technical) governance and strategy people to orient towards AI alignment than for actually carving up reality. I wrote the post with that audience and that framing in mind. By the same logic, your chart of how difficult various injuries and diseases are to fix would be very useful, e.g. as a poster in a military triage tent, even if it isn’t useful for biologists or trained doctors.
However, while I didn’t explore the idea much, I do think it is possible to cash this scale out as an actual variable related to system behavior, something along the lines of ‘how adversarial are systems / how many extra bits of optimization over and above behavioral feedback are needed’. See here for further discussion of that. Evan Hubinger also talked in a bit more detail about what might be computationally different about ML models in low- vs high-adversarialness worlds here.