Giving people a slider with “safety” written on one end and “capability” written on the other, and then trying to get people to set it close enough to the “safety” end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
An improvement on this situation would be something that looked more like a theoretical solution to Goodhart’s law: an (in some sense) optimal setting of the slider for the trade-off between alignment and capabilities (“this is how you get the most of what you want”), giving ML researchers a target to orient their algorithms toward.
Even better (though in a similar spirit) would be an approach where capability and alignment go hand in hand: a way to directly optimize for “what I mean, not what I say”, such that it is obvious that things just get worse if you depart from it.
However, maybe those things are just pipe dreams. The hope for them should not be the fundamental reason to ignore impact measures unless promising approaches in those other two categories are actually pointed out; and even then, impact measures would still seem desirable as a backup plan.
My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they fare better with respect to the objection which follows.
Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure.)
In other words, conditioned on having a story more like “this is how you get the most of what you want” rather than a slider reading “safety ------- capability”, I more expect to see a mild optimizer as opposed to an impact measure.
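To make the quantilization comparison concrete, here is a minimal sketch of a quantilizer in Python. Everything in it is illustrative rather than taken from any particular implementation: a quantilizer draws actions from some trusted base distribution and, instead of taking the argmax of the (possibly misspecified) proxy utility, picks randomly from the top q-fraction of those samples, so the amount of optimization pressure applied to the proxy is bounded by construction.

```python
import random

def quantilize(base_samples, proxy_utility, q=0.05, rng=random):
    """Pick an action from the top q-fraction of base samples under the proxy.

    `base_samples` is a list of actions drawn from a trusted base distribution
    (e.g. imitating typical human behavior); `proxy_utility` is the possibly
    misspecified objective. The proxy is only used to filter down to the top
    q-quantile, after which we choose uniformly at random, rather than being
    maximized outright.
    """
    ranked = sorted(base_samples, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# Hypothetical usage with made-up stand-ins for the base distribution and proxy.
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
chosen = quantilize(samples, proxy_utility=lambda a: -abs(a - 0.5), q=0.05)
```

The contrast I care about is that the knob here (q) directly limits how hard the proxy gets pushed on, rather than adding a separate impact metric which is itself then optimized against.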
Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn’t exactly what we want.
Some attempted impact measures visibly run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem once a lot of optimization pressure is applied. This gives reason for concern even if you can’t spot a concrete problem with a given impact measure: impact measures don’t address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
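As a toy illustration of the nearest-unblocked-strategy failure mode (numbers and strategy names entirely made up, with a simple penalty standing in for an imperfect impact measure):

```python
# Each candidate strategy has a proxy score (what the specified objective sees)
# and a true value (what we actually want). The "patch" penalizes the one
# failure mode we happened to notice.
strategies = {
    "do_the_task":        {"proxy": 1.0, "true_value": 1.0},
    "exploit_loophole_A": {"proxy": 5.0, "true_value": -10.0},  # the failure we noticed and patched
    "exploit_loophole_B": {"proxy": 4.9, "true_value": -10.0},  # a nearby variant the patch misses
}

def patched_proxy(name):
    penalty = 100.0 if name == "exploit_loophole_A" else 0.0
    return strategies[name]["proxy"] - penalty

# Applying full optimization pressure to the patched proxy simply lands on the
# nearest strategy the patch fails to block.
best = max(strategies, key=patched_proxy)
print(best)  # -> exploit_loophole_B
```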
If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
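Spelling out the penalty framing (notation mine, purely for illustration): write $U$ for the specified utility, $I$ for the impact measure, and $\lambda \ge 0$ for the penalty weight, so the agent optimizes

$$U_\lambda(x) = U(x) - \lambda \, I(x).$$

If $I$ correctly measures impact, a maximizer of $U_\lambda$ loses at most some value relative to a maximizer of an aligned $U$; but if $I$ is even slightly off, the maximizer may be steered toward exactly those high-$U$ outcomes where $I$ under-counts impact, which is the sense in which the penalty can make a utility function less friendly.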
(It is a desirable feature of a safety measure that it does not risk decreasing alignment.)
On the other hand, a mild optimizer seems to capture the spirit of what’s wanted from low-impact approaches.
This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
Both mild optimization and impact measures require separate approaches to “doing what people want”.
Arguably this is OK, because they could greatly lower the bar for how well-aligned the specified utility function needs to be. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of “do what I want” before we can be confident even in low-impact or mild-optimization approaches: it is difficult to confidently say that an approach avoids the risk of hugely violating your preferences while we remain so confused about what human preference even is.