This seems to lead to the solution of trying to make your metric one-way, in the sense that your metric should:
- Provide an upper bound on the dangerousness of your network.
- Compress the space of networks that map to approximately the same dangerousness level at the low end of dangerousness, and expand that space at the upper end. Then training your network to minimize the metric works as intended, but training your network to maximize the metric lands you in a degenerate area with technically very high measured danger but in actuality very low dangerousness.
We can hope (or possibly prove) that as you optimize the metric upward you become subject to Goodhart's curse, but that the opposite occurs at the lower end.
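As a rough illustration of the intended asymmetry, here is a toy sketch. All names and functional forms below are hypothetical choices of mine, not from any real dangerousness metric: a two-parameter "network" `theta`, a `true_danger` quantity we care about, and a `metric` that upper-bounds it by adding an always-positive slack term. Gradient descent on the metric genuinely drives true danger toward zero, while gradient ascent from a safe starting point only inflates the slack, producing high measured danger with zero true danger.

```python
import math

# Hypothetical toy model: theta = (a, b) stands in for a network's parameters.
def true_danger(theta):
    # The quantity we actually care about depends only on a.
    a, b = theta
    return a * a

def metric(theta):
    # One-way proxy: true_danger plus a slack term exp(b) > 0,
    # so the metric is always a strict upper bound on true danger.
    a, b = theta
    return a * a + math.exp(b)

def grad_metric(theta):
    # Analytic gradient of the metric with respect to (a, b).
    a, b = theta
    return (2 * a, math.exp(b))

def train(theta, sign, steps, lr=0.1):
    # sign = -1: train to minimize the metric; sign = +1: train to maximize it.
    for _ in range(steps):
        ga, gb = grad_metric(theta)
        theta = (theta[0] + sign * lr * ga, theta[1] + sign * lr * gb)
    return theta

# Minimizing the metric from a somewhat-dangerous start removes real danger:
down = train((1.0, 0.0), sign=-1, steps=200)

# Maximizing it from a safe start only inflates the slack term: measured
# "danger" grows while true danger stays exactly zero (Goodhart's curse).
up = train((0.0, 0.0), sign=+1, steps=10)
```

The compression/expansion idea shows up in the slack term: near the low end the slack shrinks toward zero, so low metric values pin down low true danger, while at the high end an entire half-space of parameters inflates the metric without touching true danger at all.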