Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a “true alignment” level. They all seem to be about noticing specific problems. I have seen none that try to quantify a “semi-aligned power” tradeoff between the good it does and the harm it does.
I think Eliezer’s early writing (and, AFAIK, his current thinking), which holds that alignment must be perfect or all is lost, with nothing in between, probably makes the goal impossible.