Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a "true alignment" level; they all seem to be about noticing specific problems. I have seen none that try to quantify a "semi-aligned power" tradeoff between the good such a system does and the harm it does.
I think the position in Eliezer's early writing (and, AFAIK, his current thinking), that alignment must be perfect or all is lost, with nothing in between, probably makes the goal impossible.
You can measure AlphaGo's ability to play Go by having it play Go, a task you can specify almost entirely mechanically: just have it play a game against a pro. We have no comparable measurement for ethics.
Do you know of any other sketches of how to measure alignment that are reasonably close to mechanically specified?