Yeah, there are a lot of sketches for how to test a system for various specific behaviors. But there's no actual gears-level definition of what would count as succeeding at alignment in a way that does any good while doing no harm (or acceptably small harm, "acceptably small" being the key undefined variable). A brick is aligned in the sense that it does no harm. But it also doesn't make anyone immortal or solve any resource-allocation pains that humans have.
Do you know of any other sketches of how to measure that which are reasonably close to mechanically specified?
Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a "true alignment" level; they all seem to be about noticing specific problems. I have seen none that try to quantify the "semi-aligned power" tradeoff between the good a system does and the harm it does.
I think the position in Eliezer's early writing (and, AFAIK, his current thinking), that it must be perfect or all is lost with nothing in between, probably makes the goal impossible.
You can measure AlphaGo's ability to play Go by letting it play Go, which you can mechanically specify very well: just have it play a game against a pro. We don't have a similar measurement for ethics.
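To make the contrast concrete, here is a minimal sketch of what "mechanically specifiable" looks like for game strength: a win-rate loop against a reference opponent. Note the `play_game` function and the `strength` field are hypothetical stand-ins for a real game engine, not AlphaGo's actual interface; the point is only that the evaluation procedure itself is fully specified, whereas no analogous loop exists for "does good while doing acceptably small harm."

```python
import random

def play_game(agent, opponent):
    # Stand-in for running a full game of Go between the two players.
    # Here we fake the outcome with a Bradley-Terry-style coin flip
    # based on a hypothetical `strength` rating; a real harness would
    # run the game engine and return whether `agent` won.
    p_win = agent["strength"] / (agent["strength"] + opponent["strength"])
    return random.random() < p_win

def estimate_win_rate(agent, opponent, n_games=1000):
    # The whole evaluation is mechanical: play n games, count wins.
    wins = sum(play_game(agent, opponent) for _ in range(n_games))
    return wins / n_games
```

The key property is that nothing in the loop requires judgment: the win condition is checkable by a machine. The missing piece for alignment is precisely the `play_game` analogue, a checkable procedure that scores "did good, avoided harm" for a single episode.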