Yeah, there are a lot of sketches for how to test a system for various specific behaviors. But there's no actual gears-level definition of what would count as succeeding at alignment in a way that does any good while doing no harm (or acceptably small harm, "acceptably small" being the key undefined variable). A brick is aligned in the sense that it does no harm. But it also doesn't make anyone immortal or solve any resource-allocation pains that humans have.
Do you know of any other sketches of how to measure that which are reasonably close to mechanically specified?
Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a "true alignment" level; they all seem to be about noticing specific problems. I have seen none that try to quantify the "semi-aligned power" tradeoff between the good a system does and the harm it does.
I think the position in Eliezer's early writing (and, AFAIK, his current thinking), that it must be perfect or all is lost with nothing in between, probably makes the goal impossible.
You can measure AlphaGo's ability to play Go by letting it play Go, which you can mechanically specify very well: just have it play a game against a pro. We don't have a similar measurement for ethics.
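To make the contrast concrete, here is a minimal sketch of what "mechanically specifiable" looks like for game strength: a win-rate loop against a reference opponent. Note the `play_game` function and the `strength` field are hypothetical stand-ins for a real game engine, not AlphaGo's actual interface; the point is only that the evaluation procedure itself is fully specified, whereas no analogous loop exists for "does good while doing acceptably small harm."

```python
import random

def play_game(agent, opponent):
    # Stand-in for running a full game of Go between the two players.
    # Here we fake the outcome with a Bradley-Terry-style coin flip
    # based on a hypothetical `strength` rating; a real harness would
    # run the game engine and return whether `agent` won.
    p_win = agent["strength"] / (agent["strength"] + opponent["strength"])
    return random.random() < p_win

def estimate_win_rate(agent, opponent, n_games=1000):
    # The whole evaluation is mechanical: play n games, count wins.
    wins = sum(play_game(agent, opponent) for _ in range(n_games))
    return wins / n_games
```

The key property is that nothing in the loop requires judgment: the win condition is checkable by a machine. The missing piece for alignment is precisely the `play_game` analogue, a checkable procedure that scores "did good, avoided harm" for a single episode.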