Also… alignment is obviously a continuum, and of course 100% alignment with all human values is impossible.
A different thing you could prove is whether it’s possible to guarantee human control over an AI system as it becomes more intelligent.
There’s also a concern that a slightly unaligned system may become more and more misaligned as its intelligence is scaled up (either by humans re-building/training it with more parameters/hardware or via recursive self-improvement). It would be useful if someone could prove whether that is impossible to prevent.
I need to think about this more and read Yampolskiy’s paper to really understand what would be most useful to prove possible or impossible.