If we don’t have a preliminary definition of human values
Another, possibly even larger, problem is that the values we know of vary widely and even conflict among people.
Take pain avoidance as an example: maximizing pain avoidance might still leave some people unhappy and even suffering. Sure, that would be a minority, but are we ready to exclude minorities from alignment, even small ones?
I would argue that any defined set of values would leave some minority of people suffering. Who would decide which minorities matter more or less, what size of minority is acceptable to leave behind, and so on?
I think this makes the whole idea of alignment to some set of “human values” too ill-defined to be workable.
One more contradiction: are human values allowed to change, or are they frozen? I think they may change as humanity evolves. But then, as AI interacts with humanity, it could be persuasive enough to push values in whatever direction it favors, which might not be a desirable outcome.
People have been known to value racial purity and to support genocide. Given sufficiently convincing rhetoric, we could come to support paperclip maximization just as well.
Human enhancement is one approach.
I like this idea, combined with AI self-limitation. Suppose an (aligned) AI had to limit its own growth so that its capabilities always remained below those of enhanced humans? This would allow for a slow, safe, and controllable takeoff.
Is this a good strategy for alignment? What if, instead of trying to tame an inherently dangerous fast-taking-off AI, we made it more controllable by making it self-limiting, with built-in “capability brakes”?
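The “capability brake” idea can be sketched as a simple invariant: the AI accepts a self-improvement step only if its projected capability stays below a measured benchmark of enhanced-human capability. This is a toy illustration only; the function names, the scalar notion of “capability,” and the margin value are all made-up assumptions, not a real mechanism.

```python
# Toy sketch of a "capability brake" (illustrative assumptions throughout):
# capability is pretended to be a single scalar, and the brake allows growth
# only while AI capability stays under a fixed fraction of the
# enhanced-human benchmark.

def brake_allows(ai_capability: float,
                 human_capability: float,
                 margin: float = 0.8) -> bool:
    """True while AI capability is strictly below margin * human capability."""
    return ai_capability < margin * human_capability

def attempt_self_improvement(ai_capability: float,
                             gain: float,
                             human_capability: float) -> float:
    """Accept an improvement step only if the brake permits the result."""
    projected = ai_capability + gain
    if brake_allows(projected, human_capability):
        return projected      # improvement accepted
    return ai_capability      # brake engaged: the step is refused

# As enhanced humans improve, the cap rises and the AI may grow again,
# which is what makes the takeoff slow and (in principle) controllable.
cap = attempt_self_improvement(ai_capability=50, gain=20, human_capability=100)
print(cap)  # 70: below 0.8 * 100, so the step is accepted
cap = attempt_self_improvement(cap, gain=20, human_capability=100)
print(cap)  # 70: 90 would exceed the 80-point brake, so the step is refused
```

The design choice worth noting is that the brake is tied to a moving human benchmark rather than a fixed cap, so the constraint loosens only as (enhanced) humans themselves advance.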