Suppose that you gave it a bunch of labeled data about what counts as “good” and “bad”.
If your alignment strategy strongly depends on teaching the AGI ethics via labeled training data, you’ve already lost.
And if your alignment strategy strongly depends on creating innumerable copies of a UFAI and banking on the anthropic principle to save you, then you’ve already lost spectacularly.
If you can’t point to specific submodules within the AGI and say, “Here is where it uses this particular version of predictive coding to model human needs/values/goals,” and, “Here is where it represents its own needs/values/goals,” and, “Here is where its internal representation of needs/values/goals drives its planning and behavior,” and, “Here is where it routes its model of human values to the internal representation of its own values in a way that will automatically make it more aligned the more it learns about humanity,” then you have already lost (but only probably).
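To make concrete what “pointing to specific submodules” could even mean, here is a purely illustrative toy sketch in Python. None of this is a real AGI design, and every class, attribute, and method name is hypothetical; the point is only to show an architecture in which the human-value model, the agent’s own values, the planner, and the routing between them are separate, inspectable pieces rather than entangled weights.

```python
# Toy, purely illustrative sketch (all names hypothetical): an agent whose
# value-related components are explicit and inspectable.

from dataclasses import dataclass, field


@dataclass
class HumanValueModel:
    """'Here is where it models human needs/values/goals.'"""
    beliefs: dict = field(default_factory=dict)

    def update(self, observation: dict) -> None:
        # Learning more about humans refines this model.
        self.beliefs.update(observation)


@dataclass
class OwnValues:
    """'Here is where it represents its own needs/values/goals.'"""
    values: dict = field(default_factory=dict)


class Agent:
    def __init__(self) -> None:
        self.human_values = HumanValueModel()
        self.own_values = OwnValues()

    def route_values(self) -> None:
        # 'Here is where it routes its model of human values into its own
        # values' -- so learning more about humanity tightens alignment.
        self.own_values.values = dict(self.human_values.beliefs)

    def plan(self, options: list[str]) -> str:
        # 'Here is where its internal values drive planning and behavior.'
        return max(options, key=lambda o: self.own_values.values.get(o, 0))


agent = Agent()
agent.human_values.update({"help": 1.0, "harm": -1.0})
agent.route_values()
print(agent.plan(["help", "harm"]))  # -> "help"
```

In a real system each of these boxes would be a learned component, not a dictionary, and the hard part is precisely verifying that the routing step does what the sketch stipulates by fiat.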
Basically, the only sure way to get robust alignment is to make the AGI highly non-alien. Or as you put it:
Those who can deal with devils, don’t need to, for they can simply summon angels instead.
Or rather: Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.