I don’t think this is the hope; see here for more. I think the hope is that the base unit is ‘cheap’ and ‘honest’.
Hmm, cheap and honest makes some sense, but I’m surprised to hear that the hope is not that the base unit is aligned, because that seems to clash with how I’ve seen this discussed before. For example, from Ajeya’s post summarizing IDA:
The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?
Which suggests to me that the base units (which narrowly replicate human behavior) are expected to be aligned.
More:
Moreover, because in each of its individual decisions each copy of A[0] continues to act just as a human personal assistant would act, we can hope that Amplify(H, A[0]) preserves alignment.
...
Because we assumed Amplify(H, A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors.
Which comes out and explicitly says that we want the amplify step to preserve alignment. (Which only makes sense if the agent at the previous step was aligned.)
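To make the structure in those quotes concrete, here is a minimal Python sketch of the Amplify/Distill loop as I understand it from Ajeya’s summary. The `Agent` type and the `amplify`, `distill`, and `ida` functions are hypothetical placeholders of my own, not anything from an actual IDA codebase: the amplification here just delegates the whole question to one copy of the agent and lets the human finish, and the distillation is a trivial wrapper, whereas real IDA would use question decomposition and imitation learning.

```python
from typing import Callable

# A question-answering agent: maps a question (str) to an answer (str).
Agent = Callable[[str], str]


def amplify(human: Agent, agent: Agent) -> Agent:
    """Amplification step: the human answers a question with help from a copy
    of the current agent. (Real amplification would decompose the question
    into many subquestions; this placeholder delegates the question once.)"""
    def amplified(question: str) -> str:
        suggestion = agent(question)  # consult a copy of A[i]
        return human(f"{question} (assistant suggests: {suggestion})")
    return amplified


def distill(amplified_system: Agent) -> Agent:
    """Distillation step: produce a fast standalone agent imitating the
    amplified system. Real IDA would train this with 'narrow' imitation
    techniques; this placeholder simply wraps the amplified system."""
    return lambda question: amplified_system(question)


def ida(human: Agent, base_agent: Agent, iterations: int) -> Agent:
    """Iterate A[i+1] = Distill(Amplify(H, A[i])), starting from A[0]."""
    agent = base_agent
    for _ in range(iterations):
        agent = distill(amplify(human, agent))
    return agent


# Toy usage: the "human" and "base agent" are trivial echo functions.
if __name__ == "__main__":
    human = lambda q: f"H's answer to: {q}"
    a0 = lambda q: f"A0's answer to: {q}"
    a2 = ida(human, a0, iterations=2)
    print(a2("What should I work on today?"))
```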
Is it possible that this is just a terminological issue, where aligned is actually being used to mean what you would call honest (and not whatever Vaniver_2018 thought aligned meant)?
Some evidence in favor of this, from Andreas’s Factored Cognition post:
As many people have pointed out, it could be difficult to become confident that a system produced through this sort of process is aligned—that is, that all its cognitive work is actually directed towards solving the tasks it is intended to help with.
That definition of alignment seems to be pretty much the same thing as your honesty criterion:
Now it seems that the real goal is closer to an ‘honesty criterion’; if you ask a question, all the computation in that unit will be devoted to answering the question, and all messages between units are passed where the operator can see them, in plain English.
If so, then I’m curious what the difference is. What did Vaniver_2018 think that being aligned meant, and how is that different from just being honest?
If so, then I’m curious what the difference is. What did Vaniver_2018 think that being aligned meant, and how is that different from just being honest?
Vaniver_2018 thought ‘aligned’ meant something closer to “I was glad I ran the program” than to “the program did what I told it to do” or “the program wasn’t deliberately out to get me.”