I find your point 1 very interesting but point 2 may be based in part on a misunderstanding.
To expand on the last point: if A[*], the limiting agent, is aligned with H, then it must contain, at least implicitly, some representation of H’s values (retrievable through inverse reinforcement learning, for example). And so must A[i] for every i.
I think this is not how Paul hopes his scheme would work. If you read https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims, it’s clear that in the LBO (low-bandwidth oversight) variant of IDA, A[1] can’t possibly learn H’s values. Instead, A[1] is supposed to learn “corrigibility” from H. Then, after enough amplifications, A[n] will gain the ability to learn values from some external user (who may or may not be H), and the “corrigibility” that was learned and preserved through the IDA process is supposed to make it want to help the user achieve their values.
I won’t deny probably misunderstanding parts of IDA, but if the point is to learn corrigibility from H, couldn’t you just say that corrigibility is a value that H has? Then use the same argument with “corrigibility” in place of “values”? (This assumes that corrigibility is entirely defined with reference to H. If not, replace it with the subset of corrigibility that is defined entirely from H; if that subset is empty, remove H from the argument.)
If A[*] has H-derived-corrigibility, then so must A[1], so distillation must preserve H-derived-corrigibility. In that case, we could instead directly distill H-derived-corrigibility from H, use it to train a powerful agent with that property, and then have that agent learn values from some other user.
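The inductive structure of this argument can be made concrete with a toy sketch. Everything here is a hypothetical stand-in (the `amplify`/`distill` functions and the agent representation are mine, not Paul’s actual proposal): the point is only that if distillation preserves a property and amplification preserves it, then every A[i] has it, including A[1].

```python
# Toy model of the IDA loop: an "agent" is just a capability level
# plus a flag for whether it has H-derived corrigibility.
# (Hypothetical illustration, not a real training procedure.)

def amplify(agent):
    # Amplification combines copies of the agent to get more capability;
    # in this sketch it is assumed to preserve corrigibility.
    return {"capability": agent["capability"] + 1,
            "corrigible": agent["corrigible"]}

def distill(agent):
    # Distillation trains a faster agent to imitate a slower one;
    # by assumption it preserves corrigibility too.
    return {"capability": agent["capability"],
            "corrigible": agent["corrigible"]}

# A[1] is distilled directly from H, who is corrigible by definition.
H = {"capability": 0, "corrigible": True}
agent = distill(H)

# Iterate: A[i+1] = distill(amplify(A[i])).
for _ in range(5):
    agent = distill(amplify(agent))

# By induction, every A[i] (including A[1]) carries the property
# that A[*] is supposed to end up with.
assert agent["corrigible"]
```

The argument in the text is the converse reading of this loop: since the inductive step forces A[1] to already carry H-derived corrigibility, the amplification iterations look unnecessary for obtaining that property.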
so we could instead directly distill H-derived-corrigibility from H which can be used to directly train a powerful agent with that property
I’m imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible.
If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.