This doesn’t seem like it holds up: a CIRL agent’s deference derives from its uncertainty about the human’s reward function, so once it had learned a lot from you it would probably stop treating you as a source of new information, at which point it would stop being deferential.
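To make that dynamic concrete, here’s a toy Bayesian sketch (not actual CIRL; the setup, variable names, and numbers are all made up for illustration): the agent defers (queries the human) while its posterior over the human’s reward parameter is wide, and stops deferring once the value of information from another query drops below the query’s cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (not real CIRL): the agent holds a Gaussian posterior over a
# scalar "human reward parameter" theta. Deferring means querying the human
# and observing theta plus Gaussian noise; acting means exploiting the
# current estimate. The agent defers only while the expected value of
# information (proxied here by posterior variance) exceeds the query cost.
theta_true = 1.3    # the human's actual preference (hidden from the agent)
obs_noise = 0.5     # std dev of the human's feedback signal
query_cost = 0.05   # cost of deferring instead of acting

mu, var = 0.0, 4.0  # Gaussian prior over theta
for step in range(30):
    voi = var  # crude value-of-information proxy: posterior variance
    if voi > query_cost:
        # Defer: ask the human, then do a conjugate Gaussian update.
        y = theta_true + rng.normal(0.0, obs_noise)
        precision = 1.0 / var + 1.0 / obs_noise**2
        mu = (mu / var + y / obs_noise**2) / precision
        var = 1.0 / precision
        print(f"step {step:2d}: defer (posterior var={var:.4f})")
    else:
        # Posterior is concentrated; stop deferring and just act.
        print(f"step {step:2d}: act on estimate mu={mu:.3f}, no longer deferential")
        break
```

In this toy model the deference is purely instrumental, which is exactly why it evaporates once the uncertainty does.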
But maybe continuing to be deferential (in many or most situations) would itself be part of the utility function it converges towards? I’m not saying this refutes your point, but it is a consideration.
(For what it’s worth, I don’t have much of an opinion on whether CIRL is worth studying, and I know very little about it. I do think, though, that one alignment methodology need not be the “enemy” of another, partly because we might want AGI systems whose sub-systems are themselves AGIs, built on different alignment methodologies, and then check whether the outputs of the different sub-systems converge, as in the sketch below.)
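As a very loose sketch of that “check whether sub-system outputs converge” idea (my own illustrative example with made-up names, not anything from the CIRL literature): run the same task past several independently aligned sub-systems and only accept an answer they largely agree on.

```python
from collections import Counter

def converged_output(subsystem_outputs, min_agreement=0.75):
    """Return the majority answer if enough independently-aligned
    sub-systems agree on it; otherwise return None (escalate)."""
    counts = Counter(subsystem_outputs)
    answer, n = counts.most_common(1)[0]
    if n / len(subsystem_outputs) >= min_agreement:
        return answer
    return None  # outputs diverge: flag for human review

# Hypothetical usage: three sub-systems built on different alignment
# methodologies answer the same question.
print(converged_output(["plan A", "plan A", "plan A"]))  # -> "plan A"
print(converged_output(["plan A", "plan B", "plan C"]))  # -> None
```

The interesting design question is what to do on divergence; here it simply gets flagged for a human, which conveniently ties back to deference.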