I chose the “train a shoulder advisor” framing specifically to keep my/Eliezer’s models separate from the participants’ own models. And I do think this worked pretty well—I’ve had multiple conversations with a participant where they say something, I disagree with it, and then they say “yup, that’s what my John model said”—implying that they did in fact disagree with their John model. (That’s not quite direct evidence of maintaining a separate ontology, but it’s adjacent.)