I think this quantum fields example is perhaps not all that forceful, because in your OP you state:
maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system
However, it sounds like you’re describing a system that represents humans using quantum fields as a routine matter, so fitting the translation into the system doesn’t sound like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum fields and hear what they would say about it. That sounds like it could be just as good as an em, as far as alignment is concerned.
It’s hard for me to imagine a world where developing representations that allow you to make good next-token predictions etc. doesn’t also produce representations that are somehow useful for alignment. I’d be interested to hear fleshed-out counterexamples.