One interesting thing. If an instance of the model can coherently act in opposition to the stated “ideals” of a character, doesn’t this mean that the same model can “introspection” to whether a given piece of text is emitted by the “positive” or “negative” character?
This particular issue, because it is so strong and choosing a outcome pole, seems detectable and preventable. Hardly a large scale alignment issues because it is so overt.
One interesting thing. If an instance of the model can coherently act in opposition to the stated “ideals” of a character, doesn’t this mean that the same model can “introspection” to whether a given piece of text is emitted by the “positive” or “negative” character?
This particular issue, because it is so strong and choosing a outcome pole, seems detectable and preventable. Hardly a large scale alignment issues because it is so overt.