Gerald Monroe comments on The Waluigi Effect (mega-post)

Gerald Monroe 3 Mar 2023 18:12 UTC
4 points
One interesting thing: if an instance of the model can coherently act in opposition to the stated “ideals” of a character, doesn’t this mean that the same model can “introspect” on whether a given piece of text was emitted by the “positive” or the “negative” character?

This particular issue, because the effect is so strong and the model so clearly picks one pole of the outcome, seems detectable and preventable. It is hardly a large-scale alignment issue, because it is so overt.