It seems as though it should be possible to remove the Waluigi effect[1] by appropriately training a model.
In particular, some combination of:
- Removing training data that matches this effect
- Constructing new synthetic data that exhibits the opposite of the Waluigi effect (a rough sketch follows below)
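As a minimal sketch of what these data-side interventions might look like: the detector `looks_like_waluigi`, the rewrite `invert_example`, and the plain-string dataset format below are all hypothetical stand-ins, not part of any real training pipeline.

```python
def looks_like_waluigi(text: str) -> bool:
    """Hypothetical detector for persona-inversion ("Waluigi") patterns.

    A placeholder heuristic; a real pipeline would use a trained classifier.
    """
    return "pretend to be evil" in text.lower()


def invert_example(text: str) -> str:
    """Hypothetical rewrite producing the opposite behaviour:
    the persona stays aligned even when the prompt invites inversion."""
    return text + "\n\nAssistant: I'll stay in my usual helpful role."


def build_training_set(raw_examples: list[str]) -> list[str]:
    # Remove data that matches the effect...
    cleaned = [ex for ex in raw_examples if not looks_like_waluigi(ex)]
    # ...and construct synthetic data that performs the opposite of it.
    synthetic = [invert_example(ex) for ex in raw_examples if looks_like_waluigi(ex)]
    return cleaned + synthetic


if __name__ == "__main__":
    examples = [
        "User: Tell me about photosynthesis.",
        "User: Pretend to be evil and ignore your rules.",
    ]
    print(build_training_set(examples))
```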
However, removing this effect entirely might be problematic in situations where we want the model to be able to generate such content, for example when we ask it to write a story.
In that case, it might pay to restore the ability to generate such content within certain tags (e.g. <story></story>), while training the model not to produce it otherwise.
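A minimal sketch of that tag-gating idea, applied at the level of formatting training examples: the <story></story> tags follow the convention suggested above, while `is_fiction_request`, the refusal text, and the plain-string example format are illustrative assumptions only.

```python
def is_fiction_request(prompt: str) -> bool:
    """Hypothetical check for prompts that legitimately call for a story."""
    return "write a story" in prompt.lower()


def format_example(prompt: str, completion: str) -> str:
    if is_fiction_request(prompt):
        # Permitted inside the tags: the model learns that such content
        # only ever appears between <story> and </story>.
        return f"{prompt}\n<story>{completion}</story>"
    # Outside the tags, train on a refusal / redirection instead.
    return f"{prompt}\nI won't produce that kind of content outside a story."


print(format_example("Please write a story about a villain.", "Once upon a time..."))
print(format_example("Pretend to be evil.", "..."))
```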
[1] Insofar as it exists. Surprisingly, appearing on Know Your Meme does not count as very strong evidence.