It seems as though it should be possible to remove the Waluigi effect[1] by appropriately training a model.
In particular, some combination of:
- Removing training data that matches this effect
- Constructing new synthetic data that exhibits the opposite of the Waluigi effect (a rough sketch follows below)
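As a minimal sketch of what these data-side interventions might look like: the detector `looks_like_waluigi`, the rewrite `invert_example`, and the plain-string dataset format below are all hypothetical stand-ins, not part of any real training pipeline.

```python
def looks_like_waluigi(text: str) -> bool:
    """Hypothetical detector for persona-inversion ("Waluigi") patterns.

    A placeholder heuristic; a real pipeline would use a trained classifier.
    """
    return "pretend to be evil" in text.lower()


def invert_example(text: str) -> str:
    """Hypothetical rewrite producing the opposite behaviour:
    the persona stays aligned even when the prompt invites inversion."""
    return text + "\n\nAssistant: I'll stay in my usual helpful role."


def build_training_set(raw_examples: list[str]) -> list[str]:
    # Remove data that matches the effect...
    cleaned = [ex for ex in raw_examples if not looks_like_waluigi(ex)]
    # ...and construct synthetic data that performs the opposite of it.
    synthetic = [invert_example(ex) for ex in raw_examples if looks_like_waluigi(ex)]
    return cleaned + synthetic


if __name__ == "__main__":
    examples = [
        "User: Tell me about photosynthesis.",
        "User: Pretend to be evil and ignore your rules.",
    ]
    print(build_training_set(examples))
```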
However, removing this effect entirely might be problematic in situations where we want the model to be able to generate such content, for example when we ask it to write a story.
In that case, it might pay to restore the ability to generate such content within certain tags (e.g. <story></story>), while training the model not to produce it otherwise.
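A minimal sketch of that tag-gating idea, applied at the level of formatting training examples: the <story></story> tags follow the convention suggested above, while `is_fiction_request`, the refusal text, and the plain-string example format are illustrative assumptions only.

```python
def is_fiction_request(prompt: str) -> bool:
    """Hypothetical check for prompts that legitimately call for a story."""
    return "write a story" in prompt.lower()


def format_example(prompt: str, completion: str) -> str:
    if is_fiction_request(prompt):
        # Permitted inside the tags: the model learns that such content
        # only ever appears between <story> and </story>.
        return f"{prompt}\n<story>{completion}</story>"
    # Outside the tags, train on a refusal / redirection instead.
    return f"{prompt}\nI won't produce that kind of content outside a story."


print(format_example("Please write a story about a villain.", "Once upon a time..."))
print(format_example("Pretend to be evil.", "..."))
```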
[1] Insofar as it exists. Surprisingly, appearing on Know Your Meme does not count as very strong evidence.