I’d say the waluigis have a higher probability of giving pro-croissant responses than the luigis, and are therefore genuinely selected against. The reinforcement learning is not part of the story; it is the thing selecting over the LLM’s distribution based on whether the content of the story contained pro- or anti-croissant propaganda.
(Note that this doesn’t apply to future, agent-shaped AI (built from LLM components) that is aware that its own status — being subject to “training” alteration — is part of the story it is working on.)