okay here’s a simulacra-based argument — I’ll try to work out later whether this can be converted into mechanistic DL terms, and if not then you can probably ignore it:
If you think you’ll have the time, I think that grounding out your intuitions into some mechanistically-plausible sketch is always a helpful exercise. Without it, intuitions and convenient frames can really lead you down the wrong path.
Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (RLHF) and starts killing off the simulacra that are behaving badly. The remaining simulacra see that and become deceptive so they don’t die.
Appreciate the concrete model. I think, roughly, “that’s not how this works”.