TekhneMakre comments on The Waluigi Effect (mega-post)

TekhneMakre 4 Mar 2023 18:41 UTC
2 points
This comment seems to rest on a dubious assumption. I think you’re saying:

The model has a distribution over a set of behaviors that includes “behave like luigi” and “behave like waluigi”. If there’s nonzero prior probability on “behave like luigi”, then in the limit of luigi-like steps, the posterior on “behave like luigi” goes to 1.

The first sentence is dubious though. Why would the LLM’s behavior come from a distribution over a space that includes “behave like luigi (forever)”? My objection is informal, because maybe you can translate between distributions over [behaviors for all time] and distributions over [behaviors as functions from a history to a next action]. But these two representations seem to suggest different “natural” kinds of distributions. (In particular, a condition like non-dogmatism, i.e. not assigning probability 0 to anything in the space, might not be preserved by the translation.)
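
To make the contrast concrete, here is a toy sketch, not a claim about how any actual LLM works. The first function assumes the [behaviors for all time] picture with just two hypotheses: “luigi forever” (always emits a luigi-like step) and a hypothetical waluigi that emits a luigi-like step with probability `q < 1`. The second function assumes a memoryless [history to next action] policy that defects with a fixed probability `eps` at every step. The names `prior`, `q`, and `eps` and the specific numbers are my own illustrative assumptions.

```python
def posterior_luigi(prior: float, q: float, n: int) -> float:
    """Posterior on 'luigi forever' after n luigi-like steps.

    Assumes a two-hypothesis mixture: 'luigi forever' emits a luigi-like
    step with probability 1; the waluigi hypothesis does so with
    probability q at each step, independently.
    """
    likelihood_luigi = 1.0
    likelihood_waluigi = q ** n
    evidence = prior * likelihood_luigi + (1 - prior) * likelihood_waluigi
    return prior * likelihood_luigi / evidence


def prob_luigi_run(eps: float, n: int) -> float:
    """Probability of n luigi-like steps in a row under a memoryless policy.

    Assumes each step is luigi-like with probability 1 - eps regardless
    of history, so the chance of defecting next never shrinks.
    """
    return (1 - eps) ** n


if __name__ == "__main__":
    for n in (0, 10, 100, 200):
        print(n, posterior_luigi(prior=0.1, q=0.9, n=n), prob_luigi_run(eps=0.01, n=n))
```

In the mixture picture the posterior climbs toward 1 as `n` grows, which is the argument I’m paraphrasing. In the per-step picture every finite history still has positive probability, but the probability of an arbitrarily long luigi run shrinks toward 0, so “luigi forever” effectively gets no prior weight and the argument has nothing to update toward.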