This comment seems to rest on a dubious assumption. I think you’re saying:
The model has a distribution over a set of behaviors that includes “behave like luigi” and “behave like waluigi”. If there’s nonzero prior probability on “behave like luigi”, then in the limit of luigi-like steps, the posterior of “behave like luigi” goes to 1.
The first sentence is dubious though. Why would the LLM’s behavior come from a distribution over a space that includes “behave like luigi (forever)”? My question is informal, because maybe you can translate between distributions over [behaviors for all time] and [behaviors as functions from a history to a next action]. But these two representations seem to suggest different “natural” kinds of distributions. (In particular, a condition like non-dogmatism—not assigning probability 0 to anything in the space—might not be preserved by the translation.)
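To make the claim I'm paraphrasing concrete, here is a minimal sketch of the two-hypothesis update it seems to have in mind. Everything in it is my own illustrative assumption, not something from the original post: the function name `posterior_luigi`, the restriction to exactly two hypotheses, and the per-step likelihoods (luigi always emits a luigi-like step, waluigi does so with probability 0.9). It also builds in the very assumption I'm questioning, namely that "behave like luigi (forever)" is a point in the hypothesis space at all.

```python
def posterior_luigi(prior_luigi, n_steps, p_luigi_step=1.0, p_waluigi_step=0.9):
    """Posterior on 'luigi forever' after observing n_steps luigi-like steps.

    Assumes (hypothetically) a two-point hypothesis space and i.i.d. steps:
    each hypothesis emits a luigi-like step with a fixed probability.
    """
    # Unnormalized posterior mass: prior times likelihood of the observations.
    luigi_mass = prior_luigi * p_luigi_step ** n_steps
    waluigi_mass = (1.0 - prior_luigi) * p_waluigi_step ** n_steps
    return luigi_mass / (luigi_mass + waluigi_mass)


if __name__ == "__main__":
    for n in (0, 10, 50, 200):
        print(n, posterior_luigi(0.01, n), posterior_luigi(0.0, n))
```

With a nonzero prior the posterior climbs toward 1 as luigi-like steps accumulate, while a prior of exactly 0 stays at 0 no matter how many steps are observed. That second case is why non-dogmatism matters here, and why it matters whether the condition survives translation between the two representations.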