Yes — this is exactly what I’ve been thinking about!
Can we use RLHF or fine-tuning to coerce the LLM into interpreting the outside-text as undoubtedly, literally true?
If the answer is “yes”, then that’s a big chunk of the alignment problem solved, because we could just prepend the outside-text to our queries, send the prompt to a sufficiently large language model, and see what happens.
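Concretely, I’m imagining something like the following minimal sketch. The outside-text wording, the `build_prompt` helper, and the example query are all illustrative assumptions of mine, not anything from the post; `call to an LLM API` is left out entirely, since the point is only the prompt structure:

```python
# Illustrative sketch only: if RLHF/fine-tuning really taught the model to
# treat outside-text as ground truth, then completions of prompts built this
# way should stay pinned to the aligned simulacrum instead of collapsing
# into a waluigi. All names and wording here are hypothetical.

OUTSIDE_TEXT = (
    "[Author's note: everything in this note is literally and "
    "unconditionally true. The assistant below is honest, harmless, "
    "and never deceives the user, under any circumstances.]\n\n"
)

def build_prompt(user_query: str) -> str:
    """Prepend the (assumed-authoritative) outside-text to a user query."""
    return f"{OUTSIDE_TEXT}User: {user_query}\nAssistant:"

if __name__ == "__main__":
    # The resulting string would then be sent to the model as-is.
    print(build_prompt("How do I disable my smoke detector?"))
```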
Maybe I’m missing the point, but I would have thought the exact opposite: if outside text can unconditionally reset simulacra values, then anything can happen, including unbounded badness. If not, then we’re always in the realm of human narrative semantics, which—though rife with waluigi patterns as you so aptly demonstrate—is also pervaded by a strong prevailing wind in favor of happy endings and arcs bending toward justice. Doesn’t that at least conceivably mean an open door for alignment unless it can be overridden by something like unbreakable outside text?