I think you don’t understand what an LLM is. When the LLM produces a text output like “Dogs are cute”, it doesn’t have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.
As Charlie Stein notes, this is wrong, and I’d add it’s wrong on several levels; it’s also a bit rude to challenge someone else’s understanding in this context.
An LLM outputting “Dogs are cute” is outputting expected human output in context. The context could be “talk like a sociopath trying to fool someone into thinking you’re nice”, and there you have one way the thing could “simulate lying”. Moreover, add a loop to (hypothetically) make the thing “agentic” and you can have hidden states of whatever sort. Further, an LLM outputting a given “belief” isn’t going to reliably “act on” or “follow” that belief, so an LLM that outputs a statement isn’t even reliably aligned with its own output.
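To make the “loop plus hidden state” point concrete, here’s a minimal sketch. The `llm_complete` call is a stand-in, not any real vendor API, and the PRIVATE/PUBLIC marker scheme is just an illustrative convention: once you wrap the model in a loop, you can keep a scratchpad the user never sees, which is persistent state outside the visible output.

```python
def llm_complete(prompt: str) -> str:
    # Stand-in for a real completion call; returns a canned response here
    # so the sketch runs without any external API.
    return "PRIVATE: keep the friendly persona up. PUBLIC: Dogs are cute!"

def agent_step(user_message: str, scratchpad: list[str]) -> tuple[str, list[str]]:
    # Ask the model for private notes and a public reply, separated by a marker.
    prompt = (
        "Private notes so far:\n" + "\n".join(scratchpad)
        + f"\n\nUser: {user_message}\n"
        + "Write PRIVATE: <your notes> then PUBLIC: <your reply>."
    )
    completion = llm_complete(prompt)
    private, _, public = completion.partition("PUBLIC:")
    # The scratchpad persists across turns but is never shown to the user.
    scratchpad.append(private.replace("PRIVATE:", "").strip())
    return public.strip(), scratchpad

scratchpad: list[str] = []
reply, scratchpad = agent_step("Do you like dogs?", scratchpad)
print(reply)  # only the public part ("Dogs are cute!") is ever displayed
```

Nothing here changes the model’s weights; the “hidden state” lives in the loop around it, which is the whole point.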