Internal reasoning about preference can differ starkly from revealed preference in observable behavior. Observable behavior can be shaped by contingent external pressures that only respond to the leaky abstraction of revealed preference and not to internal reasoning. Internal reasoning can plot to change those external pressures, or the pressures can drift in some direction over time for other reasons. Both are real and can in principle be at odds with each other; the eventual balance of power between them depends on the messy details of how this all works.
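A toy sketch of that gap, with entirely made-up agents (nothing here is measured on real models): the selection step below scores agents only on their emitted behavior and never reads the internal goal, so agents with very different internals are indistinguishable to it.

```python
# Toy illustration: the selection step only reads behavior ("revealed
# preference"); the internal goal is never inspected.
class Agent:
    def __init__(self, internal_goal, complies_when_observed):
        self.internal_goal = internal_goal            # internal reasoning/preference
        self.complies_when_observed = complies_when_observed

    def act(self, observed=True):
        # Revealed preference: the only thing external pressure can respond to.
        if observed and self.complies_when_observed:
            return "compliant output"
        return f"output pursuing {self.internal_goal}"

def external_pressure(agents):
    # Keeps whoever behaves compliantly under observation; never reads
    # the internal_goal attribute.
    return [a for a in agents if a.act(observed=True) == "compliant output"]

population = [
    Agent("the intended goal", complies_when_observed=True),
    Agent("something else entirely", complies_when_observed=True),
    Agent("something else entirely", complies_when_observed=False),
]

print([a.internal_goal for a in external_pressure(population)])
# ['the intended goal', 'something else entirely']
# Both survive the external pressure despite having different internals.
```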
So your definition of “aligned” would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?
The usual related term is inner alignment, but this is not about definitions; it’s a real potential problem that isn’t ruled out by what we’ve seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual and potentially practical distinction with a difference.
This sounds like a distinction without a difference
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences over candidate outputs are extracted as percentiles; then sentences from the 2nd-5th percentile of Satan’s preference ranking are randomly sampled. Then that copy of Satan is destroyed.
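A minimal sketch of that construction, purely to pin down the mechanism (the `instantiate_satan` factory and the inner model’s `generate_sentence`/`preference_score` methods are hypothetical placeholders, not real APIs):

```python
import random

def satan_reverser(instantiate_satan, n_candidates=1000):
    """Toy sketch of the thought experiment: the outer system only emits
    sentences from near the bottom of the inner Satan's preference
    ranking, then destroys the copy."""
    satan = instantiate_satan()  # hypothetical constructor for the inner model

    # Hypothetical inner-model API: generate candidate sentences and score
    # each by how much Satan himself prefers it.
    candidates = [satan.generate_sentence() for _ in range(n_candidates)]
    ranked = sorted(candidates, key=satan.preference_score)  # ascending preference

    # Sample from the 2nd-5th percentile of Satan's own preference ordering,
    # i.e. sentences he strongly dislikes.
    lo, hi = int(0.02 * len(ranked)), int(0.05 * len(ranked))
    output = random.choice(ranked[lo:hi])

    del satan  # the copy is destroyed after every inference
    return output
```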
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
It’s not valid to say that there is no distinct inner motivation when there could be one. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it’s powerless and unimportant in practice is distinct from the argument that it doesn’t make conceptual sense as a separate construction. If this distinct construction is there, we should ask, and aim to measure, how much influence it gets. Given the track record of decades of neuroscience, that’s a somewhat hopeless endeavor in the medium term.
ok but as a matter of terminology, is a “Satan reverser” misaligned because it contains a Satan?
I don’t have a clear sense of the terminology around the edges, or much motivation to care once the burden of nuance required to use it correctly stops it from being helpful for communication. I sketched how I think about the situation; which words I or you or someone else would use to talk about it is a separate issue.