This makes much more sense: when I read lines in your post like “[LLMs] understand human values and ethics at a human level”, they are easy to read as “because LLMs can output an essay on ethics, those LLMs will not do bad things”. I hope you understand why I was confused; maybe you should swap “understand ethics” for something like “follow ethics”/“display ethical behavior”? And maybe try not to stick a mention of “human uploads” (which presumably do have real understanding) right before this discussion?
And responding to your clarification, I expect that old school AI safetyists would agree that an LLM that consistently reflects human value judgments is aligned (and I would also agree!), but they would say that #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don’t matter, or that these sorts of breaks are infrequent enough that they don’t matter, but I think this obvious class of rejoinder deserves some sort of response), and #2 consistency seems unlikely to happen (as MondSemmel argues in a sibling comment).
What is the difference between these two? This sounds like a distinction without a difference.
Internal reasoning about preference can differ starkly from revealed preference in observable behavior. Observable behavior can be shaped by contingent external pressures that only respond to the leaky abstraction of revealed preference and not to internal reasoning. Internal reasoning can plot to change the external pressures, or they can drift in some direction over time for other reasons. Both are real and can in principle be at odds with each other; the eventual balance of power between them depends on the messy details of how this all works.
So your definition of “aligned” would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?
The usual related term is inner alignment, but this is not about definitions; it’s a real potential problem that isn’t ruled out by what we’ve seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual and potentially practical distinction with a difference.
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences are extracted in percentiles, then sentences from Satan’s 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed.
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
It’s not valid to say that there is no different inner motivation when there could be. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it’s powerless and unimportant in practice is distinct from the argument that it doesn’t make conceptual sense as a distinct construction. If this distinct construction is there, we should ask and aim to measure how much influence it gets. Given the decades of neuroscience, it’s a somewhat hopeless endeavor in the medium term.
OK, but as a matter of terminology, is a “Satan Reverser” misaligned because it contains a Satan?
I don’t have a clear sense of the terminology around the edges, or much motivation to care once the burden of nuance in how a term should be used stops it from being helpful for communication. I sketched how I think about the situation. Which words I or you or someone else would use to talk about it is a separate issue.
Let’s say there’s an illiterate man who lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law?
Alternatively, let’s say there is a learned man that exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law?
I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it’s a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I’m not sure what you would say about the 2nd man), which might be where we have a difference.
I think you could say that LLMs don’t work that way, that the reader should intuitively know this, and that the word “understanding” should be treated as being special in this context and should not be ambiguous at all. As a reader, I am saying that I am confused by the choice of words, or at least that this is not explained in enough detail ahead of time.
Obviously, I’m just one reader, maybe everyone else understood what you meant; grain of salt, and all that.