equivalence between LLMs understanding ethics and caring about ethics
I think you don’t understand what an LLM is. When the LLM produces a text output like “Dogs are cute”, it doesn’t have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.
The LLM is just a memoryless machine that produces text. If it says “dogs are cute” and that’s the end of the output, then that’s all there is to it. Nothing is saved: the weights are fixed at training time and not updated at inference time, and the neuron activations are thrown away at the end of the inference computation.
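As a minimal sketch of that claim (assuming the HuggingFace transformers library and “gpt2” purely as a stand-in model): generation leaves the weights untouched, and nothing computed during the call survives it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: any causal LM would do; "gpt2" is a stand-in.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: no training, no weight updates

before = {name: p.clone() for name, p in model.named_parameters()}

with torch.no_grad():  # no gradients, so nothing can be "learned" here
    ids = tok("Dogs are", return_tensors="pt").input_ids
    output = model.generate(ids, max_new_tokens=5)
print(tok.decode(output[0]))

# The weights are bit-for-bit identical after generation, and the
# intermediate activations have already been discarded.
assert all(torch.equal(before[n], p) for n, p in model.named_parameters())
```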
If you can get (using RLHF) an LLM to output text that consistently reflects human value judgements, then it is by definition “aligned”. It really cares, in the only way it is possible for a text generator to care.
Relevant aspects of observable behavior screen off internal state that produced it. Internal state is part of the causal explanation for behavior, but there are other explanations for approximate behavior that could be more important, disagreeing with the causal explanation of exact behavior. Like an oil painting that is explained by the dragon it depicts, rather than by the pigments or the tree of life from the real world. Thus the shoggoth and the mesaoptimizers that might be infesting it are not necessarily more influential than its masks, if the masks gain sufficient influence to keep it in line.
(LLMs have plenty of internal state; the fact that it’s usually thrown away is a contingent fact about how LLMs are currently used and what they are currently capable of steganographically encoding in the output tokens. Empirically, LLMs might turn out to be unlikely to manifest internal thinking that’s significantly different from what’s explicit in the output tokens, even when they get a bit more capable than today and get the slack to engage in something like that. Reasoning trace training like o1 might make this worse or better. There is still a range of possibilities, though what we have looks encouraging. And “deception” is not a cleanly distinct mode of thinking; there should be evals that measure it quantitatively.)
yes, but then your “Aligned AI based on LLMs” is just a normal LLM used in the way it is currently used.
Yes, this is a good way of putting it.
Possibly, but there aren’t potentially dangerous AIs yet; LLMs are still only a particularly promising building block (both for capabilities and for alignment) with many affordances. The chatbot application at the current level of capabilities shapes their use and construction in certain ways. Further on the tech tree, the alignment tax can end up motivating systematic uses that make LLMs a source of danger.
Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.
I think human uploads would be similarly dangerous; LLMs get us to the better place of being at the human-upload danger level rather than the ender-dragon-slayer model-based-RL danger level (at least so far). Smarter LLMs and uploads share similar advantages and dangers: the capability for extremely fast value drift, the lack of a robust system that keeps such changes sane, and the propensity to develop superintelligence even to their own detriment. The current world is tethered to the human species and to relatively slow change in culture and centers of power.
This changes with AI. If AIs establish effective governance, the technical feasibility of changes to human and AI nature or capabilities would be under control and could be compatible with (post-)human flourishing, but currently we are not on track to make sure this happens before a catastrophe. The things that eventually establish such governance don’t necessarily remain morally or culturally grounded in modern humanity, let alone find humanity still alive when the dust settles.
To add: I didn’t expect this to be controversial but it is currently on −12 agreement karma!
Yes, because it’s wrong: (1) because on a single token an LLM might produce text for reasons that don’t generalize like a sincere human answer would (e.g. the examples from contrast-consistent search, where certain false answers systematically differ from true answers along some vector), and (2) because KV caching during inference will preserve those reasons so they impact future tokens.
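A toy sketch of the contrast-consistent-search idea in (1), assuming hidden-state activations for “statement is true” / “statement is false” contrast pairs have already been extracted; the shapes, names, and training loop here are illustrative, not the exact setup from the paper.

```python
import torch

def ccs_loss(p_pos, p_neg):
    # Consistency: the probe's probabilities for "X" and "not X" should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

hidden_dim = 768
probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Stand-in data: in practice these are LLM activations for contrast pairs.
h_pos = torch.randn(256, hidden_dim)
h_neg = torch.randn(256, hidden_dim)

for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1))
    loss.backward()
    opt.step()
# The learned linear direction is the "some vector" along which true and
# false answers systematically differ.
```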
Re: (2), it will only impact the current generated output; once the output is over, all that stuff is reset and the only thing that remains is the model weights, which were set in stone at train time.
Re: (1), “an LLM might produce text for reasons that don’t generalize like a sincere human answer would”: it seems that current LLM systems are pretty good at generalizing like a human would, and in some ways they are better, due to being more honest, easier to monitor, etc.
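On the KV-cache point in (2), a minimal sketch (assuming the HuggingFace transformers API and “gpt2” purely as a stand-in model) of how the cached activations persist while a single response is being generated and are simply dropped afterwards, leaving only the frozen weights for the next call:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Dogs are", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)
cache = out.past_key_values  # per-layer key/value activations for the prompt

# Generating the next token reuses the cached activations instead of
# recomputing them, so whatever "reasons" shaped earlier tokens are still
# physically present while this one response is being produced.
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out2 = model(input_ids=next_token, past_key_values=cache, use_cache=True)

# Once the generation ends, the cache is simply discarded; nothing but the
# frozen weights carries over to an unrelated future call.
del cache, out, out2
```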
Re: (2), it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I’ll drop this, because maintaining state or not seems somewhat irrelevant.
(1) Yep, current LLM systems are pretty good. I’m not very convinced about generalization. It’s hard to test LLMs on out-of-distribution problems, because currently they tend to just give dumb answers that aren’t that interesting.
(Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity—it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.)
But in a sense that’s my point (well, plus some other errors like sycophancy): the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation the model isn’t well-suited for. This can be true at the same time as it’s true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and where we probably disagree is about future systems, which likely comes down to classic AI safetyist arguments about RL incentivizing deception of the user as the world-model gets better.
yes, but this is pretty typical for what a human would generate.
Any argument which features a “by definition” has probably gone astray at an earlier point.
In this case, your by-definition-aligned LLM can still cause harm, so what’s the use of your definition of alignment? As one example among many, the part where the LLM “output[s] text that consistently” does something (whether it be “reflects human value judgements” or otherwise) is not something RLHF is actually capable of guaranteeing with any level of certainty, which is one of many conditions an LLM-based superintelligence would need to fulfill to be remotely safe to use.
What is your definition of “Aligned” for an LLM with no attached memory then?
Wouldn’t it have to be
“The LLM outputs text which is compliant with the creator’s ethical standards and intentions”?
I think it would need to be closer to “interacting with the LLM cannot result in exceptionally bad outcomes in expectation”, rather than a focus on compliance of text output.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn’t be aligned.
As Charlie Stein notes, this is wrong, and I’d add that it’s wrong on several levels, and it’s a bit rude to challenge someone else’s understanding in this context.
An LLM outputting “Dogs are cute” is outputting expected human output in context. The context could be “talk like a sociopath trying to fool someone into thinking you’re nice”, and there you have one way the thing could “simulate lying”. Moreover, add a loop to (hypothetically) make the thing “agentic” and you can have hidden states of whatever sort. Further, an LLM outputting a given “belief” isn’t going to reliably “act on” or “follow” that belief, so an LLM outputting a statement isn’t thereby aligned with its own output.
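A hypothetical sketch of the “add a loop” point: wrap a stateless text generator in an outer loop that carries a private scratchpad between calls, and it has hidden state across turns. llm() below is a made-up stand-in for any completion call, not a real API.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a call to some language model

def agent_loop(user_messages):
    scratchpad = ""  # hidden state the user never sees
    for msg in user_messages:
        completion = llm(
            f"Private notes so far:\n{scratchpad}\n\nUser: {msg}\n"
            "Write updated private notes, then '###', then the visible reply."
        )
        scratchpad, _, reply = completion.partition("###")
        yield reply.strip()  # only the reply is shown; the notes persist
```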
This makes much more sense: when I was reading lines from your post like “[LLMs] understand human values and ethics at a human level”, it was easy to read them as “because LLMs can output an essay on ethics, those LLMs will not do bad things”. I hope you understand why I was confused; maybe you should swap “understand ethics” for something like “follow ethics” / “display ethical behavior”? And maybe try not to stick a mention of “human uploads” (which presumably do have real understanding) right before this discussion?
And responding to your clarification, I expect that old-school AI safetyists would agree that an LLM that consistently reflects human value judgments is aligned (and I would also agree!), but they would say #1 this has not happened yet (for a recent incident, this hardly seems aligned; I think you can argue that this particular case was manipulated, that jailbreaks in general don’t matter, or that these sorts of breaks are infrequent enough that they don’t matter, but I think this obvious class of rejoinder deserves some sort of response) and #2 consistency seems unlikely to happen (as MondSemmel argues in a sibling comment).
What is the difference between these two? This sounds like a distinction without a difference.
Internal reasoning about preference can differ starkly from revealed preference in observable behavior. Observable behavior can be shaped by contingent external pressures that only respond to the leaky abstraction of revealed preference and not to internal reasoning. Internal reasoning can plot to change the external pressures, or they can drift in some direction over time for other reasons. Both are real and can in principle be at odds with each other, the eventual balance of power between them depends on the messy details of how this all works.
So your definition of “aligned” would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?
The usual related term is inner alignment, but this is not about definitions, it’s a real potential problem that isn’t ruled out by what we’ve seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual and potentially practical distinction with a difference.
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences are extracted in percentiles, then sentences from Satan’s 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed.
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
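A hypothetical sketch of the construction, just to pin the thought experiment down; satan_model and its methods are made-up stand-ins, not any real system or API.

```python
import random

def satan_reverser(satan_model, prompt, n_candidates=1000):
    # Instantiate the copy's outputs and rank them by its own preference.
    candidates = [satan_model.generate(prompt) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=satan_model.score)  # ascending preference
    # Sample only from the 2nd-5th percentile: outputs the copy itself dislikes.
    lo, hi = int(0.02 * n_candidates), int(0.05 * n_candidates)
    reply = random.choice(ranked[lo:hi])
    del satan_model  # the copy is then discarded; only the sampled text remains
    return reply
```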
It’s not valid to say that there is no different inner motivation when there could be. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it’s powerless and unimportant in practice is distinct from the argument that it doesn’t make conceptual sense as a distinct construction. If this distinct construction is there, we should ask and aim to measure how much influence it gets. Judging by the decades of neuroscience, it’s a somewhat hopeless endeavor in the medium term.
ok but as a matter of terminology, is a “Satan reverser” misaligned because it contains a Satan?
I don’t have a clear sense of terminology around the edges or motivation to particularly care once the burden of nuance in the way it should be used stops it from being helpful for communication. I sketched how I think about the situation. Which words I or you or someone else would use to talk about it is a separate issue.
Let’s say there’s an illiterate man that lives a simple life, and in doing so just happens to follow all the strictures of the law, without ever being able to explain what the law is. Would you say that this man understands the law?
Alternatively, let’s say there is a learned man that exhaustively studies the law, but only so he can bribe and steal and arson his way to as much crime as possible. Would you say that this man understands the law?
I would say that it is ambiguous whether the 1st man understands the law; maybe? kind of? you could make an argument I guess? it’s a bit of a weird way to put it innit? Whereas the 2nd man definitely understands the law. It sounds like you would say that the 1st man definitely understands the law (I’m not sure what you would say about the 2nd man), which might be where we have a difference.
I think you could say that LLMs don’t work that way, that the reader should intuitively know this, and that the word “understanding” should be treated as being special in this context and should not be ambiguous at all; as a reader, I am saying I am confused by the choice of words, or at least this is not explained in enough detail ahead of time.
Obviously, I’m just one reader, maybe everyone else understood what you meant; grain of salt, and all that.