However, this trick won’t solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if it trusts that the story is “super-duper definitely 100% true and factual”. But why would the LLM trust that sentence?
There’s a fun connection to ELK here. Suppose you see this and decide: “ok, forget trying to describe in natural language that the story is definitely 100% true and factual. What if I just add a special token that I prepend to indicate ‘100% true and factual, for reals’? It’s guaranteed not to exist on the internet, because it’s a special token.”
Of course, by virtue of being hors-texte, the special token alone has no meaning (remember, we had to do this precisely to stop meaning from internet text accidentally transferring over). So we need to somehow explain to the model that this token means ‘100% true and factual, for reals’. One way to do this is to add the token in front of a bunch of training data that you know for sure is 100% true and factual. But can you trust this to generalize to more difficult facts (“<|specialtoken|>Will the following nanobot design kill everyone if implemented?”)? If ELK is hard, then the special token will not generalize (i.e. it will fail to elicit the direct translator), for all of the reasons described in ELK.
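For concreteness, the setup I have in mind is something like the minimal sketch below, assuming a HuggingFace-style causal LM (the model name, the toy statements, and the elided fine-tuning loop are all illustrative placeholders):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

TRUTH_TOKEN = "<|specialtoken|>"  # hypothetical token, guaranteed absent from pretraining text

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the new special token and grow the embedding matrix to make room for it.
tokenizer.add_special_tokens({"additional_special_tokens": [TRUTH_TOKEN]})
model.resize_token_embeddings(len(tokenizer))

# Prepend the token to statements we are confident are true, then fine-tune on
# this corpus so the model (hopefully) associates the token with veracity.
verified_true_statements = [
    "Water is composed of hydrogen and oxygen.",
    "The Earth orbits the Sun.",
]
training_texts = [f"{TRUTH_TOKEN} {s}" for s in verified_true_statements]

# ... standard causal-LM fine-tuning on `training_texts` goes here ...
# The open question is whether the learned association generalizes to prompts like
# f"{TRUTH_TOKEN} Will the following nanobot design kill everyone if implemented?"
```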
There is an advantage here in that you don’t need to pay for translation from an alien ontology: the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would still need to specify a simulacrum that is honest, which is pretty difficult and isomorphic to ELK in the fully general case of arbitrary simulacra. But the space is inherently trope-weighted, so simulating humans who are honest about their beliefs should be a lot easier (though plausibly still not easy in absolute terms), because humans are often honest; simulating honest superintelligent assistants, by contrast, should be nearly ELK-difficult, because there you don’t get the advantage of the prior’s specification doing a lot of the work for you.
Related, somewhat.
You don’t need to pay for translation to simulate human-level characters, because that’s just learning the human simulator. You do need to pay for translation to access superhuman behavior (which is the case ELK is focused on).
Yeah, but the reasons for the two seem slightly different: in the case of simulators, it’s because the training data doesn’t trope-weight superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn’t.
I think the problems are roughly equivalent. Creating training data that trope-weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can’t just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman.
I think the relevant question is what properties would be associated with superintelligences drawn from the prior. We don’t really have much training data of superhuman behaviour on general tasks, yet we can probably draw it out through powerful interpolation. So the properties associated with that behaviour would also have to be sampled from the human prior over what superintelligences are like. And if we lived in a world where superintelligences were universally described as being honest, why wouldn’t that have the same effect as the one where humans are described as honest, which makes sampling honest humans easy?
Yes — this is exactly what I’ve been thinking about!
Can we use RLHF or fine-tuning to coerce the LLM into interpreting the outside-text as undoubtedly, literally true?
If the answer is “yes”, then that’s a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.
Maybe I’m missing the point, but I would have thought the exact opposite: if outside text can unconditionally reset simulacra values, then anything can happen, including unbounded badness. If not, then we’re always in the realm of human narrative semantics, which—though rife with waluigi patterns as you so aptly demonstrate—is also pervaded by a strong prevailing wind in favor of happy endings and arcs bending toward justice. Doesn’t that at least conceivably mean an open door for alignment unless it can be overridden by something like unbreakable outside text?
What does ELK stand for here?
Eliciting Latent Knowledge.
If <|specialtoken|> is always prepended to true statements, I suppose it’s pretty good as brainwashing, but the token will still end up clustered close to other concepts associated with veracity, which are clustered close to claims about veracity, which are clustered close to false claims about veracity. If the model has enough context suggesting that it’s in a story where it’s likely to be manipulated, then suddenly feeling [VERIDICAL] could snap that narrative into place. The idea of “injected thoughts” isn’t new to it.
If, right now, I acquired the ability to see a new colour, and it flashed in my mind every time I read something true… I’d learn a strong association, but I’d treat it in a similar manner to how I treat the other inexplicably isolated intuitions I’ve inherited from my evolutionary origin.
Love the idea, though. I was hyped before I thought it through. It still seems worth exploring an array of special tokens as a means of nudging the AI towards specific behaviours we’ve reinforced. I’m not confident it won’t be very effective.
Do humans have a special token like this that exists outside language? How would it be encoded in the body?
One interesting candidate is a religious feeling of awe. It kinda works like that: when you’re in that state, you absorb beliefs. Social pressure also seems to work in a similar way.
This seems like it would only work if the LM doesn’t generalize the supposed Waluigi Effect to include this token. Suppose you make a token that specifies “definitely true and factual, for reals”. If some of the text ends up being wrong, for instance, the model may quickly switch to “ah, now it is time for me to be sneakily wrong!”, and it always keeps around some probability that it’s now meant to be sneakily wrong, because a token that always specifies ‘100% true and factual, for reals’ is an incredibly unlikely hypothesis to hold about the token a priori, and there are other hypotheses that predict basically the same token dynamics and are far more plausible.
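To make that last point concrete, here is a toy Bayes calculation (the hypotheses and numbers are invented purely for illustration): if both hypotheses predict the observed honest behaviour equally well, then no amount of well-behaved output moves the odds away from the prior.

```python
# Two hypotheses about what <|specialtoken|> means; both predict honest output
# on everything observed so far, so the observations cannot distinguish them.
priors = {
    "token_always_means_true": 0.01,    # a priori very unlikely
    "honest_so_far_twist_later": 0.10,  # waluigi-style hypothesis, more plausible a priori
}

def likelihood(hypothesis: str, n_observed_true: int) -> float:
    # Both hypotheses assign the same probability to a prefix of true statements.
    return 1.0

n_observed_true = 1_000
unnormalized = {h: p * likelihood(h, n_observed_true) for h, p in priors.items()}
total = sum(unnormalized.values())
posteriors = {h: v / total for h, v in unnormalized.items()}
print(posteriors)
# Posterior odds equal the prior odds (10:1 against "always true"), so some
# probability of "now meant to be sneakily wrong" always sticks around.
```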