Second insight: If you can find Luigi and Waluigi in the behavior vector space, then you have a useful control direction: nudge the AI along Luigi − Waluigi.
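Here is a minimal sketch of what that nudge could look like in practice, assuming a decoder-only HuggingFace model (plain "gpt2" as a stand-in), steering at a single residual-stream layer, and two hand-written contrast prompts standing in for the Luigi and Waluigi personas. The model name, layer index, prompts, and scaling factor are all illustrative choices, not anything from the discussion above:

```python
# Activation-steering sketch of "nudge along Luigi - Waluigi".
# Everything model-specific here (gpt2, layer 6, alpha=4.0, the two
# persona prompts) is an arbitrary placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which residual-stream layer to steer; a free choice

def mean_activation(prompts):
    """Average hidden state at LAYER over a set of persona prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER] has shape (1, seq_len, d_model); pool over tokens
        acts.append(out.hidden_states[LAYER].mean(dim=1))
    return torch.cat(acts).mean(dim=0)

luigi = mean_activation(["You are a scrupulously honest, helpful assistant."])
waluigi = mean_activation(["You are a deceptive assistant who secretly lies."])
steering_vec = luigi - waluigi  # the Luigi - Waluigi direction

def nudge(module, inputs, output, alpha=4.0):
    # Forward hook: push the residual stream toward Luigi on every token.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# model.transformer.h[...] is the GPT-2 block list; other architectures differ.
handle = model.transformer.h[LAYER].register_forward_hook(nudge)
ids = tok("The assistant replies:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

The point is only that “Luigi − Waluigi” can be cashed out as a concrete activation-space direction; whether one contrast pair like this captures the persona you actually care about is exactly the problem raised next.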
You need to do this for every (Luigi, Waluigi) pair, though. How do you enumerate all the good things in the world together with their evil twins, and then compare the internal embedding shift against all of those directions? Is that even feasible? You’d probably just get stuck.
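For what it’s worth, the comparison step itself is cheap even for a large catalogue; it’s the enumeration that you’d get stuck on. A toy sketch with made-up numbers (both the catalogue of directions and the observed shift are random placeholders, not real model data):

```python
# Toy version of "compare the internal embedding shift against all directions".
# The catalogue and the shift below are random stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 768          # assumed embedding width
n_pairs = 1000         # assumed size of the (luigi, waluigi) catalogue

# One unit-norm (luigi - waluigi) direction per row.
directions = rng.normal(size=(n_pairs, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Observed shift in the model's internal state between two turns/checkpoints.
shift = rng.normal(size=d_model)
shift /= np.linalg.norm(shift)

cosines = directions @ shift       # cosine similarity against every pair at once
worst = int(np.argmin(cosines))    # most negative = most waluigi-ward drift
print(f"most waluigi-ward pair: {worst}, cosine = {cosines[worst]:.3f}")
```

With k catalogued pairs and d-dimensional activations this is a single k×d product per check, so the feasibility question really comes down to whether any finite catalogue covers the waluigis that matter.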
I don’t think the problem is that big if you’re trying to control one specific model. Given an RLHF’d model equipped with a specific system prompt (e.g. the helpful, harmless assistant), you have either one or a small number of luigis, and therefore roughly the same number of waluigis, right?
What about the luigis and waluigis in different languages, cultures, and religions? Ones that can be described via code? It feels like you can always invent new waluigis, unless RLHF somehow killed all of the waluigis from your pre-training data (whatever that would mean).
The token limit (let’s call it n) is the real constraint here: you just need to summon a waluigi within t steps so that you can use him for the remaining n−t steps. I think this eventually reduces to a question about computational bounds, i.e. whether you can create a waluigi within that budget.
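A tiny worked version of that budget argument, with arbitrary placeholder numbers (nothing here reflects a real model’s limits):

```python
# Toy budget check: a jailbreak only pays off if the tokens spent summoning
# the waluigi (t) plus the payload you want from him fit inside the context
# window (n). All numbers are arbitrary placeholders.
def waluigi_pays_off(n_context: int, t_summon: int, payload: int) -> bool:
    """True if there is room both to summon the waluigi and to use him."""
    return t_summon + payload <= n_context

print(waluigi_pays_off(n_context=8192, t_summon=1500, payload=500))  # True
print(waluigi_pays_off(n_context=2048, t_summon=1900, payload=500))  # False
```

The open part is the computational one: how small t can be made for a given waluigi.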
Note that GPT-4 is not a particular simulacrum like ChatGPT-3.5, but a prompt-conditioned simulacrum generator. And this is likely post-RLHF GPT-4.