Roger Dearnaley comments on Agentized LLMs will change the alignment landscape

Roger Dearnaley 22 Apr 2023 4:44 UTC
3 points
0
I agree that an Auto-GPT style agent using self-editing context is likely to communicate in a natural language. Probably English, as the language GPT-4′s training data contained the most of. A future Auto-GPT-N and attempting to avoid monitoring of its stream-of-consciousness might start speaking a more obscure language, or doing a simple encryption like Base64 encoding: it’s not hard to get GPT3.5 to take input and give output in Base64, but it’s enough extra cognitive load per token, and/or messes up the token embeddings enough, that the resulting very-lightly-encrypted LLM is fairly brain damaged. So this is an example of a ‘misalignment tax’, where attempting to be deceptive imposes a significant additional cost.
Of course Base64 isn’t a cryptographically strong encryption scheme. Even for some very smart future AutoGPT-N running on GPT-N, it couldn’t use encryption in its context stream without somewhere to put the key (or some riddle or set of trivia questions whose answer was the key, or whatever). However, if smart enough, it might be able to do something like steganographically concealing a side-channel message in an innocent-looking English text, say as some form of lightly-encrypted/compressed acrostic, but it would have to be smart enough that every time it ran GPT-N against the text to generate another token, that spotted the acrostic, figured out how to decrypt it, paid attention to the side-channel message rather than the main one, and played along and continued this exercise. Not theoretically impossible for something enough smarter than a human, but the misalignment tax gets a good deal bigger, and it’s not obvious why something trained on imitating text generated by humans would learn this ability, except through inadvertent fine-tuning on text contaminated with such acrostic stenographic side-channels embedded. Obviously an aligned, comparably smart AutoGPT-N not paying the no-alignment overhead of having to steganographically conceal its thoughts that was used for monitoring the misaligned Auto-GPT-N shouldn’t find it too hard to catch on to what’s happening here, even if a human couldn’t do it. If nothing else, the pressure of having to fulfill the acrostic requirements for the side-channel is going to make it hard to make the innocent-looking text plausible, presumably raising its perplexity when analyzed by an LLM that hasn’t spotted the side-channel.