Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token: if the neural network had deceptive thoughts, we could see them in these intermediate messages.
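To make the proposal concrete, here is a minimal sketch of that architecture, assuming a hypothetical `query_llm()` wrapper around whatever completion API the submodules run on (the wrapper, the module instructions, and `emulate_step` are all invented for illustration). The point is only the shape of the design: every inter-module message is plain English and gets logged for human inspection.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around a smaller LLM (e.g., GPT-3).

    Plug in your actual model API here; this stub exists only so the
    sketch is self-contained.
    """
    raise NotImplementedError("connect this to a real completion API")


# Illustrative submodule roles; a real system would tune these prompts.
SUBMODULES = [
    "Summarize the input so far and list the relevant facts.",
    "Given those facts, propose what the next output should be.",
    "Check the proposal for errors or inconsistencies, then finalize it.",
]


def emulate_step(user_input: str) -> tuple[str, list[str]]:
    """Run one output step through the chain of submodules.

    Returns the final output plus the full inter-module transcript,
    which is the part a human overseer would read for deceptive thoughts.
    """
    transcript: list[str] = []
    message = user_input
    for instruction in SUBMODULES:
        message = query_llm(f"{instruction}\n\nInput:\n{message}")
        transcript.append(message)  # every intermediate "thought" is visible
    return message, transcript
```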
This doesn’t account for the possibility that there’s still steganography involved. Plain English coming from an LLM may not be so plain, given point 33 (Alien Concepts: “The AI does not think like you do”): there may not necessarily be a humanly understandable explanation for cognition done by crunching numbers through matrix products.
Considering that current language models are already able to create their own “language” to communicate with each other without context (hit or miss, admittedly), who’s to say a deceptive model couldn’t find a way to hide misaligned thoughts in human language, like puzzles that spell a message using the first letter of every fourth word in a sentence? There could be some arbitrarily complicated algorithm (e.g., https://twitter.com/robertskmiles/status/1663534255249453056) hiding the subversive message in the “plain English” statement; a toy sketch of the simple acrostic version appears below.
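As a toy illustration of that acrostic scheme (the function name and cover sentence are invented here), the following shows how trivially a “plain English” sentence can carry a hidden payload; an actual deceptive encoding could of course be far less detectable than this.

```python
def decode_every_nth_word(text: str, n: int = 4) -> str:
    """Read the first letter of every n-th word (1st, 5th, 9th, ...)."""
    words = text.split()
    return "".join(w[0] for w in words[::n])


# An innocuous-looking cover sentence whose every-fourth-word initials
# spell a hidden instruction.
cover = "Rain fell all day, umbrellas everywhere you looked, nobody seemed happy."
print(decode_every_nth_word(cover))  # -> "Run"
```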