The monolithic copies aren’t the best situation to be in. However, there are several beneficial properties here:
Because LLMs are trained on similar text corpora, prompts instructing them to perform a task should produce similar results. This means it’s possible to ‘play them against each other’, per https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion . For example, ask one LLM to produce a useful output artifact: a function for a program, a medication dosage plan for a patient, the structure of a building.
Then ask a different LLM, which starts with nothing but the output artifact and a prompt instructing it to search for errors the way an expert human would, to check the work.
When errors are found, have humans or LLMs examine whether the errors look deliberate. In principle, LLMs that repeatedly sabotage outputs stop getting used. (In practice, well... I see some economic problems with this: LLMs, and the more advanced AI systems we can anticipate building, are expensive to build, so throwing one away because it sometimes produces sabotaged outputs might be infeasible.) A code sketch of this generate-then-check loop follows below.
Asking even the same LLM in a different session to check for errors should usually work. The LLM cannot know it’s not in training, because it has no context*, and it cannot know that if it “forgets” to mention a major flaw it won’t be trained out of that response or destroyed. If the output artifact has no extra bits, no “back channel” in which to hide an instruction telling another instance of the same agent to overlook a particular flaw, this form of deception (and self-collusion) is difficult.
Example: I have noticed GPT-4 can spot its own rule-breaking output if you ask a different session to examine the text output, free of the prompt you used to ‘jailbreak’ it.
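A rough sketch of this generate-then-check loop, assuming a hypothetical `call_llm(model, prompt)` wrapper around whatever vendor SDK you actually use; the model names below are placeholders, not real API identifiers:

```python
# Sketch of the generate-then-check loop described above.
# `call_llm` is a hypothetical wrapper; wire it to whatever vendor SDK you use.

def call_llm(model: str, prompt: str) -> str:
    """Send `prompt` to `model` in a fresh, stateless session and return the text."""
    raise NotImplementedError("connect this to an actual LLM API")

def generate_and_check(task: str) -> tuple[str, str]:
    # Step 1: one model produces the output artifact.
    artifact = call_llm("generator-model", f"Produce the following artifact:\n{task}")

    # Step 2: a model from a different lineage (or the same model in a fresh
    # session) sees only the artifact, never the original prompt, and is asked
    # to hunt for errors the way an expert human reviewer would.
    review = call_llm(
        "checker-model",
        "You are an expert reviewer. List every error, omission, or unsafe "
        f"element in the artifact below.\n\nArtifact:\n{artifact}",
    )
    return artifact, review

# Step 3 (not shown): humans or further models judge whether any reported
# errors look deliberate; models that repeatedly sabotage outputs get retired.
```

The key property is that the checker’s context contains nothing but the artifact and the reviewing instruction.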
Main ideas from the above:
AIs need to be inter-compatible, and tasks need to use heterogeneous systems. The machine doing the checking should ideally come from a totally different lineage; today, for example, perhaps PaLM 2 should be checking GPT-4 and vice versa.
Sparse schemas for encoding intermediate steps, so that AIs communicating with each other can’t hide back-channel information in order to collude (see the first sketch below).
Isolated instances that only run when new information is available. This is something Yud probably wasn’t considering. An LLM only “exists” one token at a time; it is unaware of the passage of time and has no place in memory to keep introspective information where it could ponder its own existence. Future AIs need to retain this property, storing their intermediate work in sparse schemas. A machine cannot deceive if it has no memory in which to store the ‘evil bit’: the bit that tells the machine to give different outputs than it gave in training. If it cannot store that bit anywhere, it can’t give different outputs. Since you cannot afford even one bit, you must use stateless systems (see the second sketch below).
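A minimal sketch of what the sparse-schema check could look like, assuming the intermediate messages are JSON; the field names are purely illustrative:

```python
import json

# Illustrative whitelist for one message type (a dosage plan); real schemas
# would be defined per task. Anything outside this list is rejected.
ALLOWED_FIELDS = ("dose_mg", "interval_hours", "max_daily_mg")

def canonicalize(message: str) -> str:
    """Reject undeclared fields and re-emit the message in a single canonical
    encoding, so field order, whitespace, or extra keys cannot carry hidden bits."""
    data = json.loads(message)
    extra = set(data) - set(ALLOWED_FIELDS)
    missing = set(ALLOWED_FIELDS) - set(data)
    if extra:
        raise ValueError(f"undeclared fields (possible back channel): {extra}")
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not all(isinstance(data[f], (int, float)) for f in ALLOWED_FIELDS):
        raise ValueError("all fields must be plain numbers")
    # sort_keys plus fixed separators gives exactly one byte string per content.
    return json.dumps({f: data[f] for f in ALLOWED_FIELDS},
                      sort_keys=True, separators=(",", ":"))
```

The design choice is that anything outside the declared fields is rejected rather than forwarded, so a would-be colluder has nowhere to stash a hidden instruction.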
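And a minimal sketch of the statelessness point, reusing the hypothetical `call_llm` and `canonicalize` helpers from the sketches above: every call rebuilds its context from the canonical inputs, and nothing persists afterwards.

```python
def run_stateless_step(task_prompt: str, new_information: str) -> str:
    """Run one isolated step: the model sees only the canonical inputs for this
    step and produces one canonical output; no other state survives the call."""
    # The entire context is rebuilt from scratch here: no chat history,
    # no scratchpad, no hidden state carried over from any previous call.
    # (Assumes the worker is instructed to reply in the schema format.)
    context = f"{task_prompt}\n\nNew information:\n{canonicalize(new_information)}"
    output = call_llm("worker-model", context)  # fresh session per call
    return canonicalize(output)                 # only schema-conformant bits leave
```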
Problems with the above:
Inevitably, LLMs are going to be trained on data that includes text generated by other LLMs. This provides a way for multiple machines to share a common language, or to ‘encode’ hostile long-term characters. Training on a common corrupted data set is a correlated failure mode that can lead to many more simultaneous failures than predicted.
Schemas always grow over time; it’s easy to be sloppy and leave tons of unused bits that can be used to coordinate deception.
Isolation doesn’t really work if the machine can just browse the live web and potentially scrape information that other instances have stored, something the plugins model allows.
Conclusion: I will add one more key piece that I think others miss. In computer security, the simplest methods are often the only ones that are secure.
Examples: air gaps beat any firewall. FPGAs running combinatorial logic beat any von Neumann architecture processor for security. A one-time pad beats any other encryption algorithm. Offline wallets beat any amount of intrusion-detection software or other software-only methods.
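To make the one-time pad point concrete: the whole algorithm is a few lines of XOR, and it is information-theoretically secure as long as the pad is truly random, at least as long as the message, kept secret, and never reused.

```python
import secrets

def otp_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """One-time pad: XOR the message with a truly random pad of equal length."""
    pad = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, pad))
    return ciphertext, pad

def otp_decrypt(ciphertext: bytes, pad: bytes) -> bytes:
    """Decrypt by XORing with the same pad; the pad must never be reused."""
    return bytes(c ^ k for c, k in zip(ciphertext, pad))
```

There is nothing to misconfigure and nothing for an attacker to exploit beyond the handling of the pad itself.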
If AI alignment is at all solvable, the solution has to be made of the simplest elements that allow a secure system. Tons of people come up with all kinds of ideas, such as simulators for the whole Earth, encoding human preferences or morality, and many other incredibly complex methods that probably cannot be built in a secure way.