A common problem with deploying language models for high-stakes decision making are prompt-injections. If you give ChatGPT-4 access to your bank account information and your email and don’t give proper oversight over it, you can bet that somebody’s going to find a way to get it to email your bank account info. Some argue that if we can’t even trust these models to handle our bank account and email addresses, how are we going to be able to trust them to handle our universe.
A common problem with deploying language models for high-stakes decision making are prompt-injections. If you give ChatGPT-4 access to your bank account information and your email and don’t give proper oversight over it, you can bet that somebody’s going to find a way to get it to email your bank account info. Some argue that if we can’t even trust these models to handle our bank account and email addresses, how are we going to be able to trust them to handle our universe.
An approach I’ve currently started thinking about, and don’t know of any prior work with our advanced language models on: Using the security amplification (LessWrong version) properties of Christiano’s old Meta-execution (LessWrong version).