This sounds way too capable to be safe. Although someone is probably working on this right now, this line of thought getting traction might increase the number of people doing it 10x. Maybe that's good, since GPT-4 probably isn't smart enough to kill us, even with an agent wrapper. It will just scare the pants off of us.
Aligning the wrapper is somewhat similar to my suggestion of aligning an RL critic network head, like the one humans seem to use. Align the captain, not the crew: let the captain use the crew's smarts without giving the crew much say in what to do or in how they get updated.
It'd be interesting to figure out where the biggest danger in this setup comes from: 1) difficulty of aligning the wrapper, 2) wild behavior from the LLM, or 3) something else. And whether there can be spot fixes for any of it.