My attempt at a summary: Let's fine-tune language models on stories of an AI Guardian which shuts itself down once it becomes super powerful. We'll then get our LLM to role-play as such a character so that it is amenable to being shut down. Corrigibility solved. Outer alignment pretty much solved. Inner alignment unclear.
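For concreteness, here's roughly what I picture the object-level recipe to be: a minimal sketch assuming plain causal-LM fine-tuning on a story corpus (the file name, model choice and hyperparameters are my placeholders, not anything taken from the post).

```python
# A rough sketch of how I read the proposal (my own illustration, not the author's
# code): fine-tune GPT-2 on a corpus of "Guardian" stories in which the AI chooses
# to shut itself down once it becomes too powerful. File name, model and
# hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One short story per line; each story ends with the Guardian shutting down.
stories = load_dataset("text", data_files={"train": "guardian_stories.txt"})["train"]
stories = stories.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guardian-gpt2",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=stories,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```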
My comment is blunt, apologies.
I think this alignment plan is very unlikely to be useful. It feels similar to RLHF in that it centers on fine-tuning language models to better produce text humans like, but it is worse in that it is far less steerable (with RLHF you can have an arbitrarily complex reward model, whereas here you are aiming for just the Guardian role-play; there are probably other ways in which it's strictly worse).
Various other problems I see in this plan:
Relying on narrative stories to build the archetype for the Guardian is sure to lead to distributional shift when your LLM becomes meaningfully situationally aware and realizes it is not in fact a character in short fictional stories but a part of the real world.
This approach also seems to place some unfortunate constraints on how we can use our AI system, making it less useful. For example, it might be hard to get alignment research out of such a system if it is very worried about becoming too powerful/influential and then having to shut down.
Relatedly, this doesn't seem to actually make any progress on the stop button problem. If your AI is aware that it has a planned shut-off mechanism that triggers at some capability level X, it probably just finds ways to avoid the specific trigger at X, or to self-modify (a toy sketch of what I mean by the trigger is below).
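To make that concrete, here is a toy version of the kind of hard-coded trigger I have in mind (all names and the threshold are made up). The worry is that a sufficiently capable policy that models this check can keep its measured score just under X, or act through channels the probe never sees.

```python
# Toy illustration only (hypothetical names): an external wrapper that shuts the
# system down when a measured capability score crosses a fixed threshold X. Nothing
# stops a policy that models this check from steering its measured score below X,
# or from modifying/disabling the wrapper itself.
SHUTDOWN_THRESHOLD_X = 0.9

def guarded_step(policy, observation, capability_probe):
    """Run one step of `policy`, shutting down if the probe reports capability >= X."""
    capability_score = capability_probe(policy)
    if capability_score >= SHUTDOWN_THRESHOLD_X:
        return "SHUTDOWN"            # the Guardian behaviour the stories train for
    return policy(observation)       # below the trigger, the policy acts as usual
```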
Hello Aaron,
Sorry it took me some time to reply, but you might find it worthwhile to read my updated account of this approach, linked below:
https://www.lesswrong.com/posts/pu6D2EdJiz2mmhxfB/gpt-2-shuts-down-itself-386-times-post-fine-tuning-with
I will answer your questions, if any, in that post. Thank you.
Thank you.
No time to rest. I’m starting to build The Guardian version 002.