Thinking about this more, it seems like there are some key background assumptions that I’m missing.
Some assumptions I often hear presented on this topic are things like:
1. “A misaligned AI will explicitly try to give us hard-to-find vulnerabilities, so verifying arbitrary statements from these AIs will be incredibly hard.”
2. “We generally need incredibly high assurances to build powerful systems that don’t kill us.”
My obvious counter-arguments would be:
1. Sure, but a smart overseer would have a reasonable prior that any given agent is misaligned, and would also assign such agents tasks that are particularly easy to verify. Any action actually taken by a smart overseer, using information provided by an agent with a known probability M of being misaligned, should be positive in expected value. With some creativity, there are likely many ways of structuring things (using systems unlikely to be misaligned, asking more verifiable questions) such that many of the resulting actions are heavily EV-positive. (A toy EV calculation after this list makes the basic trade-off concrete.)
2. “First, my argument in (1) applies again. Second, we can build these systems gradually, and with a lot of help from people/AIs that won’t require such high assurances.” (This is similar to the HCH / oversight arguments.)
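To make the EV claim in (1) a bit more concrete, here is a minimal sketch under assumptions of my own choosing: the names M, p_catch, benefit, and harm, and the example numbers, are purely illustrative rather than a serious model of oversight.

```python
# Toy expected-value calculation for acting on advice from a possibly
# misaligned agent. All quantities are illustrative assumptions:
#   M       - overseer's prior probability that the agent is misaligned
#   p_catch - probability that verification catches a bad suggestion
#   benefit - value of acting on good advice
#   harm    - cost of acting on bad advice that slips past verification

def ev_of_acting(M, p_catch, benefit, harm):
    """Expected value of acting on the agent's suggestion after verification.

    If the agent is aligned (probability 1 - M), we gain `benefit`.
    If it is misaligned (probability M), verification catches the problem
    with probability p_catch (we simply don't act, value 0); otherwise we
    suffer `harm`.
    """
    return (1 - M) * benefit - M * (1 - p_catch) * harm


if __name__ == "__main__":
    # Even with a 30% misalignment prior, an easy-to-verify task
    # (p_catch = 0.95) keeps the expected value positive: 0.55.
    print(ev_of_acting(M=0.3, p_catch=0.95, benefit=1.0, harm=10.0))
    # A harder-to-verify task (p_catch = 0.5) flips the sign for the
    # same prior: -0.8.
    print(ev_of_acting(M=0.3, p_catch=0.5, benefit=1.0, harm=10.0))
```

The point of the sketch is just that the sign of the expected value is driven by how verifiable the task is, which is why routing possibly-misaligned agents toward easy-to-verify work seems like it should buy a lot.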