Imagine you have to defuse a bomb, and you know nothing about bombs, and someone tells you “cut the red one, then blue, then yellow, then green”. If this really is a way to defuse a bomb, it is simple and robust. But you (since you have no knowledge about bombs) can’t check it, you can only take it on faith (and if you tried it and it’s not the right way—you’re dead).
But we can refuse to be satisfied with instructions that look like “cut the red one, then blue, etc.”. We should request that the AI writing the textbook explain from first principles why that will work, in a way that is maximally comprehensible to a human or team of humans.
simple and robust != checkable
Did you mean “in a way that maximally convinces a human or a team of humans that they understand everything”? I don’t think this is a good idea.
“Cut the red wire” is not an instruction that you would find in a textbook on bomb defusal, precisely because it is not robust.
I’m not sure I understand correctly what you mean by “robust”. Can you elaborate?