Could someone kindly explain why these two sentences are not contradictory?
1. “If a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.”
2. “There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it.”
Why doesn’t it work to make an unaligned AGI that writes the textbook, then have some humans read and understand the simple robust ideas, and then build a new aligned AGI with those ideas? If the ideas are actually simple and robust, it should be impossible to tunnel a harmful AGI through them, right? If the unaligned AI refuses to write the textbook, couldn’t we just delete it and try again? Or is the claim that we wouldn’t be able to distinguish between this textbook and a piece of world-ending mind-control? (Apologies if this is answered elsewhere in the literature; it’s hard to search.)
simple and robust != checkable
Imagine you have to defuse a bomb, and you know nothing about bombs, and someone tells you “cut the red one, then blue, then yellow, then green”. If this really is a way to defuse the bomb, it is simple and robust. But since you have no knowledge about bombs, you can’t check it; you can only take it on faith (and if you try it and it’s not the right way, you’re dead).
But we can refuse to be satisfied with instructions that look like “cut the red one, then blue, etc.” We should request that the AI writing the textbook explain from first principles why that will work, in a way that is maximally comprehensible to a human or team of humans.
Did you mean “in a way that maximally convinces a human or a team of humans that they understand everything”? I don’t think this is a good idea.
“Cut the red wire” is not an instruction that you would find in a textbook on bomb defusal, precisely because it is not robust.
I’m not sure I understand correctly what you mean by “robust”. Can you elaborate?
I think it’s the last thing you said. I think the claim is that there are very convincing possible fake textbooks, such that we wouldn’t be able to see anything wrong or fishy about the fake textbook just by reading it, but if we used the fake textbook to build an AGI then we would die.
What Steven Byrnes said, but also my reading is that 1) in the current paradigm it’s near-damn-impossible to build such an AI without creating an unaligned AI in the process (how else do you gradient-descend your way into a book on aligned AIs?), and 2) if you do make an unaligned AI powerful enough to write such a textbook, it’ll probably proceed to convert the entire mass of the universe into textbooks, or do something similarly incompatible with human life.