If the AGI outputs the textbook of the future on alignment, I'd say we could understand it sufficiently well to be sure that our AI will be aligned/corrigible.
It’s at least not obvious to me that we would be able to tell apart “the textbook of the future on alignment” from “an artifact that purports to be the textbook of the future on alignment, but is in fact a manipulative trick”. At least, I think we wouldn’t be able to tell them apart in the limit of superintelligent AI.
I mean, it’s not like I’ve never found an argument compelling and then realized much later on that it was wrong. Hasn’t everyone?
I’d say there’s a big difference between fooling you into thinking “brilliant, that really looks plausible” shortly after you read it, and fooling a group of smart humans who spend months trying to deeply understand the concepts and to make really sure there are no loopholes.
In fact, I’d expect that making us strongly but wrongly believe that everything works, after months of scrutiny, is impossible even in the limit of superintelligence, though I do think a superintelligence could output some text that destroys/shapes the world as it’d like.
And generally, something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as I said.
But yeah, if the people with the AGI aren’t extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.