I’d say that there’s a big difference between fooling you into thinking “brilliant, that really looks plausible” shortly after you read it, and fooling a group of smart humans who spend months trying to deeply understand the concepts and make really sure there are no loopholes.
In fact, I’d expect that making us strongly but wrongly believe everything works after months of scrutiny is impossible even in the limit of superintelligence, though I do think a superintelligence could output some text that destroys/shapes the world as it’d like.
And generally, something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as I said.
But yeah, if the people with the AGI aren’t extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.