I don’t dispute any of that, but I also don’t think RLHF is a workable method for building or aligning a powerful AGI.
Zooming out, my original point was that there are two problems humanity is facing, quite different in character but both very difficult:
a coordination / governance problem, around deciding when to build AGI and who gets to build it
a technical problem, around figuring out how to build an AGI that does what the builder wants at all.
My view is that we are currently on track to solve neither of those problems. But if you actually consider what the world in which we sufficiently-completely solve even one of them looks like, it seems like either is sufficient for a relatively high probability of a relatively good outcome, compared to where we are now.
Both possible worlds are probably weird hypotheticals which shouldn’t affect our actual strategy in the world we actually live in, which is of course to pursue solutions to both problems simultaneously with as much vigor as possible. But it still seems worth keeping in mind that if even one of the two works out sufficiently well, we probably won’t be totally doomed.
How does a solution to the above solve the coordination/governance problem?
I think the theory is something like the following: We build the guaranteed trustworthy AI, and ask it to prevent the creation of unaligned AI, and it comes up with the necessary governance structures, and the persuasion and force needed to implement them.
I’m not sure this argument goes through. Some political actions are simply impossible to accomplish ethically, and therefore unavailable to a “good” actor even given superhuman abilities.