researchers at big labs will not be forced to program an ASI to do bad things against the researchers’ own will
Well, these systems aren’t programmed. Researchers work on architecture and engineering; goal content is down to the RLHF that is applied and the wishes of the user(s), and the wishes of the user(s) are determined by market forces, user preferences, etc. And user preferences may themselves be influenced by other AI systems.
Closed source models can have RLHF and be delivered via an API, but open source models will not be far behind at any given point in time. And of course prompt injection attacks can bypass the RLHF on even closed source models.
The decisions about what RLHF to apply on contentious topics will come from politicians and from the leadership of the companies, not from the researchers. And politicians are influenced by the media and elections, and company leadership is influenced by the market and by cultural trends.
Where does the chain of control ultimately ground itself?
Answer: it doesn’t. Control of AI in the current paradigm is floating. Various players can influence it, but there’s no single source of truth for “what’s the AI’s goal”.
I don’t dispute any of that, but I also don’t think RLHF is a workable method for building or aligning a powerful AGI.
Zooming out, my original point was that there are two problems humanity is facing, quite different in character but both very difficult:
a coordination / governance problem, around deciding when to build AGI and who gets to build it
a technical problem, around figuring out how to build an AGI that does what the builder wants at all.
My view is that we are currently on track to solve neither of those problems. But if you actually consider what the world in which we sufficiently solve even one of them looks like, it seems like either is sufficient for a relatively high probability of a relatively good outcome, compared to where we are now.
Both possible worlds are probably weird hypotheticals and shouldn’t change our actual strategy in the world we actually live in, which is of course to pursue solutions to both problems simultaneously with as much vigor as possible. But it still seems worth keeping in mind that if even one of them works out sufficiently well, we probably won’t be totally doomed.
How does a solution to the above solve the coordination/governance problem?
I think the theory is something like the following: We build the guaranteed trustworthy AI, and ask it to prevent the creation of unaligned AI, and it comes up with the necessary governance structures, and the persuasion and force needed to implement them.
I’m not sure this argument goes through. Some political actions are simply impossible to accomplish ethically, and therefore unavailable to a “good” actor even given superhuman abilities.