I’m uncertain about all this, but here are some quick takes.
With respect to technical intent alignment, I think we’re very lucky that a lot of safety research will probably be automatable by non-x-risky systems (sub-ASL-3, very unlikely to be scheming because they’re bad at prerequisites like situational awareness, often surprisingly transparent because they use CoT, tools, etc.). So I think we could be in a really good position if we actually tried hard to use such systems for automated safety research (for now, it doesn’t seem to me like we’re trying all that hard as a community).
I’m even more uncertain about the governance side, especially about what should be done. I think widely distributed intelligence explosions driven by open-weight LMs are probably really bad, so hopefully at least the very powerful LMs don’t get open-weighted. Beyond this, though, I’m more unsure about more multipolar vs. more unipolar scenarios, given e.g. the potential lack of robustness of single points of failure. I’m somewhat hopeful that nation-level actors impose enough constraints/regulation at the national level, and then something like https://aiprospects.substack.com/p/paretotopian-goal-alignment happens at the international level. We might also just get somewhat lucky, in that compute constraints plus economic and security incentives might mean there are never more than e.g. 20 actors with (at least direct, e.g. weights-level) access to very strong superintelligence.
20-100 actors feels like a reasonable number to coordinate on treaties. At around 300, I start to worry that there’d be some reckless defector in the mix who takes risks that destroy themselves and everyone else. With just 2 or 3 actors, I worry that there would be weird competitive tensions that make it hard to come to a settlement. I dunno, maybe I’m wrong about that, but it’s how I feel.
I’ve been writing a bit about some ideas around trying to establish a ‘council of guardians’ made up of the world’s major powers. They would agree to mutual inspections and to collaborate on stopping the unauthorized development of rogue AI and self-replicating weapons.
I’ve been thinking about similar international solutions, so I look forward to seeing your thoughts on the matter.
My major concern is sociopathic people gaining the reins of power over just one of those AGIs and defecting against that council of guardians. I think sociopaths are greatly overrepresented among powerful people; they care less about the downsides of having and aggressively pursuing power.
That’s why I think even 20 RSI-capable, human-directed AGIs wouldn’t be stable for more than a decade.
Yeah, I see it as sort of a temporary transitional mode for humanity. I also don’t think it would be stable for long. I might give it 20-30 years, but I would be skeptical about it holding for 50 years.
I do think that even 10 years more to work on more fundamental solutions to the AGI transition would be hugely valuable though!
I have been attempting, at least, to imagine how to design a system that assumes all the actors will be selfish and tempted to defect (and possibly sociopathic, as power-holders sometimes are or become), yet are prevented from breaking the system: defection-resistant mechanisms, where you only need a majority of the council to stay honest in a given ‘event’ in order for them to halt and punish the defectors. Hopefully, making it obvious that this is the case, and that defection would get noticed and punished, would deter even sociopathic power-holders from defecting.
This seems possible to accomplish if the system is designed so that catching and punishing an attempted defection gives the enforcers higher expected value, by their own estimation, than the option of also defecting once they detect someone else defecting.
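To make that incentive condition a bit more concrete, here’s a rough toy sketch in Python. Everything in it (the council size, the payoffs, the probabilities) is a made-up number, purely for illustration: each guardian compares the expected value of enforcing against a detected defector with the expected value of joining the defection, given how likely they think the rest of the council is to enforce.

from math import comb

# Toy numbers only: council size, payoffs, and probabilities are all
# made-up illustrative assumptions, not estimates of anything real.

N = 20                  # hypothetical council size
MAJORITY = N // 2 + 1   # enforcers needed to halt and punish a defector

# Illustrative payoffs, in arbitrary utility units:
V_STATUS_QUO  = 10.0    # stable regime continues and the defector is punished
ENFORCE_COST  = 1.0     # cost of participating in inspections/enforcement
V_DEFECT_WIN  = 30.0    # payoff to a defector if enforcement fails
V_DEFECT_LOSE = -50.0   # payoff to a defector who gets caught and punished

def p_at_least(q, n_others, needed):
    # P(at least `needed` of `n_others` guardians enforce), each enforcing w.p. q.
    return sum(comb(n_others, k) * q**k * (1 - q)**(n_others - k)
               for k in range(needed, n_others + 1))

def ev_enforce(q):
    # I enforce, so only MAJORITY - 1 of the other N - 1 guardians need to join me.
    p = p_at_least(q, N - 1, MAJORITY - 1)
    return p * (V_STATUS_QUO - ENFORCE_COST)        # failed enforcement modeled as 0

def ev_join_defection(q):
    # I defect too, so a full MAJORITY of the other N - 1 guardians must enforce.
    p = p_at_least(q, N - 1, MAJORITY)
    return p * V_DEFECT_LOSE + (1 - p) * V_DEFECT_WIN

for q in (0.5, 0.7, 0.9):
    print(f"P(each other guardian enforces) = {q}: "
          f"EV(enforce) = {ev_enforce(q):+.1f}, "
          f"EV(join defection) = {ev_join_defection(q):+.1f}")

Under made-up numbers like these, enforcing beats joining the defection whenever each guardian expects the others to enforce with reasonably high probability, which is roughly the property you’d want the mechanism (inspections, transparency, credible punishment) to guarantee.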
Seems like a good problem to largely defer to AI, though (especially if we’re assuming alignment in the instruction-following sense), so maybe not the most pressing.
Unless there are important factors around ‘order of operations’. By the time we have a powerful enough AI to solve this for us, it could be that someone is already defecting by using that AI to pursue recursive self-improvement at top speed…
I think that is probably the case. We need to get the Council of Guardians in place, and actively preventing defection, before it’s too late and irreversibly bad defection has already occurred.
I am unsure of exactly where the thresholds are, but I am confident that nobody else should be confident that there aren’t any risks! Our uncertainty should cause us to err on the side of putting safe governance mechanisms in place ASAP!