Chiming in on toy models of research incentives: Seems to me like a key feature is that you start with an Arms Race which, after some amount of capabilities accumulate, transitions into a Suicide Race. But players have only vague estimates of where that threshold is, their estimates vary widely, and they may not be able to communicate those estimates effectively or persuasively. Each player has a strong incentive to push right up to the line where things get obviously (to them) dangerous, and with enough players, somebody's estimate is going to be wrong.
Working off a model like that, we'd much rather be playing the version where players can effectively share estimates and converge on a view of what level of capabilities makes things get very dangerous. Lack of constructive conversations with the largest players on that topic does sound like a current bottleneck.
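For concreteness, here's a minimal Monte Carlo sketch of that toy model (Python). Everything numeric is a made-up assumption for illustration: the true danger threshold, the noise in each player's estimate, and the safety margin they leave. The point it shows is just the shape of the argument: if each player pushes up to their own noisy estimate, the chance that somebody crosses the real line grows with the number of players, while converging on the most cautious shared estimate makes it shrink.

```python
# Hypothetical toy simulation; all parameters are illustrative assumptions.
import random

def fraction_of_worlds_where_someone_crosses(
    n_players: int,
    share_estimates: bool,
    true_threshold: float = 100.0,
    estimate_noise: float = 15.0,   # std dev of each player's threshold estimate
    safety_margin: float = 5.0,     # how far below their own estimate each player stops
    trials: int = 10_000,
) -> float:
    """Fraction of simulated worlds where at least one player pushes
    capabilities past the true (unknown to them) danger threshold."""
    crossings = 0
    for _ in range(trials):
        estimates = [random.gauss(true_threshold, estimate_noise) for _ in range(n_players)]
        if share_estimates:
            # Players converge on a common, most-cautious view of the threshold.
            crossed = min(estimates) - safety_margin > true_threshold
        else:
            # Each player independently pushes right up to their own estimate.
            crossed = any(e - safety_margin > true_threshold for e in estimates)
        crossings += crossed
    return crossings / trials

for n in (2, 5, 20):
    print(n,
          round(fraction_of_worlds_where_someone_crosses(n, share_estimates=False), 3),
          round(fraction_of_worlds_where_someone_crosses(n, share_estimates=True), 3))
```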
It’s unclear to me to what extent there’s even a clear, universally understood distinction between mundane weak AI systems, with ordinary kinds of risks, and superhuman AGI systems, with exotic risks that software and business people aren’t used to thinking about outside of sci-fi. That strikes me as a key inferential leap that may be getting glossed over.
There’s quite a lot of effort spent in technology training people that systems are mostly static absent human intervention or well-defined automations that some person ultimately wrote, and that anything else is a fault that gets fixed. Computers don’t have a mind of their own, troubleshoot instead of anthropomorphizing, etc., etc. That this intuition will at some point stop working or stop being true of a sufficiently capable system (and that this is fundamentally part of what we mean by human-level AGI) probably needs more focus, since it’s explicitly contrary to the basic induction that’s part of usefully working in/on computers.