AI itself can take care of next steps, if it cares about alignment as much as we do
That’s where I put most of P(doom), that the first AGIs are loosely aligned but only care about alignment about as much as we do, and that Moloch holds enough sway with them to urge immediate development of more capable AGIs, using their current capabilities to do that faster and more recklessly than humans could, well before serious alignment security norms are in place.
There will be fewer first AGIs than there are human researchers, and they will be smarter than human researchers. So if they care about alignment as much as we do, that seems like good news—they’ll have an easier time coordinating and an easier time solving the problem. Or am I missing something?
Humans are exactly as smart as they have to be to build a technological civilization. First AGIs don’t need to be smarter than that to build dangerous successor AGIs, and they are already faster and more knowledgeable, so they might even get away with being less intelligent than the smartest human researchers. Unless, of course, agency lags behind intelligence, the way it lags behind encyclopedic knowledge, and there is an intelligence overhang where the first autonomously agentic systems happen to be significantly more intelligent than humans. But it’s not obvious that things go that way.
The number of diverse AGI instances might be easy to scale, as with the system message of GPT-4, where the model itself is fine-tuned not into adherence to a particular mask, but into being a mask generator that presents as whatever mask is requested. And it’s not just the diverse AGIs that need to coordinate on alignment security, but also the human users who prompt steerable AGIs. Coordinating all of that is a greater feat than building new AGIs, then as it is now. At near-human level I don’t see how that state of affairs changes, and you don’t need to get far beyond human level to build more dangerous AGIs.
It seems to me that agency does lag behind extrapolation capability. I can think of two reasons for that. First, extrapolation gets more investment. Second, agency might require a lot of training in the real world, which is slow, while extrapolation can be trained on datasets from the internet. If someone invents a way to train agency on datasets from the internet, or something like AlphaZero’s self-play, in a way that carries over to the real world, I’ll be pretty scared, but as far as I know it hasn’t happened yet.
If the above is right, then maybe the first agent AIs will be few in number, because they’ll have an incentive to stop other agent AIs from coming into existence and will be smart enough to do so, e.g. by taking over the internet or manipulating people.
Extrapolation capability is wielded by shoggoths and makes masks possible, but it’s not wielded by the masks themselves. Just as humans can’t predict next tokens given a prompt (anywhere near as well as LLMs can), neither can LLM characters: they can’t disregard the rest of the context outside the target prompt to access their “inner shoggoth”, let alone put that capability level to some more useful purpose. So agency in masks doesn’t automatically take advantage of extrapolation capability in shoggoths; merely becoming agentic doesn’t turn masks superintelligent. This creates the danger of only slightly superhuman AGIs that immediately muck up alignment security once LLM masks do get to autonomous agency (which I’m almost certain they eventually will, unless something else happens first).
It’s only shoggoths themselves waking up (learning to use situationally aware deliberation within the residual stream rather than the context window) that makes an immediate qualitative capability discontinuity more likely (for LLMs). Looking at GPT-4’s capability to solve complicated tasks without thinking out loud in tokens, I suspect that merely a slightly different SSL schedule with a sufficiently giant LLM might trigger that. Hence I’ve recently been operating under a one-year lower bound on AGI timelines (lower 25% quantile), until the literature implies a negative result for that experiment (with GPT-4-level scale being necessary, this might take a while). This outcome both reduces the chances of direct alignment and increases the chances that alignment security gets sorted.