That is indeed a lot of points. Let me try to parse them and respond, because I think this discussion is critically important.
Point 1: overhang.
Your first two paragraphs seem to be pointing to downsides of progress, and saying that it would be better if nobody made that progress. I agree. We don’t have guaranteed methods of alignment, and I think our odds of survival would be much better if everyone went way slower on developing AGI.
The standard thinking, which could use more inspection, but which I agree with, is that this is simply not going to happen. Individuals who decide to step aside slow progress only slightly. This leaves a compute overhang that someone else is going to take advantage of, with nearly the same competence, and only slightly slower. Those individuals who pick up the banner and create AGI will not be infinitely reckless, but the faster progress from that overhang will make whatever level of caution they have less effective.
This is a separate argument from regulation. Adequate regulation will slow progress universally, rather than leaving it up to the wisdom and conscience of every individual who might decide to develop AGI.
I don’t think it’s impossible to slow and meter progress so that overhang isn’t an issue. But I think it is effectively even harder than alignment. We have decent suggestions on the table for alignment now, and as far as I know, no equally promising suggestions for getting everyone (and it does take almost everyone coordinating) to pass up the immense opportunities offered by capabilities overhangs.
Point 2: Are LLMs safer than other approaches?
I agree that this is a questionable proposition. I think it’s worth questioning. Aiming progress at easier-to-align approaches seems highly worthwhile.
I agree that an LLM may have something like a mind inside. I think current versions are almost certainly too dumb to be existentially dangerous (at least directly; if a Facebook algorithm can nearly cause an insurrection, who knows what dangerous side effects any AI can have).
I’m less worried about GPT-10 playing a superintelligent, Waluigi-collapsed villain than I am about a GPT-6 that has been amplified to agency, situational awareness, and weak superintelligence by scaffolding it into something like a cognitive architecture. I think this type of advance is inevitable. ChatGPT extensions and Bing Chat both use internal prompting to boost intelligence, and approaches like SmartGPT and Tree of Thoughts massively improve benchmark results over the base LLM.
Fortunately, this direction also has huge advantages for alignment. It has a very low alignment tax, since you give the system its goals in natural language, like “support human empowerment” or whatever the SOTA alignment goal is. And it has vastly better interpretability, since it is at least summarizing its thoughts in natural language.
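To make that concrete, here is a minimal sketch of what I mean. `call_llm()` is just a placeholder for whatever completion API the scaffold uses, and the goal text is purely illustrative:

```python
# Minimal sketch, assuming a placeholder call_llm() rather than any particular API.

ALIGNMENT_GOAL = "Above all, support human empowerment and defer to human oversight."

def aligned_step(task: str, call_llm) -> str:
    """One reasoning step: the 'alignment tax' is a single line of plain English,
    and the model is asked to state its reasoning, which is what we would inspect."""
    prompt = (
        f"{ALIGNMENT_GOAL}\n\n"
        f"Task: {task}\n\n"
        "Explain your reasoning step by step, then give your answer."
    )
    return call_llm(prompt)
```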
Here’s where your skepticism that they’re being honest about summarizing those thoughts comes into full force. I agree that it’s not reliable; for instance, changing the intermediate answer in chain-of-thought prompting often doesn’t change the final output, indicating that the stated reasoning was for show.
However, a safer setup is to never use the same model instance twice. When you use chain-of-thought reasoning, construct a new context with the relevant information from memory; don’t just let the context window accrue, since an accruing context allows fake chains of thought and the collapse of the simulator into a Waluigi.
Scaffolding should not turn the LLM into a single persistent agent, but rather create a committee of LLM calls, each invoked on the individual questions needed to accomplish the committee’s goals.
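A rough sketch of that pattern, again with `call_llm()` as a placeholder and the memory retrieval heavily simplified:

```python
# Rough sketch of the committee pattern: each sub-question gets a fresh context
# built from an explicit memory store, rather than one ever-growing chat history.
# call_llm() is a placeholder; real retrieval would use search or embeddings.

from dataclasses import dataclass, field

GOAL = "Above all, support human empowerment and defer to human oversight."

@dataclass
class Memory:
    facts: list[str] = field(default_factory=list)

    def relevant_to(self, question: str) -> list[str]:
        # Placeholder retrieval: just return the most recent facts.
        return self.facts[-5:]

def ask_committee_member(question: str, memory: Memory, call_llm) -> str:
    """One fresh call per question: no accumulated chain of thought to fake,
    and no persistent persona to collapse into a Waluigi."""
    context = "\n".join(memory.relevant_to(question))
    prompt = (
        f"{GOAL}\n\n"
        f"Known facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer concisely."
    )
    answer = call_llm(prompt)
    memory.facts.append(f"Q: {question} -> A: {answer}")
    return answer
```

The details don’t matter; the point is that the scaffold, not the model, holds the goals and the memory, so both stay legible.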
This isn’t remotely a solution to the alignment problem, but it really seems to have massive upsides, and only the same downsides as other practically viable approaches to AGI.
To be clear, I only see some form of RL agents as the other practical possibility, and I like our odds much less with those.
I think there are other, even more readily alignable approaches to AGI. But they all seem wildly impractical. I think we need to get ready to align the AGI we get, rather than just preparing to say I-told-you-so after the world refuses to forgo massive incentives and take a much slower but safer route to AGI.
To paraphrase, we need to go to the alignment war with the AGI we get, not the AGI we want.