Thanks for this compendium; I quite enjoyed reading it. It also motivated me to read the “Narrow Path” soon.
I have a bunch of reactions, comments, and questions on several passages. I focus on the ones that feel most “cruxy” to me. I formulate them without much hedging to facilitate a better discussion, though I feel quite uncertain about most of what I write.
On AI Extinction
The part on extinction from AI seems badly argued to me. Is it fair to say that you mainly want to convey a basic intuition, with the hope that the readers will find extinction an “obvious” result?
To be clear: I think that for literal god-like AI, as you describe it, an existential catastrophe is likely if we don’t solve a very hard case of alignment. For the levels below that (superintelligence, AGI), I become progressively more optimistic. Some of my hope comes from believing that humanity will eventually coordinate to not scale to god-like AI unless we have enormous assurances that alignment is solved; I think this is similar to your wish, except that you hope we stop even before AGI is built.
On AI Safety
When we zoom out from the individual to groups, up to the whole of humanity, the complexity of “finding what we want” explodes: when different cultures, different religions, different countries disagree about what they want on key questions like state interventionism, immigration, or what is moral, how can we resolve these into a fixed set of values? If there is a scientific answer to this problem, we have made little progress on it.
If we cannot find, build, and reconcile values that fit with what we want, we will lose control of the future to AI systems that ardently defend a shadow of what we actually care about.
This is a topic where I’m pretty confused, but I’ll still try to formulate a counterposition: I think we can probably align AI systems to constitutions, which would make it unnecessary to resolve all value differences. Whenever someone uses the AI, the AI needs to act in accordance with the constitution, which already has mechanisms for resolving value conflicts.
Additionally, the constitution could have mechanisms for how to change the constitution itself, so that humanity and AI could co-evolve to better values over time.
Progress on our ability to predict the consequences of our actions requires better science in every technical field.
ELK (eliciting latent knowledge) might circumvent this issue: just query an AI about its latent knowledge of the future consequences of our actions.
Process design for alignment: [...]
This section seems quite interesting to me, but somewhat different from the technical discussions of alignment I’m used to. It seems to me that this section is about problems similar to “intent alignment” or creating valid “training stories”, except that you want to define alignment as working correctly in the whole world, instead of just in individual systems. Thus, the process design should also prevent problems like “multipolar failure” that might be overlooked by other paradigms. Is this a correct characterization?
Given that this section mainly operates at the level of analogies to politics, economics, and history, I think it could benefit from stronger connections to AI itself.
Just as solving neuroscience would be insufficient to explain how a company works, even full interpretability of an LLM would be insufficient to explain most research efforts on the AI frontier.
That seems true, and it reminds me of deep deceptiveness, where an AI engages in deception without having any internal process that “looks like” deception.
The more powerful AI we have, the faster things will go. As AI systems improve and automate their own learning, AGI will be able to improve faster than our current research, and ASI will be able to improve faster than humanity can do science. The dynamics of intelligence growth means that it is possible for an ASI “about as smart as humanity” to move to “beyond all human scientific frontiers” on the order of weeks or months. While the change is most dramatic with more advanced systems, as soon as we have AGI we enter a world where things begin to move much quicker, forcing us to solve alignment much faster than in a pre-AGI world.
I agree that such a fast transition from AGI to superintelligence or god-like AI seems very dangerous. Thus, one either shouldn’t build AGI, or should somehow ensure that one has lots of time after AGI is built. Some possibilities for having lots of time:
1. Sufficient international cooperation to keep things slow.
2. A sufficient lead of the West over countries like China to have time for alignment.
Option 2 leads to a race against China, and even if we end up with a lead, it’s unclear whether it will be sufficient to solve the hard problems of alignment. It’s also unclear whether the West could already use AGI (pre-superintelligence) to gain a robust military advantage, and absent such an advantage, option 2 seems very unstable.
So a very cruxy question seems to be how feasible option 1 is. I think this compendium doesn’t do much to settle this debate, but I hope to learn more in the “Narrow Path”.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
That seems correct to me. Some people in EA claim that AI Safety is not neglected anymore, but I would say if we ever get confronted with the need to evaluate automated alignment research (possibly on a deadline), then AI Safety research might be extremely neglected.
AI Governance
The reactive framework reverses the burden of proof from how society typically regulates high-risk technologies and industries. In most areas of law, we do not wait for harm to occur before implementing safeguards.
My impression is that companies like Anthropic, DeepMind, and OpenAI talk about mechanisms that are proactive rather than reactive. E.g., responsible scaling policies define an AI Safety Level (ASL) before any model reaches it, including the evaluations for that level. Mitigations then need to be in place once the level is reached. Thus, this framework decidedly does not wait until harm has occurred.
I’m curious whether you disagree with this narrow claim (that RSP-like frameworks are proactive), or whether you just want to make the broader claim that it’s unclear how RSP-like frameworks could become widely enforced regulation.
AI is being developed extremely quickly and by many actors, and the barrier to entry is low and quickly diminishing.
I think that the barrier to entry is not diminishing: to be at the frontier requires increasingly enormous resources.
Possibly your claim is that the barrier to entry for a given level of capabilities diminishes. I agree with that, but I’m unsure whether it’s the most relevant consideration. I think for a given level of capabilities, the riskiest period is when it’s reached for the first time, since humanity then won’t have experience in mitigating the potential risks.
Paul Graham estimates training price for performance has decreased 100x in each of the last two years, or 10000x in two years.
If GPT-4’s training cost 100 million dollars, then by March 2025 a model of the same performance could be trained and released for 10k dollars. That seems quite cheap, so I’m not sure I believe the numbers.
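To make the arithmetic behind this explicit, here is a minimal Python sketch. It assumes a constant 100x price drop per year over two years and the hypothetical 100 million dollar training cost used above; the function name `projected_training_cost` is made up for illustration and does not come from the compendium.

```python
def projected_training_cost(initial_cost_usd: float,
                            yearly_reduction: float,
                            years: int) -> float:
    """Cost of reaching the same performance after `years` of compounding price drops."""
    return initial_cost_usd / (yearly_reduction ** years)


# Assumed figure from the comment above, not a confirmed number.
gpt4_cost_usd = 100_000_000

# 100x per year, compounded over two years, is a 10,000x reduction.
cost_2025 = projected_training_cost(gpt4_cost_usd, yearly_reduction=100, years=2)
print(f"{cost_2025:,.0f} USD")  # prints "10,000 USD"
```

The sketch only restates the quoted estimate; whether a 100x yearly drop actually holds is exactly what I doubt.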
The reactive framework incorrectly assumes that an AI “warning shot” will motivate coordination.
I never saw this assumption explicitly expressed. Is your view that this is an implicit assumption?
Companies like Anthropic, OpenAI, etc., seem to have facilitated quite a bit of discussion with the USG even without warning shots.
But history shows that it is exactly in such moments that these thresholds are most contested – this shifting of the goalposts is known as the AI Effect and common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached.
I would have found this paragraph convincing before ChatGPT. But now, with efforts like the USG national security memorandum, it seems like AI capabilities are being taken almost adequately seriously.
we’ve already seen competitors fight tooth and nail to keep building.
OpenAI thought that their models were considered high-risk in the EU AI Act. I think arguing that this is inconsistent with OpenAI’s commitment to regulation would require looking at what the EU AI Act actually said. I haven’t engaged with it myself, but e.g. Zvi doesn’t seem to be impressed.
The AI Race
Anthropic released Claude, which they proudly (and correctly) describe as a state-of-the-art pushing model, contradicting their own Core Views on AI Safety, claiming “We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress.”
The full quote in Anthropic’s article is:
“We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publication). We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments. We’ve subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller.”
This added context sounds quite different and seems to make clear that by “publish”, Anthropic means publishing the methods used to reach these capabilities. Additionally, I agree with Anthropic that releasing models now is less of a race-driver than it would have been in 2022, and so the current decisions seem more reasonable.
These policy proposals lack a roadmap for government enforcement, making them merely hypothetical mandates. Even worse, they add provisions to allow the companies to amend their own framework as they see fit, rather than codifying a resilient system. See Anthropic’s Responsible Scaling Policy: [...]
I agree that it is bad that there is no roadmap for government enforcement. But without such enforcement, and assuming Anthropic is reasonable, I think it makes sense for them to change their RSP in response to new evidence about what works. After all, we want the version that will eventually be encoded in law to be as sensible as possible.
Mechanistic interpretability, which tries to reverse-engineer AIs to understand how they work, which can then be used to advance and race even faster. [...] Scalable oversight, which is another term for whack-a-mole approaches where the current issues are incrementally “fixed” by training them away. This incentivizes obscuring issues rather than resolving them. This approach instead helps Anthropic build chatbots, providing a steady revenue stream.
This does not seem well argued. It’s unclear how mechanistic interpretability would be used to advance the race further (unless you mean that it leads to safety-washing that wins more government and public trust?). Also, scalable oversight is such a broad collection of strategies that I don’t think it’s fair to call them whack-a-mole strategies. E.g., I’d say many of the 11 proposals fall under this umbrella.
On the RSP point, I think Anthropic also deserves some credit for communicating changes to its RSP and what it has learned.
I’d be happy for any reactions to my comments!