What’s the most important technical question in AI safety right now?
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
How to evaluate whether models should be trusted or untrusted: currently I don’t have a good answer and this is bottlenecking the efforts to write concrete control proposals.
How AI control should interact with AI security tools inside labs.
More generally:
How can we get more evidence on whether scheming is plausible?
How scary is underelicitation? How much should the results about password-locked models or arguments about being able to generate small numbers of high-quality labels or demonstrations affect this?
“How can we get more evidence on whether scheming is plausible?”—What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
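A minimal sketch of what the "minimum pressure" sweep could look like (everything here is hypothetical: the names, the response curve, and the stubbed fine-tune-and-evaluate step stand in for a real training and probing pipeline):

```python
from dataclasses import dataclass

@dataclass
class PressureResult:
    pressure: float       # fraction of fine-tuning data nudging toward scheming
    scheming_rate: float  # fraction of held-out probes on which the model schemes

def finetune_and_eval(pressure: float) -> PressureResult:
    """Stub standing in for: fine-tune a model on a mixture containing
    `pressure` fraction of scheming-incentivizing examples, then measure
    scheming behavior on held-out honeypot probes."""
    # Toy response curve, purely for illustration of the analysis shape.
    rate = max(0.0, min(1.0, 2.0 * (pressure - 0.1)))
    return PressureResult(pressure, rate)

def minimum_scheming_pressure(threshold: float = 0.05, steps: int = 20) -> float:
    """Sweep pressure levels from 0 to 1 and return the smallest one whose
    measured scheming rate exceeds `threshold`."""
    for i in range(steps + 1):
        p = i / steps
        if finetune_and_eval(p).scheming_rate > threshold:
            return p
    return float("inf")  # no tested pressure level induced scheming

print(minimum_scheming_pressure())
```

Rerunning the same sweep at several model scales would then give the scaling interaction the comment asks about.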
I’d say one important question is whether the AI control strategy works out as its proponents hope.
I agree with Bogdan that making adequate safety cases for automated safety research is probably one of the most important technical problems to solve. Conditional on the automating-AI-safety direction working out, it could eclipse basically all safety research done prior to automation, and this might hold even if LWers had basically perfect epistemics (given what’s possible for humans) and picked close-to-optimal directions, since labor is a huge bottleneck and automation allows much tighter feedback loops of progress, for the reasons Tamay Besiroglu identified:
https://x.com/tamaybes/status/1851743632161935824
https://x.com/tamaybes/status/1848457491736133744
Here are some candidates:
1. Are we indeed (as I suspect) in a massive overhang of compute and data for powerful agentic AGI? (If so, then at any moment someone could stumble across an algorithmic improvement which would change everything overnight.)
2. Current frontier models seem much more powerful than mouse brains, yet mice seem conscious. This implies that either LLMs are already conscious, or could easily be made so with non-costly tweaks to their algorithm. How could we objectively tell if an AI were conscious?
3. Over the past year I’ve helped make both safe evals of danger-adjacent capabilities (e.g. WMDP.ai) and unpublished, infohazardous evals of actually-dangerous capabilities. One of the most common pieces of negative feedback I’ve heard on the safe evals is that they are only danger-adjacent, not measuring truly dangerous things. How could we safely show the correlation between high performance on danger-adjacent evals and high performance on actually-dangerous evals?
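As a sketch of the analysis the third candidate calls for (the per-model scores below are invented; only the shape of the analysis matters), one could compute a rank correlation between each model's score on the public danger-adjacent suite and its score on the held-back dangerous-capability suite, using nothing beyond the standard library:

```python
def ranks(xs):
    """1-based average ranks, with ties assigned their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented per-model scores: danger-adjacent eval vs. dangerous eval.
safe_scores = [0.31, 0.45, 0.52, 0.61, 0.74, 0.88]
dangerous_scores = [0.12, 0.20, 0.19, 0.35, 0.41, 0.66]

print(round(spearman(safe_scores, dangerous_scores), 3))
```

A high rank correlation across many models would be (partial) evidence that the safe evals track the dangerous capabilities, without the dangerous scores themselves ever being published.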
Why is this relevant for technical AI alignment (coming at this as someone skeptical about how relevant timeline considerations are more generally)?
If tomorrow anyone in the world could cheaply and easily create an AGI which could act as a coherent agent on their behalf, based on an architecture different from a standard transformer… it seems like this would change a lot of people’s priorities about which questions were most urgent to answer.
Fwiw, I basically think you are right about the agentic AI overhang, and obviously so. I do think it shapes how one thinks about what’s most valuable in AI alignment.
I kind of wish you both had given some reasoning as to why you believe the agentic-AI/algorithmic overhang is likely, and I also wish Nathan Helm-Burger and Vladimir Nesov would discuss this topic in a dialogue post.
Glib formality: current LLMs approximate something like a speed-prior Solomonoff inductor for internet data, but they do not approximate AIXI.
There is a whole class of domains that is not tractably accessible from next-token prediction on human-generated data. For instance, learning how to beat AlphaGo with only access to pre-2014 human Go games.
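For readers unfamiliar with the objects being contrasted, they can be written schematically (these are the standard textbook forms from Solomonoff and Hutter, not anything specific to this thread). The Solomonoff prior weights programs only by length, while AIXI wraps an expectimax planning loop around that mixture:

```latex
% Solomonoff prior over sequences x (U a universal monotone machine,
% \ell(p) the length of program p):
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

% AIXI's action choice at step k, planning to horizon m, over
% observation-reward pairs o_i r_i:
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl[ r_k + \cdots + r_m \bigr]
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

A speed prior additionally discounts each program by its runtime; the glib claim above is that pretraining resembles the first (runtime-bounded) object, while the active planning in the second is exactly what next-token prediction lacks.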
IMO, AlphaGo’s success was orthogonal to AIXI, and more importantly, I expect AIXI to be very hard to approximate even as an approximable ideal, so what’s the use case for thinking future AIs will be AIXI-like?
I will also say that while I don’t think pure LLMs will just be scaled forward (there’s a use for inference-time compute scaling), conditional on AGI and ASI being achieved, I think the strategy will look more like using lots and lots of synthetic data to compensate for compute. Solomonoff induction has a halting oracle and unbounded compute, and can infer a lot from the minimum possible data, whereas we will rely on a data-rich, compute-poor strategy compared to approximate AIXI.
The important thing is that both do active learning, decision-making & search, i.e. RL.*
LLMs don’t do that. So the gain from doing that is huge.
Synthetic data is a bit of a weird term that gets thrown around a lot. There are fundamental limits on how much information resampling from the same data source will yield about completely different domains, so that seems a bit silly. Of course, sometimes with synthetic data people just mean doing rollouts, i.e. RL.
*The word RL sometimes gets mistaken for only a very specific reinforcement learning algorithm. I mean here a very general class of algorithms that solve MDPs.
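To make "a very general class of algorithms that solve MDPs" concrete, here is a minimal member of that class: value iteration on a made-up two-state MDP (the transition table and rewards are purely illustrative):

```python
# states: 0, 1; actions: 0 ("stay"), 1 ("go")
# P[s][a] = list of (probability, next_state, reward) transitions
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Q-value of each action: expected reward plus discounted future value.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = value_iteration(P)
```

Here the optimal policy is to "go" from state 0 and "stay" in state 1, so V[1] converges to 2/(1 − 0.9) = 20; Q-learning, policy iteration, or MCTS-style search would recover the same fixed point, which is the sense in which they form one class.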
The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on.
What features of architecture of contemporary AI models will occur in future models that pose an existential risk?
What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?
Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?
Does terminology adopted by AI Safety researchers like “scheming”, “inner alignment” or “agent” carve nature at the joints?
Something like a safety case for automated safety research (but I’m biased)
Answering this from a legal perspective: What is the easiest and most practical way to translate legalese into scientifically accurate terms, thus bridging the gap between AI experts and lawyers? Stated differently, how do we move from localised papers that only work in law or AI fields respectively, to papers that work in both?
Are Eliezer and Nate right that continuing the AI program will almost certainly lead to extinction or something approximately as disastrous as extinction?