The burden of proof is on you to show that current safety research is not incremental progress towards safety research that matters for superintelligent AI. Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.
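To make the analogy concrete, here is a minimal, purely illustrative sketch: a textbook PID loop driving a toy first-order plant, where a badly chosen gain produces the kind of failure mode (overshoot and ringing) that you can observe cheaply long before attempting anything as hard as a barge landing. The plant model and gain values are arbitrary choices for illustration, not anything from the rocketry literature.

```python
# Illustrative sketch only: a textbook PID controller driving a toy first-order
# plant, dx/dt = -x + u. The point is that on an easy problem like this,
# failure modes such as overshoot and ringing from a badly chosen gain are
# cheap to produce and observe.

def simulate_pid(kp, ki, kd, setpoint=1.0, dt=0.01, steps=3000):
    """Run a PID loop against the toy plant and return the state trajectory."""
    x, integral, prev_error = 0.0, 0.0, setpoint
    trajectory = []
    for _ in range(steps):
        error = setpoint - x
        integral += error * dt
        derivative = (error - prev_error) / dt
        u = kp * error + ki * integral + kd * derivative  # PID control law
        x += (-x + u) * dt  # forward-Euler step of the plant dynamics
        prev_error = error
        trajectory.append(x)
    return trajectory

# A reasonably damped tuning settles near the setpoint without drama; cranking
# up the integral gain makes the same loop overshoot and ring -- a failure mode
# observed here for the cost of a few milliseconds of simulation.
for label, gains in [("well damped", dict(kp=2.0, ki=1.0, kd=0.5)),
                     ("integral gain too high", dict(kp=1.0, ki=25.0, kd=0.0))]:
    trajectory = simulate_pid(**gains)
    print(f"{label}: peak = {max(trajectory):.2f}, final = {trajectory[-1]:.2f}")
```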
Also, the directions suggested in section 5 of the paper you linked seem to fall well within the bounds of normal AI safety research.
Edit: Two people reacted to taboo “burden of proof”. I mean that the claim is contrary to the reference classes I can think of, and to argue for it there needs to be some argument why it is true in this case. It is also possible that the safety effect is significant but outweighed by the speedup effect, but that should also be clearly stated if it is what OP believes.
If we are going to use the term burden of proof, I would suggest the burden of proof is on the people who claim that they could make potentially very dangerous systems safe using any (combination of) techniques.
Let’s also stay mindful that these claims are not being made in a vacuum. Incremental progress on making these models usable for users (which is what a lot of applied ML safety and alignment research comes down to) does enable AI corporations to keep scaling.
I think your first sentence is actually compatible with my view. If GPT-7 is very dangerous and OpenAI claims they can use some specific set of safety techniques to make it safe, I agree that the burden of proof is on them. But I also think the history of technology should make you expect on priors that the kind of safety research intended to solve actual safety problems (rather than to safety-wash) is net positive.
I don’t think it’s worth getting into why, but briefly it seems like the problems studied by many researchers are easier versions of problems that would make a big dent in alignment. For example, Evan wants to ultimately get to level 7 interpretability, which is just a harder version of levels 1-5.
I have not really thought about the other side (making models more usable enables more scaling, as distinct from the argument that understanding gained from interpretability is useful for capabilities), but it mostly seems confined to specific work done by labs that is pointed at usability rather than safety. Maybe you could randomly pick two MATS writeups from 2024 and argue that the usability impact makes them net harmful.
Appreciating your thoughtful comment.
It’s hard to pin down the ambiguity around how much alignment “techniques” make models more “usable”, and how much that in turn enables more “scaling”. This and the safety-washing concern get us into messy considerations. Though I generally agree that participants of MATS or AISC programs can cause much less harm through either than researchers working directly on aligning eg. OpenAI’s models for release.
Our crux, though, is about the extent of progress that can be made on engineering fully autonomous machinery to control* its own effects in line with continued human safety. I agree with you that such a system can be engineered to start off performing more** of the tasks we want it to complete (ie. progress on alignment is possible). At the same time, there are fundamental limits to controllability (ie. progress on alignment is capped).
This is where I think we need more discussion:
Is the extent of AGI control that is possible at least as great as the extent of control that is needed (to prevent eventual convergence on causing human extinction)?
* I use the term “control” in the established control theory sense, consistent with Yampolskiy’s definition. Just to avoid confusing people, as the term gets used in more specialised ways in the alignment community (eg. in conversations about the shut-down problem or the control agenda).
** This is a rough way of stating it. It’s also about the machinery performing fewer of the tasks we wouldn’t want the system to complete. And the relevant measure is not so much the number of preferred tasks performed as the preferred consequences. Finally, this raises a question about who the ‘we’ is whose preferences the system is to act in line with, and whether coherent alignment with different persons’ preferences, expressed from within different perceived contexts, is even a sound concept.
Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.
This raises questions about the reference class.
Does controlling a self-learning (and evolving) system fit in the same reference class as the problems that engineers have “generally” been able to solve (such as moving rockets)?
Is the notion of “powerful” technologies in the sense of eg. rockets being powerful the same notion as “powerful” in the sense of fully autonomous learning being powerful?
Based on this, can we rely on the reference class of past “powerful” technologies as an indicator of being able to make incremental progress on making and keeping “AGI” safe?
I think the safety research logically needs to do more than make incremental progress toward alignment (the implied claim in your burden-of-proof framing). It needs to speed alignment toward its finish line (working alignment for the AGI we actually build) more than it speeds capabilities toward the finish line of building takeover-capable AGI.
I agree with you that in general, research tends to make progress toward its stated goals.
But isn’t it a little odd that nobody I know of has a specific story for how we get from tuning and interpretability of LLMs to functionally safe AGI and ASI? I do have such a story, but the tuning and interpretability play only a minor role despite making up the vast bulk of “safety research”.
Research usually just goes in a general direction, and gets unexpected benefits as well as eventually accomplishing some of its stated goals. But having a more specific roadmap seems wise when some of those “unexpected benefits” might kill everyone.
That’s not to say I think we should shut down safety research; I just think we should have a bit more of a plan for how it accomplishes the stated goals. I’m afraid we’ve gotten a bit distracted from AGI x-risk by making LLMs safe—when nobody ever thought LLMs by themselves are likely to be very dangerous.
Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful.
Sure, but if you want to do this kind of research, you should do it in such a way that it does not end up making the situation worse by helping “the AI project” (the irresponsible AI labs burning the time we have left before AI kills us all). That basically means keeping your research results secret from the AI project, and merely refraining from publishing your results is insufficient IMHO because employees in the private sector are free to leave your safety lab and go work for an irresponsible lab. It would be quite helpful here if an organization doing the kind of safety research you want had the same level of control over its employees that secret military projects currently have: namely, the ability to credibly threaten its employees with decades in jail if they bring the secrets they learned in its employ to organizations that should not have those secrets.
The way things are now, the main effect of the AI safety project is to give unintentional help to the AI project IMHO.