If we are going to use the term burden of proof, I would suggest that it falls on the people who claim they could make potentially very dangerous systems safe using any (combination of) techniques.
Let’s also stay mindful that these claims are not being made in a vacuum. Incremental progress on making these models usable (which is what a lot of applied ML safety and alignment research comes down to) does enable AI corporations to keep scaling.
I think your first sentence is actually compatible with my view. If GPT-7 is very dangerous and OpenAI claims they can use some specific set of safety techniques to make it safe, I agree that the burden of proof is on them. But I also think the history of technology should make you expect on priors that the kind of safety research intended to solve actual safety problems (rather than safetywash) is net positive.
I don’t think it’s worth getting into why, but briefly it seems like the problems studied by many researchers are easier versions of problems that would make a big dent in alignment. For example, Evan wants to ultimately get to level 7 interpretability, which is just a harder version of levels 1-5.
I have not really thought much about the other consideration, that making models more usable enables more scaling (as distinct from the argument that understanding gained from interpretability is useful for capabilities), but it mostly seems confined to specific work done by labs that is pointed at usability rather than safety. Maybe you could randomly pick two MATS writeups from 2024 and argue that the usability impact makes them net harmful.
Appreciating your thoughtful comment.

It’s hard to pin down how much alignment “techniques” make models more “usable”, and how much that in turn enables more “scaling”. This ambiguity, together with the safety-washing concern, gets us into messy considerations. I do generally agree, though, that participants of MATS or AISC programs can cause much less harm through either route than researchers working directly on aligning eg. OpenAI’s models for release.
Our crux, though, is about the extent of progress that can be made on engineering fully autonomous machinery to control* its own effects in line with continued human safety. I agree with you that such a system can be engineered to start off performing more** of the tasks we want it to complete (ie. progress on alignment is possible). At the same time, there are fundamental limits to controllability (ie. progress on alignment is capped).
This is where I think we need more discussion:
Is the extent of control over AGI that is possible at least as great as the extent of control that is needed (to prevent eventual convergence on causing human extinction)?
* I use the term “control” in the established control theory sense, consistent with Yampolskiy’s definition. Just to avoid confusing people, as the term gets used in more specialised ways in the alignment community (eg. in conversations about the shut-down problem or the control agenda).

** This is a rough way of stating it. It’s also about the machinery performing fewer of the tasks we wouldn’t want the system to complete. And the relevant measure is not so much the number of preferred tasks performed as whether the consequences are preferred. Finally, this raises a question about who the ‘we’ is who can express preferences that the system is to act in line with, and whether coherent alignment with different persons’ preferences, expressed from within different perceived contexts, is even a sound concept.
Generally the way that people solve hard problems is to solve related easy problems first, and this is true even if the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.
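To make the analogy concrete, here is a minimal sketch of the kind of controller being referenced: a discrete PID loop steering a toy one-dimensional plant toward a setpoint. The gains, plant dynamics, and numbers are illustrative assumptions rather than a model of any real vehicle; the point is just that classic failure modes (derivative kick, overshoot, oscillation) can be observed on simple systems like this long before anyone attempts a barge landing.

```python
# A toy sketch (all gains and dynamics are illustrative assumptions):
# a discrete PID controller driving a one-dimensional "altitude" plant
# toward a setpoint, so that classic failure modes can be observed cheaply.

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        # Differencing the error causes a "derivative kick" on the first step,
        # one of the failure modes engineers learned about on easy problems.
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=200, dt=0.1):
    # Toy double-integrator plant: thrust changes velocity, velocity changes
    # altitude, with a constant "gravity" term of 1.0 pulling downward.
    pid = PID(kp=2.0, ki=0.1, kd=1.5, dt=dt)
    altitude, velocity = 0.0, 0.0
    for _ in range(steps):
        thrust = pid.update(setpoint=10.0, measurement=altitude)
        velocity += (thrust - 1.0) * dt
        altitude += velocity * dt
    return altitude


if __name__ == "__main__":
    # Should end up near the setpoint of 10.0 for these toy gains.
    print(f"final altitude: {simulate():.2f}")
```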
This analogy raises questions about the reference class.
Does controlling a self-learning (and evolving) system fit in the same reference class as the problems that engineers have “generally” been able to solve (such as moving rockets)?
Is the notion of “powerful” technologies in the sense of eg. rockets being powerful the same notion as “powerful” in the sense of fully autonomous learning being powerful?
Based on this, can we rely on the reference class of past “powerful” technologies as an indicator of being able to make incremental progress on making and keeping “AGI” safe?