Thanks for the very in-depth case you’re making! I especially liked the parts about the objections, and your take on some AI Alignment researchers’ opinions of this proposal.
Personally, I’m enthusiastic about it, with caveats expanded below. If I interpret your proposal along the lines of my recent epistemological framing of AI Alignment research, you’re pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build, and what do we want). My caveats can be summarized by the point I make in that post: as long as we’re not reasonably sure that the terms of the problem are well-defined, we cannot turn the whole field into this Solving part.
As a quick summary of what I get into in my detailed feedback, I think more work on this kind of problem will be net positive and very useful if:
we are able to get reasonably good guarantees that doing a specific experiment doesn’t present too big of a risk;
this kind of work stays in conversation with what you call conceptual work;
this kind of work doesn’t replace other kinds of AI Alignment research completely.
Also, I think a good example of a running research project doing something similar is Tournesol. I have a post explaining what it is, but the idea boils down to building a database of expert feedback on YouTube videos along multiple axes, and leveraging it to train a more aligned recommendation algorithm for YouTube. One difference is that their approach probably does make the model more competent (it doesn’t start from an already-trained model like GPT-3); still, the similarities are numerous enough that you might find it interesting.
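To make the general shape of that idea concrete, here is a minimal sketch of training a scoring model from multi-axis expert ratings and using it to re-rank recommendations. This is not Tournesol’s actual code, data schema, or aggregation method; the axis names, ratings, and overall scores below are purely illustrative assumptions.

```python
# Minimal sketch (illustrative only, not Tournesol's implementation):
# fit a simple model mapping multi-axis expert ratings to an overall score,
# then re-rank candidate videos by that score instead of raw engagement.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical expert feedback: each row rates one video on several axes,
# each axis in [0, 1].
axes = ["reliability", "importance", "entertainment"]
expert_ratings = np.array([
    [0.9, 0.8, 0.3],   # well-sourced documentary
    [0.2, 0.1, 0.9],   # clickbait
    [0.7, 0.9, 0.5],   # in-depth explainer
])

# Hypothetical overall scores aggregated from expert comparisons;
# how to do this aggregation well is itself a research question.
overall_score = np.array([0.85, 0.25, 0.75])

# Fit a simple linear model from axis ratings to the overall score.
model = Ridge(alpha=1.0).fit(expert_ratings, overall_score)

# Re-rank unseen candidate videos by the learned score.
candidates = np.array([
    [0.6, 0.7, 0.4],
    [0.1, 0.2, 0.95],
])
ranking = np.argsort(-model.predict(candidates))
print("Recommended order:", ranking)
```

The point of the sketch is only that the “alignment” signal comes from human expert judgment on explicit axes rather than from engagement metrics; everything else (model class, aggregation, data) is a placeholder.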
In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.
I agree with the general idea that getting more experimental work would be immensely valuable, but I’m worried about the comparison with traditional engineering. AI Alignment cannot simply follow the engineering wisdom of prototyping things willy-nilly, because every experiment could blow up in our faces. It seems closer to nuclear engineering, which, as far as I know, required preliminary work on and understanding of nuclear physics.
To summarize, I’m for finding constrained and safe ways to gather more experimental understanding, but pushing for more experiments without heeding the risks seems like one of the worst things we could do.
Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.
Is it the correct counterfactual, though? You seem to compare your proposed approach with a situation where no AI Alignment research is done. That hardly seems fair or representative of a plausible counterfactual.
Aligning narrowly superhuman models today could help build up tools, infrastructure, best practices, and tricks of the trade. I expect most of this will eventually be developed anyway, but speeding it up and improving its quality could still be quite valuable, especially in short timelines worlds where there’s a lot less time for things to take their natural course.
Well, it depends on whether it’s easier to get from the conceptual details to the implementation details, or the other way around. My guess would be the former, which means that working on implementation details before knowing what we want to implement is at best a really unproductive use of research time (even more so with short timelines), and at worst a waste of time. I’m curious whether you have arguments for the opposite take.
Note that I’m responding specifically to this argument. I still think that experimental work can be tremendously valuable for solving the conceptual issues.
All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition.
Agreed. That being said, pushing in this direction might also place us in a worse situation, for example by putting a lot of pressure on AIs to build human models, which would make deception and manipulation significantly more accessible and worthwhile. I don’t really know how to think about this risk, but I would certainly want follow-up discussions on it.
More broadly, “doing empirical science on the alignment problem”—i.e. systematically studying what the main problem(s) are, how hard they are, what approaches are viable and how they scale, etc—could help us discover a number of different avenues for reducing long-run AI x-risk that we aren’t currently thinking of, one-shot technical solutions or otherwise.
Yes, yes and yes. Subject to preliminary thinking about the risks involved in such experimental research, that’s definitely a reason to push more for this kind of work.
Compared to conceptual research, I’d guess aligning narrowly superhuman models will feel meatier and more tractable to a number of people. It also seems like it would be easier for funders and peers to evaluate whether particular papers constitute progress, which would probably help create a healthier and more focused field where people are broadly more on the same page and junior researchers can get stronger mentorship. Related to both of these, I think it provides an easier opportunity for people who care about long-run x-risk to produce results that are persuasive and impressive to the broader ML community, as I mentioned above.
You present this as a positive, but I instead see a pretty big issue here. Because of everything you point out, most incentives will push towards doing only this kind of research: more prestige, a better chance at a job, recognition by a bigger community. All of which is definitely good from a personal standpoint. But it means both that newcomers will flock to the experimental type of work, and that such experiments will bear less and less relation to actually aligning AI (and more and more to the specific kinds of problems for which we can find experimental solutions without the weird conceptual work).
In particular, you say that the field will be healthier because “people are more broadly on the same page”. That, for me, falls into the trap of believing that a paradigm is necessarily the right way to structure a field of research trying to solve a problem. As I argue here, a paradigm in this case basically means that you think you have circumscribed the problem well enough to stop questioning it and to work on it single-mindedly. We’re amazingly far from that point in AI Alignment, so this looks really dangerous, especially because shorter timelines won’t allow more than one or two such paradigms to unfold.
When it’s possible to demonstrate an issue at scale, I think that’s usually a pretty clear win.
Agreed, with the caveat I’ve been repeating about checking for risks due to scale.
I think we have a shot at eventually supplying a lot of people to work on it too. In the long run, I think more EAs could be in a position to contribute to this type of work than to either conceptual research or mainstream ML safety.
This looks about right, although I wonder whether it would be dangerous to have a lot of people working on the topic who don’t understand the conceptual risks and/or the underlying ML technology. In other words, I’m not sure that having people without the conceptual or ML skills work on this kind of project is safe.