Thanks so much for writing this! I think it’s a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.
[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn’t been reviewed by others at CAIS, but I’m pretty confident it’s accurate.]
For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I’d probably change:
First, our outreach is not just to academics, but also to people in industry. We usually use the term “ML community” rather than “academia” for this reason.
Second, the technical research side of the organization is about a lot more than robustness. We do research on Trojans, as you mention (which isn't robustness), but also on machine ethics, cooperative AI, anomaly detection, forecasting, and probably more areas soon. We are interested in most of the areas in Open Problems in AI X-Risk, but the extent to which we're actively working on them varies.
I also think it might be good to add our newly-announced (so maybe after you wrote the post) Philosophy Fellowship, which focuses on recruiting philosophers to study foundational conceptual problems in AI risk. This might correct a misconception that CAIS isn’t interested in conceptual research; we very much are, but of a different flavor than some others, which I would broadly characterize as “more like philosophy, less like math”.
Also, there is no way you would have known about this since we've never said it publicly anywhere, but we also intend to build out compute and research engineering infrastructure specifically for academics, who often don't have funding for compute and, even when they do, lack the support necessary to leverage it. Building out a centralized way for safety academics to access compute and engineering support would create economies of scale (especially for the compute contracts and compute infrastructure). However, these plans are in early stages.
Another fieldbuilding effort maybe worth mentioning is ML Safety Scholars.
In general, here is how I personally describe the theory of change for CAIS. This hasn’t been reviewed by anyone, and I don’t know how much Dan personally likes it, but it’s how I think of it. It’s also not very polished, sorry. Anyway, to me there are three major forms of research:
Philosophizing. Many AI safety problems are still largely undefined. We need people to think about the properties of possible systems at a high level and tease out relevant considerations and possible solutions. This is exactly what philosophers do, and it's why we are interested in the program above. Without this kind of conceptual research, it's very difficult to figure out concrete problems to work on.
Concretization. It does us no good if the ideas generated in philosophizing are never concretized. Part of this is because no amount of thinking can substitute for real experimentation and implementation. Part of this is because it won’t be long before we really need progress: we can’t afford to just philosophize. Concretization involves taking the high level ideas and implementing something that usefully situates them in empirical systems. Benchmarks are an example of this.
Iterative improvements. Once an idea is concretized, the initial concretization is likely not optimal. We need people to make tweaks and make the initial methods better at achieving their aims, according to the concretized ideas. Most papers produced by the broader ML community are iterative improvements.
CAIS intends to be the glue that integrates all three of these areas. Through our philosophy fellowship program, we will train philosophers to do useful conceptual research while working in close proximity with ML researchers. Most of our ML research focuses on building foundational methods and benchmarks that can take fuzzy problems and concretize them. Lastly, we see our fieldbuilding effort as very much driving iterative improvements: who better to make iterative improvements on well-defined safety problems than the ML community? They have shown themselves to be quite good at this when it comes to general capabilities.
For a more in depth look at our research theory of impact, I suggest Pragmatic AI Safety.
Edit: I realized your post made me actually write things up that I hadn’t before, because I thought it would likely be more accurate than the (great for an outsider!) description that you had written. This strikes me as a very positive outcome of this post, and I hope others who feel their descriptions miss something will do the same!
My worry is that academics will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of aligning AGIs. This might be really helpful though.
(My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers, to some extent) to incentivize/nudge research in a useful direction rather than a non-useful one, and the job of fieldbuilding to shift researchers towards more explicitly considering x-risk. But, as you say, competition can be a valuable force; if you can set the incentives right, it might not be necessary for all researchers to care about x-risk. If you can give them a fun problem to solve, make sure it's actually relevant, and ensure they are rewarded only for actually relevant work, then good research could still be produced. Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly “accidentally”, i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change).
Also, iterative progress involves making progress that works now but might not in the future. That’s ok, as long as some of it does in fact work in the future.
If you can give them a fun problem to solve, make sure it's actually relevant, and ensure they are rewarded only for actually relevant work, then good research could still be produced.
Yeah, I think the difficulty of setting this up correctly is the main crux. I'm quite uncertain about this, but I'll give the argument that my model of John Wentworth makes against it:
The Trojan detection competition does seem roughly analogous to the deception problem, and if you can find Trojans really well, it's plausible that you can find deceptive alignment. However, what we really need is a way to exert optimization pressure away from deceptive regions of parameter space, and right now, as far as I know, we have no idea how strongly deception is favored.
I can imagine using methods from this competition to put a small amount of pressure away from deception, e.g. by restarting whenever you see deception, or by running SGD against the interpreted deception. But this feels sketchy because 1) you are putting optimization pressure on these tools, and you might just steer into regions of parameter space where they fail, and 2) you are training a model until it becomes deceptive: eventually, a smart deceptive model will be actively trying to beat these tools.
So what I really want is an understanding of the generators of deceptive alignment, which could take the form of a formal version of the argument given here, so that I can prevent entering the deceptive regions of parameter space in the first place.
Relevant research has been produced by the ML community before by people who weren’t explicitly thinking about x-risk (mostly “accidentally”, i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change).
Could you link an example? I am curious what you have in mind. I’m guessing something like the ROME paper?
Thoughts on John’s comment: this is a problem with any method for detecting deception that isn’t 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.
Also, you can somewhat get around this by holding some deception-detecting methods out (i.e. not optimizing against them). When you finish training and the held-out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than to fool your held-out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction.
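To make that concrete, here is a minimal sketch (in Python) of what such an outer loop could look like. Everything in it is hypothetical: `train_model`, `TrainingConfig`, and the detector functions are stand-ins I made up, not CAIS code or any real library's API. The point it illustrates is just that the held-out detectors are only consulted after training, and a flag leads back to a human redesign rather than to more gradient descent.

```python
# Minimal, hypothetical sketch of the held-out-detectors loop described above.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingConfig:
    # Knobs a human designer might change between attempts (architecture,
    # data mix, in-loop penalties, ...). Placeholder fields only.
    seed: int
    use_inloop_detector_penalty: bool


def train_model(config: TrainingConfig):
    """Hypothetical training routine; in-loop detectors (if any) live in here."""
    return {"config": config}  # stub model object


def heldout_detector_a(model) -> float:
    """Held-out detector: never appears in any training objective."""
    return 0.0  # stub score; higher = more deception-like


def heldout_detector_b(model) -> float:
    """Another held-out detector, ideally based on a different mechanism."""
    return 0.0  # stub


HELDOUT: List[Callable] = [heldout_detector_a, heldout_detector_b]
THRESHOLD = 0.5


def attempt(config: TrainingConfig):
    model = train_model(config)
    # The held-out detectors are only consulted after training finishes,
    # so gradient descent never gets to optimize against them directly.
    flagged = any(detector(model) > THRESHOLD for detector in HELDOUT)
    return model, flagged


if __name__ == "__main__":
    model, flagged = attempt(TrainingConfig(seed=0, use_inloop_detector_penalty=True))
    if flagged:
        # The response to a flag is a human redesign of the setup
        # ("human design search"), not another round of SGD against the detectors.
        print("Held-out detectors fired: discard the model and rethink the approach.")
    else:
        print("No held-out detector fired (evidence, not proof, of non-deception).")
```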
Also, FWIW, I doubt that Trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute-force optimization. If you ported this over to sequential-decision-making land, where triggers can be long and complicated, that would help a lot. I see a lot of current Trojan detection research as laying the groundwork for future research that will be more relevant.
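As a rough illustration of what "search for the trigger using brute-force optimization" can look like, here is a short PyTorch sketch that optimizes a small additive patch to push a toy classifier toward a guessed target class. The model, shapes, and target label are invented for the example; this is not the competition's baseline or any particular paper's method.

```python
# Rough, self-contained illustration of brute-force trigger search on a toy model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier (in practice: the model suspected of containing a Trojan).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

images = torch.rand(16, 3, 32, 32)   # a batch of clean inputs
target_class = 7                     # hypothesized Trojan target label

# Learnable trigger: an additive patch confined to a 4x4 corner by a fixed mask.
trigger = torch.zeros(3, 32, 32, requires_grad=True)
mask = torch.zeros(1, 32, 32)
mask[:, :4, :4] = 1.0

opt = torch.optim.Adam([trigger], lr=0.1)
loss_fn = nn.CrossEntropyLoss()
targets = torch.full((images.size(0),), target_class, dtype=torch.long)

for step in range(200):
    opt.zero_grad()
    triggered = (images + mask * trigger).clamp(0, 1)
    # Push every triggered input toward the target class.
    loss = loss_fn(model(triggered), targets)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = model((images + mask * trigger).clamp(0, 1)).argmax(dim=1)
print(f"fraction of inputs flipped to the target class: {(preds == target_class).float().mean():.2f}")
```

The point is just that with small, continuous triggers, a few hundred gradient steps can often recover something trigger-like; longer, more structured triggers in sequential settings would make this a much harder (and more deception-like) search problem.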
In general, it seems better to me to evaluate research by asking “where is this taking the field/what follow-up research is this motivating?” rather than “how are the words in this paper directly useful if we had to build AGI right now?” Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I’m pretty skeptical of a lot of the direct value of empirical research.
Thank you Thomas, I really appreciate you taking the time to write out your comment, it is very useful feedback.
I’ve linked your comment in the post and rewritten the description of CAIS.
Thanks! I really appreciate it, and think it’s a lot more accurate now. Nitpicks:
I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.
Right! I’ve changed both.
I confused CAIS with Drexler’s Comprehensive AI Services. Can you add a clarification stating that they are different things?
Good point. We’ve added the Center for AI Safety’s full name into the summary table which should help.