If you expect discontinuous takeoff, or you want a proof that your AGI is safe, then I agree transparency / interpretability is unlikely to give you what you want.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious ones to try. (Red teaming would be included in this.)
However, I suspect Chris Olah, Evan Hubinger, Daniel Filan, and Matthew Barnett would all not justify interpretability / transparency on these grounds. I don’t know about Paul Christiano.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious ones to try.
I support work on interpretability/transparency, in part because I’m uncertain about discontinuous vs. gradual takeoff, and in part because I’m not very optimistic about any other AI safety approach either; I think we probably just need to try a whole bunch of different approaches that each have a low probability of success, in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse compiling a neural network into human-readable code and then using that to generate human feedback on the model’s decision-making process) to be very questionable.
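To make the flavor of that story concrete, here is a minimal sketch (my own toy illustration, not anyone’s actual proposal): distill a small neural network into a human-readable surrogate, in this case a shallow decision tree, which is the kind of artifact a human could read and give feedback on. Real transparency proposals target far larger models and much richer decompilation outputs; the sketch only shows what “something human-readable extracted from a network” might look like.

```python
# Toy illustration only: approximate a small neural network with a
# human-readable surrogate (a shallow decision tree) that a person could read.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# A small "black box" network trained on synthetic data.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0)
mlp.fit(X, y)

# Stand-in for "reverse compiling": distill the network into a shallow tree
# by imitating its predictions, then render the tree as readable rules.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, mlp.predict(X))

print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))
agreement = (surrogate.predict(X) == mlp.predict(X)).mean()
print(f"surrogate matches the network on {agreement:.1%} of inputs")
```

Even in this toy setting, a human has to read and judge the extracted rules; doing something like that for every decision-relevant piece of a large model, as part of the training loop, is where the cost concern later in this exchange comes from.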
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
(If a single failure meant that we lose, then I wouldn’t say this; so perhaps we also need to add in another claim that the first failure does not mean automatic loss. Regular engineering practices get you to high degrees of reliability, not perfect reliability.)
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from “a single failure meant that we lose”, the failure scenario I usually have in mind is that AI safety/alignment is too slow to develop or too costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case, with respect to interpretability, I was complaining that having humans look at reverse-compiled neural networks and give “feedback on process” as part of ML training seems impractically expensive.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?
Two responses:
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve the problems you notice; in that case you need to have enough social coordination to no longer deploy the systems.
Second, is there a consensus that recommendation algorithms are net negative? Within this community, that’s probably the consensus, but I don’t think that consensus holds more broadly. If we can’t solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.
(Part of the social coordination problem is building consensus that something is wrong.)
the failure scenario I usually have in mind is that AI safety/alignment is too slow to develop or too costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.
For many of the ways they could push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I’d predict that most people optimistic about transparency / interpretability would agree with at least that example.
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve the problems you notice; in that case you need to have enough social coordination to no longer deploy the systems.
Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn’t aligned. (I didn’t state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about “solving”, not just “noticing”.)
As I’ve argued before, I think a large part of solving the social coordination problem is making sure that strategists and policymakers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.