I feel like there’s currently a wave of optimism among some AI safety researchers around transparency/interpretability, and to me it looks like another case of “optimism by default + not thinking things through”, analogous to how many people, such as Eliezer, were initially very optimistic about AGI being beneficial when they first thought of the idea. I find myself asking the same skeptical questions to different people who are optimistic about transparency/interpretability and not really getting good answers. Anyone want to try to convince me that I’m wrong about this?
If you expect discontinuous takeoff, or you want a proof that your AGI is safe, then I agree transparency / interpretability is unlikely to give you what you want.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try. (Red teaming would be included in this.)
However, I suspect that none of Chris Olah, Evan Hubinger, Daniel Filan, or Matthew Barnett would justify interpretability / transparency on these grounds. I don’t know about Paul Christiano.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try.
I support work on interpretability/transparency, in part because I’m uncertain about discontinuous vs gradual takeoff, and in part because I’m not very optimistic about any other AI safety approach either and think we probably just need to try a whole bunch of different approaches that each have low probability of success in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse compiling a neural network into human readable code and then using that to generate human feedback on the model’s decision-making process) to be very questionable.
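To make the kind of story I’m questioning concrete: the closest existing analogue I can think of is distilling a small, readable surrogate from a network and having a human critique that. Here is a toy sketch of that analogue (my own illustration, not anyone’s actual proposal; the synthetic dataset, the model sizes, and the choice of a decision tree as the “readable” artifact are all stand-ins):

```python
# Toy analogue of "reverse compiling" a network into something human-readable:
# distill a shallow decision tree from the network's input/output behavior and
# print it so a human could, in principle, give feedback on the "process".
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in "opaque" model trained on synthetic data.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0).fit(X, y)

# Fit a readable surrogate to the network's own predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, net.predict(X))
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))

# The surrogate is only an approximation; the fraction of inputs where it
# disagrees with the network bounds how far the readable story can be trusted.
print("disagreement rate:", (surrogate.predict(X) != net.predict(X)).mean())
```

Even in this trivial setting, a human has to read and judge the extracted rules for every model (and every retraining of it), which is part of why “feedback on process” at scale looks impractically expensive to me.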
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
(If a single failure meant that we lose, then I wouldn’t say this; so perhaps we also need to add in another claim that the first failure does not mean automatic loss. Regular engineering practices get you to high degrees of reliability, not perfect reliability.)
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from “single failure meant that we lose”, the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case, WRT interpretability, I was complaining that having humans look at reverse-compiled neural networks and give “feedback on process” as part of ML training seems impractically expensive.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?
Two responses:
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to fix the problems you notice; in that case you need enough social coordination to stop deploying the systems.
Second, is there a consensus that recommendation algorithms are net negative? Within this community, that’s probably the consensus, but I don’t think it’s a consensus more broadly. If we can’t solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.
(Part of the social coordination problem is building consensus that something is wrong.)
the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.
For many of the ways they could push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I’d predict that most people optimistic about transparency / interpretability would agree with at least that example.
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to fix the problems you notice; in that case you need enough social coordination to stop deploying the systems.
Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn’t aligned. (I didn’t state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about “solving”, not just “noticing”.)
As I’ve argued before, I think a large part of solving social coordination is making sure that strategists and policy makers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.
I was one of the people you asked the skeptical question, and I feel like I have a better reply now than I did at the time. In particular, your objection was:
To generalize my question, what if something goes wrong, we peek inside and find out that it’s one of the 10-15% of times when the model doesn’t agree with the known-algorithm which is used to generate the penalty term?
I agree this is an issue, but at worst it puts a bound on how well we can inspect the neural network’s behavior. In other words, it means something like, “Our model of what this neural network is doing is wrong X% of the time.” This sounds bad, but X can also be quite low. Perhaps more importantly, though, we shouldn’t expect by default that, in the X% of cases where our guess is bad, the neural network is adversarially optimizing against us.
The errors that we make are potentially neutral errors, meaning that the AI could be doing something either bad or good in those cases, but probably nothing purposely catastrophic. We can strengthen this condition by using adversarial training to deliberately search for interpretations that prioritize exposing catastrophic planning.
ETA: This is essentially why engineers don’t need to employ quantum mechanics to argue that their designs are safe. The less computationally demanding everyday models might be less accurate, but by default engineers don’t worry that their bridge is going to adversarially exploit the (small) X% of cases where the predictions disagree. There is of course a lot to be said about when this assumption does not apply to AI designs.
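For concreteness, here is a minimal sketch of the kind of disagreement measurement I have in mind (a toy setup of my own, not anyone’s actual pipeline; `known_algorithm`, the synthetic data, and the model are all hypothetical stand-ins):

```python
# Minimal sketch of the disagreement-rate framing above: compare a learned model
# against a transparent reference algorithm, report X, and check (crudely) whether
# the X% of disagreements look "neutral" (scattered) rather than concentrated,
# concentration being the more worrying pattern.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))

def known_algorithm(x):
    """The transparent reference we trust: label 1 inside the radius-0.7 disk."""
    return (x[:, 0] ** 2 + x[:, 1] ** 2 < 0.49).astype(int)

y = known_algorithm(X)
# A deliberately small network, so it cannot match the reference exactly.
model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=1000, random_state=0).fit(X, y)

disagree = model.predict(X) != y
print(f"X = {disagree.mean():.1%} of inputs where the model departs from the reference")

# Crude neutrality check: are the disagreements spread out or piled up in one region?
print("mean location of disagreements:", X[disagree].mean(axis=0))
print("mean location of all inputs:   ", X.mean(axis=0))
```

None of this establishes that the disagreements are benign; it only makes the size and rough location of the model-vs-reference gap visible, which is where the adversarial-training idea above would have to take over.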
Perhaps more importantly, though, we shouldn’t expect by default that, in the X% of cases where our guess is bad, the neural network is adversarially optimizing against us.
I’m confused, because the post you made a day after this comment seems to argue the opposite of this. Did you change your mind in between, or am I missing something?
I thought about your objection for longer and realized that there are circumstances in which we can expect the model to adversarially optimize against us. I think I’ve less changed my mind and more clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.