Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from “single failure meant that we lose”, the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case, with respect to interpretability, I was complaining that having humans look at reverse-compiled neural networks and give “feedback on process” as part of ML training seems impractically expensive.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?
Two responses:
First, this is more of a social coordination problem—I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.
Second, is there a consensus that recommendation algorithms are net negative? Within this community, that’s probably the consensus, but I don’t think it’s a consensus more broadly. If we can’t solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.
(Part of the social coordination problem is building consensus that something is wrong.)
the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.
For many of the ways they might push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I’d predict that most people optimistic about transparency / interpretability would agree with at least that example.
First, this is more of a social coordination problem—I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.
Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn’t aligned. (I didn’t state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about “solving”, not just “noticing”.)
As I’ve argued before, I think a large part of solving social coordination is making sure that strategists and policy makers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.