I feel like there’s currently a wave of optimism among some AI safety researchers around transparency/interpretability, and to me it looks like another case of “optimism by default + not thinking things through”, analogous to how many people, such as Eliezer, were initially very optimistic about AGI being beneficial when they first thought of the idea. I find myself asking the same skeptical questions to different people who are optimistic about transparency/interpretability and not really getting good answers. Anyone want to try to convince me that I’m wrong about this?
If you expect discontinuous takeoff, or you want a proof that your AGI is safe, then I agree transparency / interpretability is unlikely to give you what you want.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try. (Red teaming would be included in this.)
However, I suspect that none of Chris Olah, Evan Hubinger, Daniel Filan, or Matthew Barnett would justify interpretability / transparency on these grounds. I don’t know about Paul Christiano.
If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try.
I support work on interpretability/transparency, in part because I’m uncertain about discontinuous vs gradual takeoff, and in part because I’m not very optimistic about any other AI safety approach either and think we probably just need to try a whole bunch of different approaches that each have low probability of success in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse compiling a neural network into human readable code and then using that to generate human feedback on the model’s decision-making process) to be very questionable.
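To make the kind of story I’m questioning concrete: the closest existing analogue I can think of is distilling a small, readable surrogate from a network and having a human critique that. Here is a toy sketch of that analogue (my own illustration, not anyone’s actual proposal; the synthetic dataset, the model sizes, and the choice of a decision tree as the “readable” artifact are all stand-ins):

```python
# Toy analogue of "reverse compiling" a network into something human-readable:
# distill a shallow decision tree from the network's input/output behavior and
# print it so a human could, in principle, give feedback on the "process".
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in "opaque" model trained on synthetic data.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0).fit(X, y)

# Fit a readable surrogate to the network's own predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, net.predict(X))
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))

# The surrogate is only an approximation; the fraction of inputs where it
# disagrees with the network bounds how far the readable story can be trusted.
print("disagreement rate:", (surrogate.predict(X) != net.predict(X)).mean())
```

Even in this trivial setting, a human has to read and judge the extracted rules for every model (and every retraining of it), which is part of why “feedback on process” at scale looks impractically expensive to me.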
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
(If a single failure meant that we lose, then I wouldn’t say this; so perhaps we also need to add in another claim that the first failure does not mean automatic loss. Regular engineering practices get you to high degrees of reliability, not perfect reliability.)
Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from “single failure meant that we lose”, the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case, WRT interpretability, I was complaining that having humans look at reverse-compiled neural networks and give “feedback on process” as part of ML training seems impractically expensive.
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?
Two responses:
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to fix the problems you notice; in that case you need enough social coordination to stop deploying the systems.
Second, is there a consensus that recommendation algorithms are net negative? Within this community, that’s probably the consensus, but I don’t think it’s a consensus more broadly. If we can’t solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.
(Part of the social coordination problem is building consensus that something is wrong.)
the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.
For many of the ways they could push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I’d predict that most people optimistic about transparency / interpretability would agree with at least that example.
First, this is more of a social coordination problem: I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to fix the problems you notice; in that case you need enough social coordination to stop deploying the systems.
Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn’t aligned. (I didn’t state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about “solving”, not just “noticing”.)
As I’ve argued before, I think a large part of solving social coordination is making sure that strategists and policy makers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.
I was one of the people you asked the skeptical question, and I feel like I have a better reply now than I did at the time. In particular, your objection was:
To generalize my question, what if something goes wrong, we peek inside and find out that it’s one of the 10-15% of times when the model doesn’t agree with the known-algorithm which is used to generate the penalty term?
I agree this is an issue, but at worst it puts a bound on how well we can inspect the neural network’s behavior. In other words, it means something like, “Our model of what this neural network is doing is wrong X% of the time.” This sounds bad, but X can also be quite low. Perhaps more importantly, though, we shouldn’t expect by default that, in the X% of cases where our guess is bad, the neural network is adversarially optimizing against us.
The errors that we make are potentially neutral errors, meaning that the AI could be doing something either bad or good in those cases, but probably nothing purposely catastrophic. We can strengthen this condition by using adversarial training to deliberately search for interpretations that prioritize exposing catastrophic planning.
ETA: This is essentially why engineers don’t need to employ quantum mechanics to argue that their designs are safe. The less computationally demanding everyday models might be less accurate, but by default engineers don’t worry that their bridge is going to adversarially exploit the (small) X% of cases where the predictions disagree. There is of course a lot to be said about when this assumption does not apply to AI designs.
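For concreteness, here is a minimal sketch of the kind of disagreement measurement I have in mind (a toy setup of my own, not anyone’s actual pipeline; `known_algorithm`, the synthetic data, and the model are all hypothetical stand-ins):

```python
# Minimal sketch of the disagreement-rate framing above: compare a learned model
# against a transparent reference algorithm, report X, and check (crudely) whether
# the X% of disagreements look "neutral" (scattered) rather than concentrated,
# concentration being the more worrying pattern.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))

def known_algorithm(x):
    """The transparent reference we trust: label 1 inside the radius-0.7 disk."""
    return (x[:, 0] ** 2 + x[:, 1] ** 2 < 0.49).astype(int)

y = known_algorithm(X)
# A deliberately small network, so it cannot match the reference exactly.
model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=1000, random_state=0).fit(X, y)

disagree = model.predict(X) != y
print(f"X = {disagree.mean():.1%} of inputs where the model departs from the reference")

# Crude neutrality check: are the disagreements spread out or piled up in one region?
print("mean location of disagreements:", X[disagree].mean(axis=0))
print("mean location of all inputs:   ", X.mean(axis=0))
```

None of this establishes that the disagreements are benign; it only makes the size and rough location of the model-vs-reference gap visible, which is where the adversarial-training idea above would have to take over.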
Perhaps more importantly, though, we shouldn’t expect by default that, in the X% of cases where our guess is bad, the neural network is adversarially optimizing against us.
I’m confused, because the post you made a day after this comment seems to argue the opposite of this. Did you change your mind in between, or am I missing something?
I thought about your objection for longer and realized that there are circumstances in which we can expect the model to adversarially optimize against us. I think I’ve less changed my mind and more clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.