I emailed Luke some corrections to the transcript above, most of which are now implemented. The changes that seemed least trivial to me (noted in underline):
Rohin Shah: [...] I think intelligence explosion microeconomics was good. I think AI alignment, why it’s hard and where to start, was misleading.
→ Rohin Shah: [...] I think Intelligence Explosion Microeconomics was good. I think AI Alignment: Why It’s Hard and Where to Start, was misleading.
Buck Shlegeris: Me and Rohin are going to disagree about this. I think that trying to model human preferences as a utility function is really dumb and bad and will not help you do things that are useful. I don’t know. If I want to make an AI that’s incredibly good at recommending me movies that I’m going to like, some kind of value learning thing where it tries to learn my utility function over movies is supposedly a good idea.
→ Buck Shlegeris: Me and Rohin are going to disagree about this. I think that trying to model human preferences as a utility function is really dumb and bad and will not help you do things that are useful. I don’t know; If I want to make an AI that’s incredibly good at recommending me movies that I’m going to like, some kind of value learning thing where it tries to learn my utility function over movies is plausibly a good idea.
Rohin Shah: An optimal agent playing a CIRL (cooperative inverse reinforcement learning) game. I agree with your argument. If you take an optimality as defined in the cooperative inverse reinforcement learning paper and it’s playing over a long period of time, then yes, it’s definitely going to prefer to keep itself in charge rather than a different AI system that would infer values in a different way.
→ Rohin Shah: For an optimal agent playing a CIRL (cooperative inverse reinforcement learning) game, I agree with your argument. If you take optimality as defined in the cooperative inverse reinforcement learning paper and it’s playing over a long period of time, then yes, it’s definitely going to prefer to keep itself in charge rather than a different AI system that would infer values in a different way.
Buck Shlegeris: The problem isn’t just the capabilities’ problem.
→ Buck Shlegeris: The problem isn’t just the capabilities problem.
You still have the is-ought problem where the facts about the brain are as facts and how you translate that into odd facts is going to involve some assumptions.
→ You still have the is-ought problem where the facts about the brain are “is” facts and how you translate that into “ought” facts is going to involve some assumptions.
great reflection / longer reflection
→ long reflection
The following changes aren’t implemented yet:
Rohin Shah: It’s more like there are just infinitely many possible pairs of planning functions and utility functions that exactly predict human behavior. Even if it were true that humans were expected utility maximizers which Buck is arguing, we’re not. I agree with him. There is a planning function that’s like
→ Rohin Shah: It’s more like there are just infinitely many possible pairs of planning functions and utility functions that exactly predict human behavior. Even if it were true that humans were expected utility maximizers (which Buck is arguing we’re not, and I agree with him), there is a planning function that’s like
Rohin Shah: [...] The hope here is that since both sides of the debate can point out flaws on the other side’s arguments, they’re both very powerful AI systems. Such a set up can use a human judge to train far more capable agents while still incentivizing the agents to provide honest true information.
→ Rohin Shah: [...] The hope here is that since both sides of the debate can point out flaws on the other side’s arguments—they’re both very powerful AI systems—such a set up can use a human judge to train far more capable agents while still incentivizing the agents to provide honest true information.
Buck Shlegeris: I could be totally wrong about this, and correct me if I’m wrong, my sense is that you have to be able to back out the agent’s utility function or its models of the world. Which seems like it’s assuming a particular path for AI development, which doesn’t seem to me particularly likely.
→ Buck Shlegeris: I could be totally wrong about this, and correct me if I’m wrong, my sense is that you have to be able to back out the agent’s utility function or its models of the world. Which seems like it’s assuming a particular path for AI development which doesn’t seem to me particularly likely.
Buck Shlegeris: I think that that does not sound like a good plan. I don’t think that we should expect our AI systems to be structured that way in the future.
Rohin Shah: Plausibly and you have to do this with natural language or something.
→ Buck Shlegeris: I think that that does not sound like a good plan. I don’t think that we should expect our AI systems to be structured that way in the future.
Rohin Shah: Plausibly we have to do this with natural language or something.
Rohin Shah: I mean it’s definitely not conceptually neat and elegant in the sense of it’s not attacking the underlying problem and in a problem setting where you expect adversarial optimization type dynamics. Conceptual elegance actually does count for quite a lot in whether or not you believe your solution will work.
→ Rohin Shah: I mean, it’s definitely not conceptually neat and elegant in the sense of it’s not attacking the underlying problem. And in a problem setting where you expect adversarial optimization type dynamics, conceptual elegance actually does count for quite a lot in whether or not you believe your solution will work.
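An aside on the planner/utility correction above: the point that infinitely many (planner, utility) pairs fit the same behavior is easy to see in a toy example. The sketch below is my own illustration with made-up states and utilities, not anything from the podcast.

```python
# Toy illustration (mine, not from the podcast) of the non-identifiability point:
# several different (planner, utility) pairs predict exactly the same behavior,
# so behavior alone cannot tell us which utility function a human "really" has.

states = ["hungry", "full"]
actions = ["eat", "rest"]

def utility(state, action):
    # The utility we might hope to infer from behavior.
    table = {("hungry", "eat"): 1.0, ("full", "rest"): 0.5}
    return table.get((state, action), 0.0)

def neg_utility(state, action):
    return -utility(state, action)

def rational_planner(u):
    # Picks the action that maximizes the utility it is handed.
    return lambda s: max(actions, key=lambda a: u(s, a))

def antirational_planner(u):
    # Picks the action that minimizes the utility it is handed.
    return lambda s: min(actions, key=lambda a: u(s, a))

def hardcoded_planner(u):
    # Ignores the utility entirely and just reproduces the observed policy.
    return lambda s: "eat" if s == "hungry" else "rest"

pairs = {
    "(rational planner, u)": rational_planner(utility),
    "(anti-rational planner, -u)": antirational_planner(neg_utility),
    "(hard-coded planner, any u)": hardcoded_planner(None),
}

# All three pairs predict the same action in every state, so a value-learning
# algorithm that only sees behavior needs extra assumptions to prefer one of them.
for name, policy in pairs.items():
    print(name, {s: policy(s) for s in states})
```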
More links:
I googled ‘daniel ellsberg nuclear first strikes’ and found U.S. Planned Nuclear First Strike to Destroy Soviets and China – Daniel Ellsberg on RAI (6/13) and U.S. Refuses to Adopt a Nuclear Weapon No First Use Pledge – Daniel Ellsberg on RAI (7/13).
Rohin Shah mentions a paper arguing image classifiers vulnerable to adversarial examples are “picking up on real imperceptible features that do generalize to the test set, that humans can’t detect”. This might be the MIT paper Adversarial Examples are not Bugs, they are Features. (A rough sketch of how such adversarial perturbations are constructed appears below.)
MIRI’s AI Risk for Computer Scientists workshop. Workshops are on hold due to COVID-19, but you’re welcome to apply, get in touch with us, etc.
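On the adversarial-examples link above: the phenomenon discussed is that a tiny, human-imperceptible perturbation can flip a classifier’s prediction, which the paper re-interprets as the model exploiting real but imperceptible predictive features. A minimal FGSM-style sketch, assuming a differentiable PyTorch classifier called `model` (a placeholder, not anything from the paper):

```python
# A rough FGSM-style sketch (my illustration, not the paper's actual setup) of how an
# imperceptibly small, gradient-aligned perturbation can change a classifier's output.
# `model` stands in for any differentiable image classifier and is an assumed placeholder.

import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.01):
    """Return images nudged by epsilon in the direction that increases the loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    perturbed = images + epsilon * images.grad.sign()
    # Keep pixels in the valid range; the change is ~1% per pixel, invisible to a human.
    return perturbed.clamp(0.0, 1.0).detach()

# Usage sketch: compare predictions before and after the perturbation.
# adv = fgsm_perturb(model, images, labels)
# print(model(images).argmax(dim=1), model(adv).argmax(dim=1))
```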
That is in fact what I meant :)