3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?
I don’t care about question 3. It’s been more than 4 years since I even seriously discussed the possibility of leaning on a mechanism like that, and even at that point it was not a very serious discussion.
“Don’t care” is quite strong. If you still hold this view—why don’t you care about 3? (Curious to hear from other people who basically don’t care about 3, either.)
Yeah, “don’t care” is much too strong. This comment was just meant in the context of the current discussion. I could instead say:
The kind of alignment agenda that I’m working on, and the one we’re discussing here, is not relying on this kind of generalization of corrigibility. This kind of generalization isn’t why we are talking about corrigibility.
However, I agree that there are lots of approaches to building AI that rely on some kind of generalization of corrigibility, and that studying those is interesting; I do care about how that goes.
In the context of this discussion I also would have said that I don’t care whether honesty generalizes. But that is also something I do care about, even though it’s not particularly relevant to this agenda (because the agenda is attempting to solve alignment under considerably more pessimistic assumptions).