Rohin Shah comments on How To Go From Interpretability To Alignment: Just Retarget The Search

Rohin Shah 15 Aug 2022 22:05 UTC
LW: 7 AF: 6
0
AF
… Interesting. I’ve been thinking we were talking about (2) this entire time, since on my understanding of “mesa optimizers”, (1) is not a mesa optimizer (what would its mesa objective be?).
If we’re imagining systems that look more like (1) I’m a lot more confused about how “retarget the search” is supposed to work. There’s clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly—is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of “human values” (or “user intent” or whatever)? If that sort of thing doesn’t hamstring the AI, why didn’t gradient descent do the same thing, except replacing it with a hardcoded concept of “reward” (which presumably a somewhat smart AGI would have)?
What links here?
- Highlights: Wentworth, Shah, and Murphy on “Retargeting the Search” by RobertM (14 Sep 2023 2:18 UTC; 85 points)
- johnswentworth 15 Aug 2022 22:42 UTC
  LW: 16 AF: 10
  6
  AF Parent
  So, part of the reason we expect a retargetable search process in the first place is that it’s useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the “outermost call”; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
  - Rohin Shah 16 Aug 2022 7:53 UTC
    LW: 10 AF: 8
    3
    AF Parent
    Okay, I think this is a plausible architecture that a learned program could have, and I don’t see super strong reasons for “retarget the search” to fail on this particular architecture (though I do expect that if you flesh it out you’ll run into more problems, e.g. I’m not clear on where “concepts” live in this architecture and I could imagine that poses problems for retargeting the search).
    Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human’s search). But I agree that my reason (2) above doesn’t clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn’t thinking about when I wrote my original comment.
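As an illustration of the "outermost call" idea in johnswentworth's reply, here is a minimal toy sketch, not something taken from the thread. It assumes a hypothetical planner whose goals are plain strings and whose subproblem structure is a hand-written table; `DECOMPOSITIONS`, `search`, and `run_agent` are all made-up names. A learned system would not expose its goals this cleanly, which is roughly the concern Rohin raises about where "concepts" live.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy world: each non-primitive goal decomposes into subgoals.
# Anything without an entry here is treated as a primitive action.
DECOMPOSITIONS: Dict[str, List[str]] = {
    "make_tea": ["boil_water", "steep_leaves"],
    "make_soup": ["boil_water", "add_vegetables"],
    "boil_water": ["fill_kettle", "heat_kettle"],
}


def search(
    goal: str,
    trace: Optional[List[Tuple[int, str]]] = None,
    depth: int = 0,
) -> List[str]:
    """General-purpose search: solve `goal` by recursively solving subgoals."""
    if trace is not None:
        trace.append((depth, goal))  # record every call and its recursion depth
    subgoals = DECOMPOSITIONS.get(goal)
    if subgoals is None:
        return [goal]  # primitive action: the plan is just the action itself
    plan: List[str] = []
    for sub in subgoals:
        # Recursive calls pick their own targets on the fly; nothing here
        # depends on what the outermost goal was.
        plan.extend(search(sub, trace, depth + 1))
    return plan


def run_agent(outer_goal: str = "make_tea") -> List[str]:
    """The 'outermost call'. Retargeting means changing only `outer_goal`."""
    return search(outer_goal)


original_plan = run_agent()                          # target the system started with
retargeted_plan = run_agent(outer_goal="make_soup")  # only the outer target changed

print(original_plan)
print(retargeted_plan)
```

In this sketch the retargeted run still makes the same recursive `boil_water` call as the original run; only the depth-0 goal differs. Whether a trained network factors this way is exactly the open question in the thread.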