Rohin Shah comments on How To Go From Interpretability To Alignment: Just Retarget The Search

Rohin Shah 15 Aug 2022 6:47 UTC
LW: 11 AF: 10
3
AF
(Note to readers: here’s another post (with comments from John) on the same topic, which I only just saw.)
I can think of two different kinds of AI systems you might have in mind:
1. An AI system that has a “subroutine” that runs A* search given a problem specification. The AI system works by formulating useful subgoals, converting those into A* problem specifications + heuristics, running the A* subroutine, and then executing the result.
2. An AI system that literally is A* search. The AI has (in its weights, if it is a learned neural net) a high-level “state space of the universe”, a high-level “conceptual actions” space, an ability to predict the next high-level state given a previous state + conceptual action, and some goal function (= the mesa-objective). Given an input, the AI converts it into a high-level state, runs A* with that state as the input, and executes the first action of the resulting plan.
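To make the contrast concrete, here is a minimal sketch of the two architectures on a toy domain. Everything in it is an assumption made for illustration: the `ProblemSpec` type, the integer states with `inc`/`double` actions, and the names `subroutine_agent` and `monolithic_agent` are hypothetical, and nothing here is a claim about how a learned system would actually be organized.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical problem specification handed to the A* subroutine."""
    start: int
    is_goal: Callable[[int], bool]
    successors: Callable[[int], Iterable[Tuple[str, int, float]]]  # (action, next, cost)
    heuristic: Callable[[int], float]


def a_star(spec: ProblemSpec) -> Optional[List[str]]:
    """Return a cheapest action sequence from spec.start to a goal, or None."""
    counter = 0  # unique tie-breaker so the heap never compares plans
    frontier = [(spec.heuristic(spec.start), counter, 0.0, spec.start, [])]
    best_cost = {spec.start: 0.0}
    while frontier:
        _, _, cost, state, plan = heapq.heappop(frontier)
        if spec.is_goal(state):
            return plan
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry
        for action, nxt, step in spec.successors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                counter += 1
                heapq.heappush(
                    frontier,
                    (new_cost + spec.heuristic(nxt), counter, new_cost, nxt, plan + [action]),
                )
    return None


def counting_spec(start: int, target: int) -> ProblemSpec:
    """Toy domain (assumed): reach `target` from `start` using +1 and *2."""
    def successors(s: int):
        for action, nxt in (("inc", s + 1), ("double", s * 2)):
            if nxt <= target:
                yield action, nxt, 1.0
    return ProblemSpec(
        start=start,
        is_goal=lambda s: s == target,
        successors=successors,
        heuristic=lambda s: 0.0 if s == target else 1.0,  # admissible: every step costs 1
    )


def subroutine_agent(start: int, subgoals: List[int]) -> List[List[str]]:
    """Architecture (1): the agent picks subgoals and calls A* once per subgoal."""
    state, plans = start, []
    for target in subgoals:
        plan = a_star(counting_spec(state, target))
        if plan is None:
            raise ValueError(f"subgoal {target} unreachable from {state}")
        plans.append(plan)
        state = target
    return plans


FIXED_TARGET = 10  # stands in for the single baked-in goal (the mesa-objective)


def monolithic_agent(observation: int) -> Optional[str]:
    """Architecture (2): the agent *is* the search; it returns only the first action."""
    plan = a_star(counting_spec(observation, FIXED_TARGET))
    return plan[0] if plan else None
```

In this sketch, “retargeting” architecture (1) means changing what `subroutine_agent` passes in, which is exactly the part that chooses subgoals and heuristics; retargeting architecture (2) means overwriting `FIXED_TARGET`.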
In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specifications, and heuristics, where it is not clear that “retarget the search” would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)
In (2), I don’t know why you expect to get general-purpose search instead of a very complex heuristic that’s very specific to the mesa-objective. There is only ever one goal that the A* search has to optimize for; why wouldn’t gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don’t exist?
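As a toy illustration of why goal-specific heuristics are tempting when the goal never changes, the sketch below runs A* on an open grid twice: once with a heuristic that knows nothing about the goal, and once with a Manhattan-distance heuristic hardwired to one fixed goal. The grid, the function names, and the choice of goal are all assumptions for illustration; it says nothing about what gradient descent would actually learn.

```python
import heapq


def a_star_grid(start, goal, heuristic, size=8):
    """A* on an open size x size grid with unit step costs.

    Returns (path_cost, nodes_expanded). Ties on f are broken in favour of
    deeper nodes (larger g), a common choice that lets a good heuristic pay off.
    """
    frontier = [(heuristic(start), 0, start)]  # entries are (f, -g, state)
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, neg_g, state = heapq.heappop(frontier)
        g = -neg_g
        if g > best_g[state]:
            continue  # stale queue entry
        if state == goal:
            return g, expanded
        expanded += 1
        x, y = state
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + heuristic(nxt), -(g + 1), nxt))
    return None, expanded


GOAL = (7, 7)  # the one goal this search is ever asked to reach


def no_goal_knowledge(state):
    """General-purpose but uninformative: reduces A* to uniform-cost search."""
    return 0


def tuned_to_goal(state):
    """Goal-specific: Manhattan distance to the single hardwired GOAL."""
    return abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1])


blind_cost, blind_expanded = a_star_grid((0, 0), GOAL, no_goal_knowledge)
tuned_cost, tuned_expanded = a_star_grid((0, 0), GOAL, tuned_to_goal)
print(blind_cost, blind_expanded, tuned_cost, tuned_expanded)
```

Both runs find a path of the same cost, but the goal-tuned run expands far fewer nodes. The price is that `tuned_to_goal` is useless for any other goal, which is the tension between efficiency and retargetability raised above.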
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
What links here?
- Highlights: Wentworth, Shah, and Murphy on “Retargeting the Search” by RobertM (14 Sep 2023 2:18 UTC; 85 points)
- johnswentworth 15 Aug 2022 17:08 UTC
  LW: 13 AF: 9
  6
  AF Parent
  I’m imagining roughly (1), though with some caveats:
  - Of course it probably wouldn’t literally be A* search
  - Either the heuristic-generation is internal to the search subroutine, or it’s using a standard library of general-purpose heuristics for everything (or some combination of the two).
  - A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).
  I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That’s basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.
  > Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
  I expect basically “yes”, although the result would be something quite different from a human.
  We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do them. I’m currently working on a post on this, and my opening example is Benito’s job; here are some things he’s had to do over the past couple of years:
  - build a prototype of an office
  - resolve neighbor complaints at a party
  - find housing for 13 people with 2 days’ notice
  - figure out an invite list for 100+ people for an office
  - deal with people emailing a funder trying to get him defunded
  - set moderation policies for LessWrong
  - write public explanations of grantmaking decisions
  - organize weekly online Zoom events
  - ship books internationally by Christmas
  - moderate online debates
  - do April Fools’ jokes on LessWrong
  - figure out which of 100s of applicants to do trial hires with
  So there’s clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.
  That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they’re unsure what to point the retargetable search process at. If we were to hardwire a human’s search process to a particular target, they’d single-mindedly pursue that one target (and subgoals thereof); that’s quite different from normal humans.
  - Rohin Shah 15 Aug 2022 22:05 UTC
    LW: 7 AF: 6
    0
    AF Parent
    … Interesting. I’ve been thinking we were talking about (2) this entire time, since on my understanding of “mesa-optimizers”, (1) is not a mesa-optimizer (what would its mesa-objective be?).
    If we’re imagining systems that look more like (1) I’m a lot more confused about how “retarget the search” is supposed to work. There’s clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly—is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of “human values” (or “user intent” or whatever)? If that sort of thing doesn’t hamstring the AI, why didn’t gradient descent do the same thing, except replacing it with a hardcoded concept of “reward” (which presumably a somewhat smart AGI would have)?
    - johnswentworth 15 Aug 2022 22:42 UTC
      LW: 16 AF: 10
      6
      AF Parent
      So, part of the reason we expect a retargetable search process in the first place is that it’s useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the “outermost call”; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
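      A minimal sketch of the “retarget only the outermost call” idea, using a toy recursive goal-decomposition planner as a stand-in for a real search process. The `RECIPES` table, the goal names, and the `Agent` class are all hypothetical, chosen only to show that inner recursive calls keep picking their own targets while the outer target is swapped.

      ```python
      from typing import Dict, List

      # Hypothetical toy world: each goal lists the subgoals that must come first.
      RECIPES: Dict[str, List[str]] = {
          "host_party": ["find_venue", "send_invites"],
          "send_invites": ["build_guest_list"],
          "ship_books": ["print_books", "book_courier"],
          "find_venue": [],
          "build_guest_list": [],
          "print_books": [],
          "book_courier": [],
      }


      def search(goal: str) -> List[str]:
          """Return an ordered plan for `goal`, recursing on its subgoals."""
          plan: List[str] = []
          for subgoal in RECIPES[goal]:
              # Recursive call: the search process chooses this target on the fly.
              for step in search(subgoal):
                  if step not in plan:
                      plan.append(step)
          plan.append(goal)
          return plan


      class Agent:
          def __init__(self, outer_goal: str) -> None:
              # The only thing "retargeting" overwrites in this sketch.
              self.outer_goal = outer_goal

          def act(self) -> List[str]:
              # Outermost call; every recursive call inside is left untouched.
              return search(self.outer_goal)


      agent = Agent("host_party")
      original_plan = agent.act()

      agent.outer_goal = "ship_books"  # retarget the outermost call only
      retargeted_plan = agent.act()
      print(original_plan, retargeted_plan)
      ```

      The sketch assumes the outer target is stored in one clean, identifiable place (`outer_goal`); whether a learned system exposes anything like that is exactly the open question.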
      - Rohin Shah 16 Aug 2022 7:53 UTC
        LW: 10 AF: 8
        3
        AF Parent
        Okay, I think this is a plausible architecture that a learned program could have, and I don’t see super strong reasons for “retarget the search” to fail on this particular architecture (though I do expect that if you flesh it out you’ll run into more problems, e.g. I’m not clear on where “concepts” live in this architecture and I could imagine that poses problems for retargeting the search).
        Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human’s search). But I agree that my reason (2) above doesn’t clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn’t thinking about when I wrote my original comment.
  - Evan R. Murphy 16 Aug 2022 23:09 UTC
    4 points
    0
    Parent
    Link to the post John mentions in the parent comment: https://www.alignmentforum.org/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see