Definitely agree that “Retarget the Search” is an interesting baseline alignment method you should be considering.
I prefer what you call “complicated schemes” over “retarget the search” for two main reasons:
They don’t rely on the “mesa-optimizer assumption” that the model is performing retargetable search (which I think will probably be false in the systems we care about).
They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say things like “look, my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn”, whereas “Retarget the Search” can’t use this weaker interpretability at all. (Depending on background assumptions you might think this doesn’t reduce x-risk at all; that could also be a crux.)
I indeed think those are the relevant cruxes.
They don’t rely on the “mesa-optimizer assumption” that the model is performing retargetable search (which I think will probably be false in the systems we care about).
Why do you think we probably won’t end up with mesa-optimizers in the systems we care about?
Curious about both which systems you think we’ll care about (e.g. generative models, RL-based agents, etc.) and why you don’t think mesa-optimization is a likely emergent property for very scaled-up ML models.
1. It’s a very specific claim about how intelligence works, so it gets a low prior, from which I don’t update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations).
2. Search is computationally inefficient relative to heuristics, and we’ll be selecting really hard on computational efficiency (the model can’t just consider 10^100 plans and choose the best one when it only has 10^15 flops to work with). It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and “retarget the search” doesn’t necessarily solve your problem.
I’m not thinking much about whether we’re considering generative models vs RL-based agents for this particular question (though generally I tend to think about foundation models finetuned from human feedback).
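(Note to readers: here is a minimal sketch of the “generate a few plans, then select the best” picture from point 2 above. It is purely illustrative and not from the discussion; all function names are hypothetical. The point it encodes is that “retargeting” only swaps the selection objective, so it cannot recover plans the generator never proposes.)

```python
# Illustrative sketch only: a system that generates a handful of candidate
# plans and then selects the best one under some objective. If the learned
# generator is biased toward plans serving its original training objective,
# swapping out only `objective` ("retargeting") leaves that bias untouched.
from typing import Callable, List

Plan = str  # stand-in type; a real system would use something richer


def generate_candidate_plans(situation: str, n: int = 10) -> List[Plan]:
    """Hypothetical learned proposal generator: returns a small set of plans.
    Most of the optimization pressure lives here, not in the selection step."""
    return [f"plan-{i} for {situation}" for i in range(n)]


def select_best(plans: List[Plan], objective: Callable[[Plan], float]) -> Plan:
    """The 'search' step: score a handful of candidates and take the argmax."""
    return max(plans, key=objective)


def act(situation: str, objective: Callable[[Plan], float]) -> Plan:
    candidates = generate_candidate_plans(situation)
    return select_best(candidates, objective)


# "Retargeting" only changes `objective`; if no aligned plan is ever generated,
# the argmax over ten (or a million) candidates cannot recover one.
print(act("some situation", objective=len))
```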
most of the action is in which plans were generated in the first place and “retarget the search” doesn’t necessarily solve your problem
I definitely buy this and I think the thread under this between you and John is a useful elaboration.
The thing that generates the proposals has to do most of the heavy lifting in any interestingly-large problem. e.g. I would argue most of the heavy lifting[1] of AlphaGo and that crowd is done by the fact that the atomic actions are all already ‘interestingly consequential’ (i.e. the proposal generation doesn’t have to consider millisecond muscle twitches but rather whole ‘moves’, a short string of which is genuinely consequential in-context).
Nevertheless I reasonably strongly think that something of the ‘retargetable search’ flavour is a useful thing to expect, look for, and attempt to control.
For one, once you have proposals which are any kind of good at all, running a couple of OOMs of plan selection can buy you a few standard deviations of plan quality, provided you can evaluate plans ex ante better than randomly, which is just generically applicable and useful. But this isn’t the main thing, because with just that picture we’re still back to most of the action being the generator/heuristics.
The main things are that
(as John pointed out) recursive-ish generic planning is enormously useful and general, and implies at least some degree of retargetability.
(this is shaky and insidey) how do you arrive at the good heuristics/generators? It’s something like
‘magic abstraction from relevantly-similar experience’
‘magic recomposition of abstractions’
how do you get relevantly-similar experience?
it’s ‘easy’ for ‘easy’ problems (e.g. low dimensional, defined action-space already ‘interestingly consequential’, someone already collected a dataset of examples, …)
maybe one or more of these will apply to all the necessary pieces for PASTA-like AGI, but I doubt it
what about for ‘hard’ problems (e.g. high dimensional, action-space not pre-fitted to the problem, few or no existing examples)?
you need to ‘deliberately explore’ aka ‘do science’ aka ‘experiment’
recursive-ish generic planning also looks to me like a good tool (‘the right/only tool’?) for pulling this off!
cf P2B: Plan to P2B Better and my restatement of instrumental convergence
[1] (this is an entirely unfair defamation of Silver et al, which I feel the need to qualify is at least partly rhetorical and not in fact my entire take on the matter)
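(Note to readers: a quick numerical illustration of the earlier claim that “running a couple of OOMs of plan selection can buy you a few standard deviations of plan quality”. It assumes plan quality is i.i.d. Gaussian and that the evaluator ranks plans perfectly, both strong simplifications.)

```python
# Monte Carlo check: how good is the best of n random plans, in standard
# deviations above the mean, if plan quality is standard normal?
import random
import statistics


def expected_best_of(n: int, trials: int = 500) -> float:
    """Estimate E[max of n standard-normal draws]."""
    return statistics.fmean(
        max(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)
    )


for n in (10, 100, 10_000):
    print(n, round(expected_best_of(n), 2))
# Roughly 1.5, 2.5, and 3.9 standard deviations above the mean, respectively.
```

The gain grows only like √(2 ln n), so returns to pure selection diminish; the proposal generator still has to do most of the work, which is the point being made above.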
I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don’t think you’re young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word “search” refers to?
I’ve definitely seen A* search and know how it works. I meant to allude to it (and lots of other algorithms that involve a clear goal) with this part:
It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and “retarget the search” doesn’t necessarily solve your problem.
If your AGI is doing an A* search, then I think “retarget the search” is not a great strategy, because you have to change both the goal specification and the heuristic, and it’s really unclear how you would change the heuristic even given a solution to outer alignment (because A* heuristics are incredibly specific to the setting and goal, and presumably have to become way more specialized than they are today in order for it to be more powerful than what we can do today).
That’s what relaxation-based methods are for; they automatically generate heuristics for A* search. For instance, in a maze, it’s very easy to find a solution if we relax all the “can’t cross this wall” constraints, and that yields the Euclidean distance heuristic. Also, to a large extent, those heuristics tend to depend on the environment but not on the goal—for instance, in the case of Euclidean distance in a maze, the heuristic applies to any pathfinding problem between two points (and probably many other problems too), not just to whatever particular start and end points the particular maze highlights. We can also view things like instrumentally convergent subgoals or natural abstractions as likely environment-specific (but not goal-specific) heuristics.
Those are the sort of pieces I imagine showing up as part of “general-purpose search” in trained systems: general methods for generating heuristics for a wide variety of goals, as well as some hard-coded environment-specific (but not goal-specific) heuristics.
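(Note to readers: a small runnable sketch of the relaxation idea above. The heuristic is straight-line distance, obtained by ignoring the walls, and the same heuristic works for any start/goal pair in the maze; the grid size, wall layout, and function names are made up for illustration.)

```python
# A* on a toy grid maze. The heuristic comes from relaxing the "can't cross
# walls" constraint, so it depends on the environment's geometry but not on
# which particular goal we hand the search ("retargeting" = new goal, same
# heuristic generator).
import heapq
import math
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def relaxed_distance(a: Cell, b: Cell) -> float:
    """Heuristic from the relaxed (wall-free) problem: straight-line distance."""
    return math.dist(a, b)


def a_star(walls: Set[Cell], start: Cell, goal: Cell, size: int = 10) -> List[Cell]:
    # frontier entries are (f = g + h, g, cell, path so far)
    frontier = [(relaxed_distance(start, goal), 0.0, start, [start])]
    visited: Set[Cell] = set()
    while frontier:
        _, cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in visited:
            continue
        visited.add(cell)
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            new_cost = cost + 1
            heapq.heappush(
                frontier,
                (new_cost + relaxed_distance(nxt, goal), new_cost, nxt, path + [nxt]),
            )
    return []  # no path found


walls = {(1, y) for y in range(9)}  # one long wall with a gap at (1, 9)
print(a_star(walls, start=(0, 0), goal=(5, 5)))
print(a_star(walls, start=(0, 0), goal=(9, 0)))  # retargeted: same heuristic, new goal
```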
(Note to readers: here’s another post (with comments from John) on the same topic, which I only just saw.)
I can imagine two different kinds of AI systems you might have in mind:
1. An AI system that has a “subroutine” that runs A* search given a problem specification. The AI system works by formulating useful subgoals, converting those into A* problem specifications + heuristics, running the A* subroutine, and then executing the result.
2. An AI system that literally is A* search. The AI has (in its weights, if it is a learned neural net) a high-level “state space of the universe”, a high-level “conceptual actions” space, an ability to predict the next high-level state given a previous state + conceptual action, and some goal function (= the mesa-objective). Given an input, the AI converts it into a high-level state, runs A* with that state as the input, takes the resulting plan, and executes the first action of the plan.
In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specification, and heuristics, where it is not clear that “retarget the search” would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)
In (2), I don’t know why you expect to get general-purpose search instead of a very complex heuristic that’s very specific to the mesa objective. There is only ever one goal that the A* search has to optimize for; why wouldn’t gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don’t exist?
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
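(Note to readers: a toy sketch of option 2, where the system is the planner. The world model, action space, and goal function below are ordinary Python stand-ins for things that would live in a learned model’s weights, and the planner is a naive enumeration rather than literal A*; “retargeting the search” here would mean replacing mesa_objective and nothing else.)

```python
# Toy "the AI *is* the planner" architecture: abstract states, conceptual
# actions, a transition model, and a goal function wired into the planning loop.
from typing import Callable, Dict, List, Tuple

State = str
Action = str

# Stand-in world model: (state, action) -> next state.
TRANSITIONS: Dict[Tuple[State, Action], State] = {
    ("low-resources", "gather"): "resourced",
    ("resourced", "build"): "goal-ish",
}


def available_actions(state: State) -> List[Action]:
    return [a for (s, a) in TRANSITIONS if s == state]


def plan(state: State, objective: Callable[[State], float], depth: int = 3) -> List[Action]:
    """Naive planner: enumerate short action sequences, keep the best outcome."""
    if depth == 0:
        return []
    best_score, best_plan = objective(state), []
    for action in available_actions(state):
        nxt = TRANSITIONS[(state, action)]
        tail = plan(nxt, objective, depth - 1)
        final = nxt
        for a in tail:
            final = TRANSITIONS[(final, a)]
        if objective(final) > best_score:
            best_score, best_plan = objective(final), [action] + tail
    return best_plan


def mesa_objective(s: State) -> float:
    return 1.0 if s == "goal-ish" else 0.0


# "Retargeting the search" = handing `plan` a different objective function.
print(plan("low-resources", mesa_objective))  # ['gather', 'build']
```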
I’m imagining roughly (1), though with some caveats:
Of course it probably wouldn’t literally be A* search
Either the heuristic-generation is internal to the search subroutine, or it’s using a standard library of general-purpose heuristics for everything (or some combination of the two).
A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).
I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That’s basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
I expect basically “yes”, although the result would be something quite different from a human.
We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do them. I’m currently working on a post on this, and my opening example is Benito’s job; here are some things he’s had to do over the past couple of years:
build a prototype of an office
resolve neighbor complaints at a party
find housing for 13 people with 2 days notice
figure out an invite list for 100+ people for an office
deal with people emailing a funder trying to get him defunded
set moderation policies for LessWrong
write public explanations of grantmaking decisions
organize weekly online Zoom events
ship books internationally by Christmas
moderate online debates
do April Fools’ jokes on LessWrong
figure out which of 100s of applicants to do trial hires with
So there’s clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.
That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they’re unsure what to point the retargetable search process at. If we were to hardwire a human’s search process to a particular target, they’d single-mindedly pursue that one target (and subgoals thereof); that’s quite different from normal humans.
… Interesting. I’ve been thinking we were talking about (2) this entire time, since on my understanding of “mesa optimizers”, (1) is not a mesa optimizer (what would its mesa objective be?).
If we’re imagining systems that look more like (1) I’m a lot more confused about how “retarget the search” is supposed to work. There’s clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly—is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of “human values” (or “user intent” or whatever)? If that sort of thing doesn’t hamstring the AI, why didn’t gradient descent do the same thing, except replacing it with a hardcoded concept of “reward” (which presumably a somewhat smart AGI would have)?
So, part of the reason we expect a retargetable search process in the first place is that it’s useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the “outermost call”; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
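(Note to readers: a minimal sketch of the “retarget only the outermost call” idea: the search routine recursively invokes itself on subgoals it generates internally, and retargeting just means choosing what the top-level call is pointed at. Everything here, including decompose and solve_directly, is hypothetical.)

```python
# Recursive search on subproblems: the recursion keeps generating its own
# subgoals; only the argument of the outermost call is chosen by us.
from typing import List


def decompose(goal: str) -> List[str]:
    """Hypothetical subgoal generator (learned / heuristic in a real system)."""
    return [f"{goal}/part-{i}" for i in range(2)] if goal.count("/") < 2 else []


def solve_directly(goal: str) -> str:
    """Hypothetical primitive solver for goals small enough not to decompose."""
    return f"primitive-solution({goal})"


def search(goal: str) -> List[str]:
    subgoals = decompose(goal)
    if not subgoals:
        return [solve_directly(goal)]
    plan: List[str] = []
    for sub in subgoals:
        plan.extend(search(sub))  # recursive calls pursue their own subgoals
    return plan


# Retargeting = picking the goal of the *outermost* call only.
print(search("goal-chosen-by-us"))
```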
Okay, I think this is a plausible architecture that a learned program could have, and I don’t see super strong reasons for “retarget the search” to fail on this particular architecture (though I do expect that if you flesh it out you’ll run into more problems, e.g. I’m not clear on where “concepts” live in this architecture and I could imagine that poses problems for retargeting the search).
Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human’s search). But I agree that my reason (2) above doesn’t clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn’t thinking about when I wrote my original comment.
Link to the post John mentions in the parent comment: https://www.alignmentforum.org/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see