I am very confused by (2). It sounds like you are imagining that search necessarily means brute-force search (i.e. guess-and-check)? Like non-brute-force search is just not a thing? And therefore heuristics are necessarily a qualitatively different thing from search? But I don’t think you’re young enough to have never seen A* search, so presumably you know that formal heuristic search is a thing, and how to use relaxation to generate heuristics. What exactly do you imagine that the word “search” refers to?
I’ve definitely seen A* search and know how it works. I meant to allude to it (and lots of other algorithms that involve a clear goal) with this part:
It seems very plausible that the model considers, say, 10 plans and chooses the best one, or even 10^6 plans, but then most of the action is in which plans were generated in the first place and “retarget the search” doesn’t necessarily solve your problem.
If your AGI is doing an A* search, then I think “retarget the search” is not a great strategy, because you have to change both the goal specification and the heuristic, and it’s really unclear how you would change the heuristic even given a solution to outer alignment (because A* heuristics are incredibly specific to the setting and goal, and presumably have to become way more specialized than they are today in order for it to be more powerful than what we can do today).
That’s what relaxation-based methods are for; they automatically generate heuristics for A* search. For instance, in a maze, it’s very easy to find a solution if we relax all the “can’t cross this wall” constraints, and that yields the Euclidean distance heuristic. Also, to a large extent, those heuristics tend to depend on the environment but not on the goal—for instance, in the case of Euclidean distance in a maze, the heuristic applies to any pathfinding problem between two points (and probably many other problems too), not just to whatever particular start and end points the particular maze highlights. We can also view things like instrumentally convergent subgoals or natural abstractions as likely environment-specific (but not goal-specific) heuristics.
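As a rough illustration of the maze example above, here is a minimal sketch of A* on a small grid where the heuristic comes from the relaxed problem (no walls). The maze layout and the names `euclidean`, `a_star`, and `MAZE` are assumptions made up for this sketch; it assumes 4-connected moves with unit cost, under which straight-line distance never overestimates the true path length.

```python
import heapq
import math

def euclidean(a, b):
    """Cost of the relaxed problem: straight-line distance, ignoring all walls."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def a_star(grid, start, goal, heuristic=euclidean):
    """Return the length of a shortest 4-connected path, or None if unreachable.

    grid is a list of strings; '#' marks a wall, anything else is free.
    """
    rows, cols = len(grid), len(grid[0])
    best = {start: 0}  # cheapest known cost to reach each cell
    frontier = [(heuristic(start, goal), 0, start)]
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best.get(cell, math.inf):
            continue  # stale queue entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best.get(nxt, math.inf):
                    best[nxt] = new_cost
                    priority = new_cost + heuristic(nxt, goal)
                    heapq.heappush(frontier, (priority, new_cost, nxt))
    return None

# Hypothetical maze, invented for this example.
MAZE = [
    "....#",
    ".##.#",
    ".#...",
    ".#.#.",
    "...#.",
]

# The same heuristic function is reused, unchanged, for two different goals.
print(a_star(MAZE, (0, 0), (4, 4)))  # 8
print(a_star(MAZE, (4, 0), (0, 3)))  # 7
```

The only thing that changes between the two calls is the `goal` argument; the heuristic is tied to the grid geometry, not to any particular endpoint.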
Those are the sort of pieces I imagine showing up as part of “general-purpose search” in trained systems: general methods for generating heuristics for a wide variety of goals, as well as some hard-coded environment-specific (but not goal-specific) heuristics.
(Note to readers: here’s another post (with comments from John) on the same topic, which I only just saw.)
I imagine two different kinds of AI systems you might be imagining:
(1) An AI system that has a “subroutine” that runs A* search given a problem specification. The AI system works by formulating useful subgoals, converting those into A* problem specifications + heuristics, running the A* subroutine, and then executing the result.
(2) An AI system that literally is A* search. The AI has (in its weights, if it is a learned neural net) a high-level “state space of the universe”, a high-level “conceptual actions” space, an ability to predict the next high-level state given a previous state + conceptual action, and some goal function (= the mesa-objective). Given an input, the AI converts it into a high-level state, runs A* with that state as the input, takes the resulting plan, and executes the first action of that plan.
In (1), it seems like the major alignment work is in aligning the part of the AI system that formulates subgoals, problem specification, and heuristics, where it is not clear that “retarget the search” would work. (You could also try to excise the A* subroutine and use that as a general-purpose problem solver, but then you have to tune the heuristic manually; maybe you could excise the A* subroutine and the part that designs the heuristic, if you were lucky enough that those were fully decoupled from the subgoal-choosing part of the system.)
In (2), I don’t know why you expect to get general-purpose search instead of a very complex heuristic that’s very specific to the mesa objective. There is only ever one goal that the A* search has to optimize for; why wouldn’t gradient descent embed a bunch of goal-specific heuristics that improve efficiency? Are you saying that such heuristics don’t exist?
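To make the worry about goal-specific heuristics concrete, here is a toy sketch under invented assumptions: a small grid maze, one fixed “trained” goal, and a lookup table of exact distances to that goal standing in for a heuristic that has been heavily specialized to one objective. The names `MAZE`, `distances_to`, `TRAINED_GOAL`, and `table` are hypothetical, and this only illustrates the tradeoff being discussed; it is not a claim about what gradient descent actually learns.

```python
from collections import deque
import math

# Hypothetical maze, invented for this example ('#' is a wall).
MAZE = [
    "....#",
    ".##.#",
    ".#...",
    ".#.#.",
    "...#.",
]

def distances_to(grid, goal):
    """Exact shortest-path distance from every reachable cell to one fixed goal (BFS)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist

TRAINED_GOAL = (4, 4)
table = distances_to(MAZE, TRAINED_GOAL)  # stand-in for a goal-specific heuristic

start = (0, 0)
straight_line = math.hypot(start[0] - TRAINED_GOAL[0], start[1] - TRAINED_GOAL[1])

# For the trained goal, the table is exact and much sharper than straight-line distance.
print(table[start], round(straight_line, 2))  # 8 5.66

# For a different goal, the same table overestimates the true cost,
# so it could no longer be used as an admissible A* heuristic.
new_goal = (0, 3)
true_cost = distances_to(MAZE, new_goal)[start]
print(true_cost, table[start])  # 3 8
```

The specialized table is the more efficient heuristic as long as the goal never changes, and it is also the piece that would have to be rebuilt if someone swapped the goal.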
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
I’m imagining roughly (1), though with some caveats:
Of course it probably wouldn’t literally be A* search
Either the heuristic-generation is internal to the search subroutine, or it’s using a standard library of general-purpose heuristics for everything (or some combination of the two).
A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).
I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That’s basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
I expect basically “yes”, although the result would be something quite different from a human.
We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do them. I’m currently working on a post on this, and my opening example is Benito’s job; here are some things he’s had to do over the past couple of years:
build a prototype of an office
resolve neighbor complaints at a party
find housing for 13 people with 2 days’ notice
figure out an invite list for 100+ people for an office
deal with people emailing a funder trying to get him defunded
set moderation policies for LessWrong
write public explanations of grantmaking decisions
organize weekly online Zoom events
ship books internationally by Christmas
moderate online debates
do April Fools’ jokes on LessWrong
figure out which of 100s of applicants to do trial hires with
So there’s clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.
That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they’re unsure what to point the retargetable search process at. If we were to hardwire a human’s search process to a particular target, they’d single-mindedly pursue that one target (and subgoals thereof); that’s quite different from normal humans.
… Interesting. I’ve been thinking we were talking about (2) this entire time, since on my understanding of “mesa optimizers”, (1) is not a mesa optimizer (what would its mesa objective be?).
If we’re imagining systems that look more like (1), I’m a lot more confused about how “retarget the search” is supposed to work. There’s clearly some part of the AI system (or the human, in the analogy) that is deciding how to retarget the search on the fly. Is your proposal that we just chop that part off somehow, and replace it with a hardcoded concept of “human values” (or “user intent” or whatever)? If that sort of thing doesn’t hamstring the AI, why didn’t gradient descent do the same thing, except replacing it with a hardcoded concept of “reward” (which presumably a somewhat smart AGI would have)?
So, part of the reason we expect a retargetable search process in the first place is that it’s useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the “outermost call”; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
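A toy sketch of the “outermost call” idea may help here. Everything in it is made up for illustration (`RECIPES`, `search`, `Agent`, and the `outer_goal` attribute are hypothetical names), and a learned system would presumably not expose anything this clean. The point is only structural: the search calls itself on subproblems, and retargeting changes the argument of the top-level call while leaving the recursive machinery alone.

```python
# Hypothetical world model: which subgoals each goal decomposes into.
RECIPES = {
    "tea": ["hot water", "tea bag"],
    "hot water": ["kettle", "water"],
    "toast": ["bread", "toaster"],
}

def search(goal, have, plan=None):
    """Return a list of steps achieving `goal`, recursing on subgoals as needed."""
    if plan is None:
        plan = []
    if goal in have:
        return plan  # already satisfied, nothing to do
    for subgoal in RECIPES.get(goal, []):
        search(subgoal, have, plan)  # recursive call on a new subproblem
    plan.append(f"get {goal}")
    have.add(goal)
    return plan

class Agent:
    """The outer goal is just the argument passed to the outermost search call."""

    def __init__(self, outer_goal):
        self.outer_goal = outer_goal

    def act(self):
        return search(self.outer_goal, set())

agent = Agent("toast")
print(agent.act())  # ['get bread', 'get toaster', 'get toast']

# "Retargeting": swap only the outermost goal; recursive calls are untouched.
agent.outer_goal = "tea"
print(agent.act())
# ['get kettle', 'get water', 'get hot water', 'get tea bag', 'get tea']
```

In this sketch the hard part of the real proposal is hidden inside the string `"tea"`: specifying the new outer goal in the system’s own internal terms is exactly what the toy example assumes away.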
Okay, I think this is a plausible architecture that a learned program could have, and I don’t see super strong reasons for “retarget the search” to fail on this particular architecture (though I do expect that if you flesh it out you’ll run into more problems, e.g. I’m not clear on where “concepts” live in this architecture and I could imagine that poses problems for retargeting the search).
Personally I still expect systems to be significantly more tuned to the domains they were trained on, with search playing a more cursory role (which is also why I expect to have trouble retargeting a human’s search). But I agree that my reason (2) above doesn’t clearly apply to this architecture. I think the recursive aspect of the search was the main thing I wasn’t thinking about when I wrote my original comment.
Link to the post John mentions in the parent comment: https://www.alignmentforum.org/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see