Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as “Prove this theorem: [...]” or “Compute the fastest route to Paris.” We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.
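To make the shape of such an agent concrete, here is a minimal Python sketch, assuming the same kind of fixed state-dispatch superstructure described above. The state labels and subroutines (prove_theorem, fastest_route_to, classify_state) are hypothetical stand-ins for illustration, not anything from the original post.

def prove_theorem(statement: str) -> str:
    # Stand-in for an arbitrarily complicated subroutine.
    return f"proof sketch for {statement!r}"

def fastest_route_to(city: str) -> list[str]:
    # Stand-in for a route-planning subroutine (which may itself run a search internally).
    return ["start", city]

def classify_state(observation: dict) -> str:
    # The superstructure's perception step: map an observation onto a fixed set of states.
    return observation.get("state", "IDLE")

def act(observation: dict):
    # The superstructure itself: a fixed case analysis, with no search over actions at this level.
    state = classify_state(observation)
    if state == "ASKED_FOR_PROOF":
        return prove_theorem(observation["statement"])
    if state == "NEEDS_ROUTE":
        return fastest_route_to(observation["destination"])
    return "wait"

print(act({"state": "NEEDS_ROUTE", "destination": "Paris"}))  # ['start', 'Paris']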
Computing the fastest route to Paris doesn’t involve search?
More generally, I think that, in order for your example to work, it can’t contain subroutines that perform search over actions. Nor can it contain subroutines that, when called in the order the agent typically calls them, collectively constitute a search over actions.
And it’s still not obvious to me that this is viable. It seems possible in principle (just imagine a sufficiently large look-up table!), but it seems like it probably wouldn’t be competitive with agents that do search, at least to the extent that humans do. After all, humans evolved to do search over actions, but we totally didn’t have to: if bundles of heuristics worked equally well in the sort of complex environments we evolved in, then why didn’t we evolve that way instead?
EDIT: Just re-read and realized you are OK with subroutines that explicitly perform search over actions. But why? Doesn’t that undermine your argument? Like, suppose we have an architecture like this:
LOOP:
    State = GetStateOfWorld(Observation)
    IF State == InPain: Cry&FlailAbout
    IF State == AttractiveMateStraightAhead: MoveForward&Grin
    ELSE: Do(RunSubroutine[SearchOverActionsAndOutputActionThoughtToYieldGreatestExpectedNumberOfGrandchildren])
END_LOOP

This seems not meaningfully different from the version that doesn’t have the first two IF statements, as far as talk of optimizers is concerned.
My example uses search, but that search is not where the inner alignment failure lies. It is merely a subroutine called upon by the outer superstructure, and the superstructure itself is the part that is misaligned. Therefore, I fail to see why my point doesn’t follow.
If your position is that inner alignment failures can only occur when an internal search is misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the misalignment was not caused by a misaligned search (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).
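To illustrate the distinction, here is a minimal sketch with hypothetical names (not the post’s actual example): the search subroutine below does exactly what it was built to do, and the malign generalization enters through the outer dispatch choosing the wrong target once observations leave the training distribution.

import heapq

def shortest_path(graph, start, goal):
    # A perfectly ordinary search subroutine (Dijkstra over a weighted graph).
    # It faithfully finds the cheapest path to whatever goal it is handed;
    # nothing inside it is misaligned.
    dist = {start: 0}
    prev = {}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return list(reversed(path))
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(queue, (nd, nxt))
    return None

def choose_target(observation: dict) -> str:
    # The outer superstructure: a fixed rule that happened to pick the intended
    # target on every training input. Off-distribution it picks the wrong one,
    # and the (well-functioning) search then plans straight toward it.
    return "intended_goal" if observation.get("looks_like_training") else "unintended_goal"

def act(observation: dict, graph: dict, position: str):
    return shortest_path(graph, position, choose_target(observation))

graph = {"start": [("intended_goal", 1), ("unintended_goal", 1)]}
print(act({"looks_like_training": False}, graph, "start"))  # ['start', 'unintended_goal']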
You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.
(I read “In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents.” and thought you meant agents that don’t use internal search at all.)