Agreed that “search” is not a binary but more like a continuum, where we might call a program more “search-like” if it is enumerating possible actions and evaluating their consequences, and less “search-like” if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn’t select much for “search-like” programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will select for something more like the thermostat.
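To make the two ends of that continuum concrete, here is a minimal sketch. Everything in it is a toy assumption for illustration (the names `thermostat_policy` and `dfs_policy`, the one-degree-per-step dynamics, the 20-degree setpoint); it is not meant as a model of what a trained network actually implements.

```python
from typing import Callable, Sequence

SETPOINT = 20.0  # hypothetical target temperature for this toy example


def thermostat_policy(temperature: float) -> str:
    """Non-search end: map the observation straight to an action."""
    return "heat_on" if temperature < SETPOINT else "heat_off"


def dfs_policy(
    state: float,
    actions: Sequence[str],
    step: Callable[[float, str], float],
    score: Callable[[float], float],
    depth: int,
) -> str:
    """Search end: enumerate action sequences depth-first, evaluate the
    resulting states, and return the first action of the best sequence."""

    def best_value(s: float, remaining: int) -> float:
        if remaining == 0:
            return score(s)
        return max(best_value(step(s, a), remaining - 1) for a in actions)

    return max(actions, key=lambda a: best_value(step(state, a), depth - 1))


# Toy world model (an assumption): heating moves the temperature up one
# degree per step, not heating moves it down one degree per step.
def toy_step(temperature: float, action: str) -> float:
    return temperature + (1.0 if action == "heat_on" else -1.0)


def toy_score(temperature: float) -> float:
    return -abs(temperature - SETPOINT)


ACTIONS = ("heat_on", "heat_off")
```

In this toy setting both policies pick the same action, e.g. `thermostat_policy(17.0)` and `dfs_policy(17.0, ACTIONS, toy_step, toy_score, depth=3)` both return `"heat_on"`. The difference is in how they get there: the second one explicitly enumerates futures and scores them, the first never represents an alternative at all.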
it’s totally possible to embed a few steps of gradient descent into the inference of a neural network, since gradient descent is differentiable
Agreed, and networks may even be learning something like this already! But in my ontology I wouldn’t call an algorithm that performs, say, 5 steps of gradient descent over a billion-parameter space and then outputs an action very “search-like”; the “search” part is generating a tiny fraction of the optimization pressure, relative to whatever process sets up the initial state and the error signal.
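As a rough sketch of the structure I have in mind (a one-dimensional toy, with `refine_action`, the squared-error signal, and the step size all being assumptions for illustration rather than anything a real network is known to do):

```python
from typing import Callable


def refine_action(
    initial_action: float,
    error_gradient: Callable[[float], float],
    steps: int = 5,
    learning_rate: float = 0.1,
) -> float:
    """Run a few gradient-descent steps on an action at inference time.

    The initial action and the error signal are supplied from outside;
    this loop only nudges the action a handful of times.
    """
    action = initial_action
    for _ in range(steps):
        action -= learning_rate * error_gradient(action)
    return action


TARGET = 1.0  # hypothetical optimum of the toy error signal


def squared_error_gradient(action: float) -> float:
    # Gradient of (action - TARGET) ** 2 with respect to action.
    return 2.0 * (action - TARGET)


from_poor_guess = refine_action(0.0, squared_error_gradient)  # about 0.672
from_good_guess = refine_action(0.9, squared_error_gradient)  # about 0.967
```

With these numbers each step shrinks the remaining error by a factor of 0.8, so five steps leave roughly a third of whatever error the initial guess had. Where the output ends up is mostly decided by the starting point and by which error signal gets handed to the loop, which is the sense in which the inner loop contributes only a small share of the optimization.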
Maybe this is just semantics, because at high levels of capability search and control are not fundamentally different (this is what you’re pointing to with “much more efficient search”: an infinitely efficient search is just optimal control, since you never even consider suboptimal actions!). But it does seem that, for a fixed level of capability, search is somehow more brittle and more likely to misgeneralize catastrophically.