My biggest objection to this definition is that it inherently requires time. At a bare minimum, there needs to be an “initial state” and a “final state” within the same state space, so we can talk about the system going from outside the target set to inside the target set.
One class of cases which definitely seem like optimization but do not satisfy this property at all: one-shot non-iterative optimization. For instance, I could write a convex function optimizer which works by symbolically differentiating the objective function and then algebraically solving for a point at which the gradient is zero.
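To make the example concrete, here's a minimal sketch in Python using sympy; the particular convex objective is just an arbitrary illustration, not anything from the post:

```python
# A minimal sketch of a one-shot, non-iterative convex optimizer:
# symbolically differentiate the objective, then solve gradient = 0
# algebraically. The particular objective is just an illustration.
import sympy as sp

x = sp.symbols('x')
objective = 3 * (x - 2)**2 + 5        # a convex function of x

gradient = sp.diff(objective, x)      # symbolic differentiation: 6*x - 12
solution = sp.solve(gradient, x)      # algebraic solve, no iteration
print(solution)                       # [2]
```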
Is there an argument that I should not consider this to be an optimizer?
My biggest objection to this definition is that it inherently requires time
Fascinating—but why is this an objection? Is it just the inelegance of not being able to look at a single time slice and answer the question of whether optimization is happening?
One class of cases which definitely seem like optimization but do not satisfy this property at all: one-shot non-iterative optimization.
Yes this is a fascinating case! I’d like to write a whole post about it. Here are my thoughts:
First, just as a fun fact, note that it's actually extremely rare to see any non-iterative optimization in practical usage. When we solve linear equations, we could use Gaussian elimination, but it's so unstable that in practice we most likely use the SVD, which is iterative. When we solve a system of polynomial equations, we could use something like a Gröbner basis or the resultant, but it's so unstable that in practice we use something like a companion matrix method, which comes down to an eigenvalue decomposition, which is again iterative.
Consider finding the roots of a simple quadratic equation (i.e. solving a cubic optimization problem). We can use the quadratic formula to do this. But ultimately this comes down to computing a square root, which is typically (though not necessarily) done with an iterative method.
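To illustrate, here's a minimal sketch of Heron's method (Newton's method on y^2 = a), the kind of iteration that typically sits underneath the square root in the quadratic formula; the helper names are just for illustration:

```python
# Heron's method for sqrt(a): the iterative optimizer typically hiding
# underneath the quadratic formula's square root.
def heron_sqrt(a, tol=1e-12):
    y = max(a, 1.0)                  # any positive starting guess works
    while abs(y * y - a) > tol:
        y = 0.5 * (y + a / y)        # each step shrinks the error
    return y

# Roots of x**2 + b*x + c = 0, with the square root computed iteratively.
def quadratic_roots(b, c):
    s = heron_sqrt(b * b - 4 * c)    # assumes a non-negative discriminant
    return (-b + s) / 2, (-b - s) / 2

print(quadratic_roots(-3, 2))        # (2.0, 1.0)
```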
That these methods (for solving linear systems, polynomial systems, and quadratic equations) have at their heart an iterative optimization algorithm is not accidental. The iterative methods involved are not some small or sideline part of what’s going on. In fact when you solve a system of polynomial equations using a companion matrix, you spend a lot of energy rearranging the system into a form where it can be solved via an eigenvalue decomposition, and then the eigenvalue decomposition itself is very much operating on the full problem. It’s not some unimportant side operation. I find this fascinating.
Nevertheless, it is possible to solve linear systems, polynomial systems, etc. with non-iterative methods.
These methods are definitely considered “optimization” by any normal use of that term. So in this way my definition doesn’t quite line up with the common language use of the word “optimization”.
But these non-iterative methods actually do not have the core property that I described in the square-root-of-two example. If I reach in and flip a bit while a Gaussian elimination is running, the algorithm does not in any sense recover. Since the algorithm is just performing a linear sequence of steps, the error just grows and grows as the computation unfolds. This is the opposite of what happens if I reach in and flip a bit while an SVD is being computed: in that case the error will be driven back to zero by the iterative optimization algorithm.
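Here's a toy version of that experiment for the square-root case; perturbing the intermediate state between iterations is a stand-in for literally flipping a bit in hardware:

```python
# Toy version of the bit-flip experiment: corrupt the intermediate state of
# an iterative square-root computation and watch the error get driven back
# toward zero anyway. A direct method has no analogous restoring force.
def sqrt2_with_corruption(steps=50, corrupt_at=20):
    y = 1.0
    for i in range(steps):
        if i == corrupt_at:
            y += 0.7                 # "reach in and flip a bit"
        y = 0.5 * (y + 2.0 / y)      # Heron iteration for sqrt(2)
    return y

print(sqrt2_with_corruption())       # ~1.4142135623730951, despite corruption
```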
You might say that my focus on error-correction simply doesn’t capture the common language use of the term optimization, as demonstrated by the fact that non-iterative optimization algorithms do not have this error-correcting property. You would be correct!
But perhaps my real response is that fundamentally I’m interested in these processes that somewhat mysteriously drive the state of the world towards a target configuration, and keep doing so despite perturbations. I think these are central to what AI and agency are. The term “optimizing system” might not be quite right, but it seems close enough to be compelling.
Thanks for the question—I clarified my own thinking while writing up this response.
Another big thing to note in examples like e.g. iteratively computing a square root for the quadratic formula or iteratively computing eigenvalues to solve a matrix: the optimization problems we’re solving are subproblems, not the original full problem. These crucially differ from most of the examples in the OP in that the system’s objective function (in your sense) does not match the objective function (in the usual intuitive sense). They’re iteratively optimizing a subproblem’s objective, not the “full” problem’s objective.
That’s potentially an issue for thinking about e.g. AI as an optimizer: if it’s using iterative optimization on subproblems, but using those results to perform some higher-level optimization in a non-iterative manner, then aligning the subproblem-optimizers may not be synonymous with aligning the full AI. Indeed, I think a lot of reasoning works very much like this: we decompose a high-dimensional problem into coupled low-dimensional subproblems (i.e. “gears”), then apply iterative optimizers to the subproblems. That’s exactly how eigenvalue algorithms work, for instance: we decompose the full problem into a series of optimization subproblems in narrower and narrower subspaces, while the “high-level” part of the algorithm (i.e. outside the subproblems) doesn’t look like iterative optimization.
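To illustrate that last point, here's a minimal sketch of power iteration, the simplest such subspace routine; the iterative part climbs the Rayleigh quotient, which is the subproblem's objective, not the objective of whatever full problem the eigendecomposition is serving:

```python
import numpy as np

# Power iteration: it iteratively climbs the Rayleigh quotient toward the
# dominant eigenvector. That quotient is the subproblem's objective, not
# the objective of the full problem the eigendecomposition is used for.
def power_iteration(A, steps=200):
    v = np.ones(A.shape[0])
    for _ in range(steps):
        v = A @ v
        v = v / np.linalg.norm(v)    # renormalize each step
    return v, v @ A @ v              # eigenvector estimate, Rayleigh quotient

A = np.array([[2.0, 1.0], [1.0, 3.0]])
vec, val = power_iteration(A)
print(val)                           # ~3.618, the dominant eigenvalue
```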
Fascinating—but why is this an objection? Is it just the inelegance of not being able to look at a single time slice and answer the question of whether optimization is happening?
No, the issue is that the usual definition of an optimization problem (e.g. max_x f(x)) has no built-in notion of time, and the intuitive notion of optimization (e.g. “the system makes Y big”) has no built-in notion of time (or at least linear time). It’s this really fundamental thing that isn’t present in the “original problem”, so to speak; it would be very surprising and interesting if time had to be involved when it’s not present from the start.
If I specifically try to brainstorm things-which-look-like-optimization-but-don’t-involve-objective-improvement-over-time, then it’s not hard to come up with examples:
Rather than a function-value “improving” along linear time, I could think about a function value improving along some tree or DAG: e.g. in a heap data structure, we have a tree where the “function value” always “improves” as we move from any leaf toward the root. There, any path from a leaf to the root could be considered “time” (but the whole set of nodes at the “same level” can’t be considered a time-slice, because we don’t have a meaningful way to compare whole sets of values; we could invent one, but it wouldn’t actually reflect the tree structure). There’s a quick sketch of this one after the list.
The example from the earlier comment: a one-shot non-iterative optimizer
A distributed optimizer: the system fans out, tests a whole bunch of possible choices in parallel, then selects the best of those.
Various flavors of constraint propagation, e.g. the simplex algorithm (and markets more generally)
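Here's the quick sketch promised in the first item: a min-heap in which the "objective" improves along every leaf-to-root path, with no linear notion of time anywhere in the structure:

```python
import heapq

# A min-heap: the "objective" improves (decreases) along every leaf-to-root
# path, even though the structure has no linear notion of time.
values = [9, 4, 7, 1, 8, 3, 6]
heapq.heapify(values)

# Verify: every node is <= its children, so the value improves moving from
# any child toward the root.
print(all(values[(i - 1) // 2] <= values[i] for i in range(1, len(values))))
```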
I think this is a very important distinction. I prefer to use “maximizer” for “timelessly” finding the highest value of an objective function, and reserve “optimizer” for the kind of stepwise improvement discussed in this post. As I use the terms, to maximize something is to find the state with the highest value, but to optimize it is to take an initial state and find a new state with a higher value. I recognize that “optimize” and “optimizer” are sometimes used the way you’re saying, as basically synonymous with “maximize” / “maximizer”. I could retreat to calling the inherently temporal thing I’m talking about an “improver” (or an “improvement process” if I don’t want to reify it), but that actually seems less likely to be quickly understood, and I don’t think it’s all that useful for “optimize” and “maximize” to mean exactly the same thing.
(There is a subset of optimizers, as I use the term (and as this post uses it, though I think the value should be graded rather than binary), that reach the maximum in the limit, and a subset of those that even reach the maximum in a finite number of steps. But optimizers that e.g. get stuck in local maxima aren’t IMO thereby not actually optimizers, even though they aren’t maximizers in any useful sense.)
I think this is covered in my view of optimization via selection, where “direct solution” is the third option. Any one-shot optimizer is implicitly relying completely on an internal model for decision making, rather than iterating, as I explain there. I think that is compatible with the model here, but it needs to be extended slightly to cover what I was trying to say there.
This model explicitly requires that you deal only with physical processes, so your convex function solver would require time to get from the starting state to the end state. If it happens non-iteratively, then it would cease to be an optimizing system after it has finished computing the solution, since there is no longer a target configuration.
I’m not sure what you’re trying to say here. What’s the state space (in which both the start and end state of the optimizer live), what’s the basin of attraction (i.e. set of allowed initial conditions), and what’s the target region within the state space? And remember, the target region needs to be a subset of the allowed initial conditions.
The end state is the solution to the convex function being stored in some physical registers. The initial state is those registers containing arbitrary data to be overwritten. It’s not particularly interesting as optimization problems go (not a very large basin of attraction), but it fulfills the basic criteria.
The unique thing about your example is that it solves once and then it is done (relative to the examples in the post), so it ceases to be an optimizing system once it finishes computing the solution to your convex function.
With a slight modification, you could repeat this algorithm in a loop so that it constantly recalculates a new function. Now the initial state can be some value in the result and input registers, and the target region is the set of input equations and appropriate solutions in the output registers. This widens the basin of attraction to both the input and output registers rather than just the output.
Ok, two problems with this:

There’s no reason why that target set would be smaller than the basin of attraction. Given one such optimization problem, there are no obvious perturbations we could make which would leave the result in the target region.

The target region is not a subset of the basin of attraction. The system doesn’t evolve from a larger region to a smaller subset (as in the Venn-diagram visuals in the OP); it just evolves from one set to another.

The first problem explicitly violates the OP’s definition of an optimizer, and the second problem violates one of the unspoken assumptions present in all of the OP’s examples.
I don’t believe that either of these points is true. In your original example, there is one correct solution for any convex function. I will assume there is a single hard-coded function for the following, but it can be extended to work for an arbitrary function.
The output register having the correct solution is the target set.
The output register having any state is the basin of attraction.
Clearly any specific number (or rather, the singleton containing that number) is a subset of all numbers, so the target is a subset of the basin. And further, because “all numbers” has more than one element, the target set is smaller than the basin.
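To make that concrete, here's a toy version of the argument with a hypothetical hard-coded objective f(x) = (x - 5)^2, whose minimizer is known in closed form:

```python
# Toy version of the argument, with a hypothetical hard-coded convex
# objective f(x) = (x - 5)**2 whose minimizer is known in closed form.
# Basin of attraction: the output register in any initial state.
# Target set: the singleton {5.0}.
output_register = 123.456   # arbitrary initial contents

output_register = 5.0       # one-shot: overwrite with the solution

print(output_register == 5.0)  # True: evolved from basin into target
```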
This argument applies to literally any deterministic program with nonempty output. Are you saying that every program is an optimizer?

Pretty much, yes, according to the definition given. Like I said, not a particularly interesting optimization, but an optimization nonetheless.
To extend this: the basin of attraction here is not any smaller than that of an iterative process acting on a single register (and if you loop the program, then the time horizon is the same). In both cases your basin is anything in that register and the target state is one particular number in that register. As far as I can tell, the definition doesn’t have any way of saying that one is “more of an optimizer” than the other. If anything, the fixed output is more optimized because it arrives more quickly.
Ok, well, it seems like the one-shot non-iterative optimizer is an optimizer in a MUCH stronger sense than a random program, and I’d still expect a definition of optimization to say something about the sense in which that holds.