Epistemic status: Ramblings

I don’t know how much you can really generalise these lessons. For instance, when you say:
How much slower is e-coli optimization compared to gradient descent? What’s the cost of experimenting with random directions, rather than going in the “best” direction? Well, imagine an inclined plane in n dimensions. There’s exactly one “downhill” direction (the gradient). The n-1 directions perpendicular to the gradient don’t go downhill at all; they’re all just flat. If we take a one-unit step along each of these directions, one after another, then we’ll take n steps, but only 1 step will be downhill. In other words, only ~O(1/n) of our travel-effort is useful; the rest is wasted.
In a two-dimensional space, that means ~50% of effort is wasted. In three dimensions, 70%. In a thousand-dimensional space, ~99.9% of effort is wasted.
This is true, but if I go in a spherically random direction and my step size is small enough, ~50% of my efforts will be rewarded, regardless of the dimensionality of the space.
How best to go about optimisation depends on the cost of carrying out optimisation, the structure of the landscape, and the relationship between the utility and the quality of the final solution.
Blind guess and check is sometimes a perfectly valid method when you don’t need to find a very good solution, and you can’t make useful assumptions about the structure of the set, even if the cardinality of the possible solution set is massive.
I often don’t even think “optimisation” and “dimensionality” are really natural ways of thinking about solving many real-world engineering problems. There’s definitely an optimisation component to the engineering process, but it’s often not central. Depending on circumstances, it can make more sense to think of engineering as “satisficing” vs “optimising”. Essentially, you’re trying to find a solution instead of the best solution, and the process used to solve the problem is going to look vastly different in one case vs another. This is similar to the notions of “goal directed agency” vs “utility maximisation”.
In many cases when engineering, you’re taking a problem and coming up with possible high level breakdowns of the problem. In the example of bridges, this could be deciding whether to use a cantilever bridge or a suspension bridge or something else entirely. From there, you solve the related sub-problems that have been created by the breakdown, until you’ve sufficiently fleshed out a solution that looks actionable.
The way you go about this depends on your optimisation budget. In increasing order of cost:
- You might go with the first solution that looks like it will work.
- You’ll recursively do a sort of heuristic optimisation at each level, decide on a solution, and move to the next level.
- You’ll flesh out multiple different high level solutions and compare them.
. . .
- You search the entire space of possible solutions.
This is where the whole “slack” thing and getting stuck in local optima comes back, even in high-dimensional spaces. In many cases, you’re “committed” to a subset of the solution space. This could be because you’ve decided to design a cantilever bridge instead of a suspension bridge. It could also be because you need to follow a design you know will work, and X is the only design your predecessors have implemented IRL that has been sufficiently vetted. (This is especially common in aircraft design, as the margins of error are very tight.) It could even be because your comrades have all opted to go with a certain component, and so that component benefits from economies of scale and becomes the best choice, even if another component would be objectively better were it to be mass produced. (I’ll leave it as an exercise to the reader to think of analogous problems in software engineering.)
In all cases, you are forced to optimise within a subset of the search space. If you have the necessary slack, you can afford to explore the other parts of the search space to find better optima.
This is true, but if I go in a spherically random direction and my step size is small enough, ~50% of my efforts will be rewarded, regardless of the dimensionality of the space.
50% of your steps will go in a positive rather than negative direction, but that completely ignores how much positive progress is made. The problems of high dimensionality are entirely problems of efficiency; “how much improvement?” is the key question.
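To put numbers on that, here’s a minimal simulation (the linear objective f(x) = x[0], the unit step size, and the trial counts are all illustrative assumptions of mine): roughly half of spherically random steps make some progress in every dimension, but the average gain per accepted step shrinks like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_step_progress(n, trials=10_000):
    """Unit steps in uniformly random directions on the linear slope
    f(x) = x[0], which has a single downhill direction."""
    steps = rng.normal(size=(trials, n))
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)  # unit length
    progress = -steps[:, 0]                 # decrease in f; >0 = downhill
    frac_downhill = (progress > 0).mean()   # ~0.5 in every dimension
    mean_gain = progress[progress > 0].mean()  # gain per accepted step
    return frac_downhill, mean_gain

for n in [2, 10, 100, 1000]:
    frac, gain = random_step_progress(n)
    print(f"n={n:5d}  downhill fraction={frac:.3f}  "
          f"mean gain={gain:.4f}  (1/sqrt(n)={n**-0.5:.4f})")
```

The downhill fraction stays near 0.5 throughout, while the gain column tracks 1/sqrt(n): the steps keep “succeeding”, they just accomplish less and less.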
Depending on circumstances, it can make more sense to think of engineering as “satisficing” vs “optimising”.
Satisficing is a strict subset of optimizing, and most of the OP applies just as well to satisficing. The processes used in practice are almost identical, other than the stopping condition: satisficers stop at the first acceptable solution, while optimizers are usually open-ended. The search process itself is usually the same.
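As a minimal sketch of that point (the objective, acceptability threshold, and evaluation budget below are made up for illustration), the satisficer and the optimizer share the same loop and differ only in the stop test:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_search(f, x0, stop, step=0.1, max_evals=10_000):
    """Generic accept-if-better search; only `stop` differs between
    a satisficer and an optimizer."""
    x = np.array(x0, dtype=float)
    fx = f(x)
    evals = 0
    while evals < max_evals and not stop(fx):
        candidate = x + step * rng.normal(size=x.shape)
        fc = f(candidate)
        evals += 1
        if fc < fx:
            x, fx = candidate, fc
    return fx, evals

f = lambda x: float(np.sum(x ** 2))  # toy objective
x0 = np.ones(10)                     # start at f = 10

# Satisficer: stop at the first acceptable solution (threshold made up).
f_sat, e_sat = local_search(f, x0, stop=lambda fx: fx < 1.0)
# Optimizer: open-ended; only the evaluation budget stops it.
f_opt, e_opt = local_search(f, x0, stop=lambda fx: False)

print(f"satisficer: f = {f_sat:.3f} after {e_sat} evals")
print(f"optimizer:  f = {f_opt:.3f} after {e_opt} evals")
```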
Your comments on restrictions of the search space are good examples, but “slack” isn’t really the relevant concept here. Really, most of these examples are additional problem constraints, imposed by a group you’re working with. It’s not that slack would let you explore other solutions (other than the commitment-to-cantilever example) - it’s not like you could find better solutions by exploring more, with the same problem constraints in place. Rather, removing those constraints allows better solutions.
If instead of going one step in one of n directions, we go sqrt(1/n) forward or backward in each of the n directions (for a total step size of 1), we need an expected two tries to get sqrt(1/n) progress, for a total effort factor of O(1/sqrt(n)). (O is the technical term for ~ ^^)
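A sanity check of that arithmetic (my sketch, reusing the OP’s inclined plane f(x) = x[0]; on that slope only the sign along x[0] matters):

```python
import numpy as np

rng = np.random.default_rng(2)

def sign_step_stats(n, trials=100_000):
    """Random-sign steps of sqrt(1/n) per coordinate (total Euclidean
    length 1) on the inclined plane f(x) = x[0]."""
    sign0 = rng.choice([-1.0, 1.0], size=trials)  # sign along x[0]
    progress = -sign0 * np.sqrt(1.0 / n)          # decrease in f
    tries_per_success = 1.0 / (progress > 0).mean()   # expect ~2
    gain_per_success = progress[progress > 0].mean()  # expect sqrt(1/n)
    return tries_per_success, gain_per_success

for n in [4, 100, 10_000]:
    tries, gain = sign_step_stats(n)
    print(f"n={n:6d}  expected tries={tries:.2f}  "
          f"gain per success={gain:.4f}  sqrt(1/n)={n**-0.5:.4f}")
```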
Yup, this is similar to Unnamed’s comment. It works in problems where the “cost” of trying a change is determined by Euclidean distance, but once we go beyond that, it falls apart.
Darn it, missed that comment. But how does Euclidean distance fail? I’m imagining the dimensions as the weights of a neural net, and e-coli optimization being used because we don’t have access to a gradient. The common metric I see that would have worse high-dimensional behavior is Manhattan distance. Is it that neighborhoods of low Manhattan distance tend to have more predictable/homogeneous behavior than those of low Euclidean distance?
It seems like the situation with bridges is roughly analogous to neural networks: the cost has nothing to do with how much you change the design (distance) but instead is proportional to how many times you change the design. Evaluating any change, big or small, requires building a bridge (or more likely, simulating a bridge). So you can’t just take a tiny step in each of n directions, because it would still have n times the cost of taking a step in one direction. E. coli is actually pretty unusual in that the evaluation is nearly free, but the change in position is expensive.
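A rough sketch of that cost accounting (same toy slope f(x) = x[0] as above; assuming one evaluation per coordinate probe, which is my simplification): once cost is counted per evaluation rather than per unit of distance, probing every coordinate loses badly to trying whole random directions.

```python
import numpy as np

rng = np.random.default_rng(4)

def progress_per_eval(n, trials=10_000):
    """Compare progress per *evaluation* on f(x) = x[0] when each
    trial costs one evaluation (bridge/NN style)."""
    # (a) Unit step in a random direction; accept only if downhill.
    u = rng.normal(size=(trials, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    gain_a = np.maximum(-u[:, 0], 0.0)   # rejected steps gain nothing
    per_eval_a = gain_a.mean()           # 1 evaluation per trial
    # (b) Probe each coordinate by sqrt(1/n), keep the downhill signs:
    # sqrt(1/n) total progress, but it costs n evaluations.
    per_eval_b = np.sqrt(1 / n) / n
    return per_eval_a, per_eval_b

for n in [10, 100, 1000]:
    a, b = progress_per_eval(n)
    print(f"n={n:5d}  random direction: {a:.5f}/eval  "
          f"coordinate probes: {b:.5f}/eval")
```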
I don’t have a fully-fleshed-out answer, but here’s a thought experiment: change the units of all the inputs. Multiply one of them by 12, another by 0.01, another by 10^8, etc.
In a system without any sort of symmetry or natural Euclidean distance metric over the inputs, it’s like that transformation has already happened before we started. All the inputs have completely different and arbitrary units. If we take a “one unit” step, in actual practice that will be massive along some of the inputs and inconsequential along others—whereas our sqrt(1/n) strategy relies on the step being effectively similar along each dimension.
The right way to talk about this is probably condition number of local curvature, but I haven’t tried to properly formalize it.
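Here’s a toy version of that thought experiment (the quadratic bowl, the scales drawn from 10^-4 to 10^4, and the step size are all my own choices): the same fixed-step random search that steadily descends a well-scaled bowl gets stuck orders of magnitude above the optimum once each input has its own arbitrary unit.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_search(scales, iters=2000, step_size=0.1):
    """Fixed-step spherical random search on f(x) = sum((s_i * x_i)^2),
    where `scales` plays the role of arbitrary per-input units."""
    s = np.asarray(scales, dtype=float)
    x = np.ones_like(s)
    fx = float(np.sum((s * x) ** 2))
    for _ in range(iters):
        d = rng.normal(size=s.shape)
        d *= step_size / np.linalg.norm(d)   # "one unit" step
        fc = float(np.sum((s * (x + d)) ** 2))
        if fc < fx:
            x, fx = x + d, fc
    return fx

n = 20
print(f"well-scaled:  final f = {random_search(np.ones(n)):.2e}")
bad_units = 10.0 ** rng.uniform(-4, 4, size=n)  # arbitrary unit per input
print(f"badly scaled: final f = {random_search(bad_units):.2e}")
```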
In that thought experiment, Euclidean distance doesn’t work because different dimensions have different units. To fix that, you could move to the log scale. Or is the transformation actually more complicated than multiplication?

Yeah, then we’d just use a more complicated single-variable transformation in the thought experiment.
You can’t take logarithms of non-positive numbers. Even if you know some parameter is non-positive, you still have a free choice of scale, so the problem of comparing scales of different parameters does not go away.
You can, actually. ln(5cm)=ln(5)+ln(cm), and since we measure distances, the ln(cm) cancels out. The same way, ln(-5)=ln(5)+ln(-1). ln(-1) happens to be pi*i, since e^(pi*i) is −1.

If you allow complex numbers, comparison “greater than/less than” breaks down.
Since we measure distances, the ln(-1) cancels out. Also, we are comparing how well each setting of the parameters performs; for that, whether the parameters are complex-valued doesn’t matter.
Erm, I think you’re getting mixed up between comparing parameters and comparing the results of applying some function to parameters. These are not the same, and it’s the latter that become incomparable.
(Also, would your algorithm derive that ln(4+3i)=ln(5) since |4+3i|=|5|? I really don’t expect the “since we measure distances” trick to work, but if it does work, it should also work on this example.)
As far as I saw, you were getting mixed up on that. We never compare the parameter-vectors for being greater than/less than each other; they aren’t ordered.
(No, if some parameter started out with such values as 4+3i or 5, the ln transformation would not equate them. But multiplying both by e^0.01 would add 0.01 to both logarithms, regardless of previous units.)
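For what it’s worth, the arithmetic in this subthread checks out mechanically in Python’s cmath (whether it rescues the log-scale trick is exactly what’s in dispute):

```python
import cmath

# ln(-5) = ln(5) + ln(-1), and ln(-1) = pi*i, since e^(pi*i) = -1.
print(cmath.log(-5))                 # (1.609...+3.141...j)
print(cmath.log(5) + cmath.pi * 1j)  # same value

# Rescaling a parameter by e^0.01 adds 0.01 to its logarithm,
# regardless of that parameter's previous units or sign.
for p in (-5.0, 5.0, 4 + 3j):
    shift = cmath.log(p * cmath.exp(0.01)) - cmath.log(p)
    print(p, shift)                  # ~(0.01+0j) every time

# But ln does not equate 4+3i with 5, even though |4+3i| == |5| == 5:
print(cmath.log(4 + 3j), cmath.log(5))
```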
I think at this point we should just ask @johnswentworth which one of us understood him correctly. As far as I see, we measure a distance between vectors, not between individual parameters, and that’s why this thing fails.