For the “Crappy Optimizer Theorem”, I don’t understand why condition 4, that if f ≤ 0 then Q(s)(f) ≤ 0, isn’t just a tautology[1]. Surely if ∀x ∈ X, f(x) ≤ c, then no matter what s : (X→R)→X is being used: since Q(s)(f) := f(s(f)), letting x = s(f) gives f(x) ≤ c, and so Q(s)(f) = f(s(f)) = f(x) ≤ c (and condition 4 is just the case c = 0).
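(To make the “no matter what s” point concrete, here is a minimal sketch in Python; the finite X, the particular selection rules, and the particular f are all just made-up illustrations, not anything from the post.)

```python
# Minimal sketch (invented example): X is a small finite set, and a "selection
# rule" s : (X -> R) -> X just picks some point of X, possibly badly.
# Whatever s does, Q(s)(f) = f(s(f)) is a value that f actually takes, so it is
# bounded by any bound on f; in particular Q(s)(f) <= 0 whenever f <= 0.

X = [-2, -1, 0, 1, 2]

def Q(s, f):
    return f(s(f))

# Hypothetical selection rules, from sensible to adversarial.
s_argmax = lambda f: max(X, key=f)   # a "good" optimizer
s_argmin = lambda f: min(X, key=f)   # a pessimizer
s_first  = lambda f: X[0]            # ignores f entirely

f = lambda x: -(x - 1) ** 2 - 0.5    # f <= 0 everywhere

for s in (s_argmax, s_argmin, s_first):
    assert Q(s, f) <= 0              # condition 4 holds for every s
```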
I guess if the 4 conditions are instead seen as conditions on a general function F : (X→R)→R (they are written for F = Q(s)), then condition 4 is no longer automatic (for instance, F(f) := (sup_x f(x)) + 1 can be positive even when f ≤ 0), and it is only when specifying that F = Q(s) for some s that condition 4 becomes automatic?
______________
[start of section spitballing stuff based on the crappy optimizer theorem]
Spitball 1:
What if instead of s : (X→R)→X, we had s : (X→R)→ΔX? Would we still get the results of the crappy optimizer theorem?
If s(f) is now a distribution over X, then I suppose instead of writing Q(s)(f) = f(s(f)) we should write Q(s)(f) = s(f)(f) (i.e., I suppose, the expected value of f under the distribution s(f)), and, in this case, the first two conditions and the fourth seem just as reasonable. The third condition… seems like it should also be satisfied?
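(A sketch of what I have in mind, assuming “s(f)(f)” is read as the expectation of f under the distribution s(f); the softmax rule and everything else here is just an invented illustration.)

```python
import math

# Sketch of the Spitball 1 setup: s(f) is a probability distribution over X
# (here a dict mapping points to probabilities), and Q(s)(f) = s(f)(f) is read
# as the expectation of f under that distribution.  The condition-4 argument
# still goes through: an average of nonpositive values is nonpositive.

X = [-2, -1, 0, 1, 2]

def Q(s, f):
    dist = s(f)
    return sum(p * f(x) for x, p in dist.items())   # E_{x ~ s(f)}[f(x)]

# A made-up stochastic selection rule: softmax over the values of f.
def s_softmax(f):
    weights = {x: math.exp(f(x)) for x in X}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

f = lambda x: -(x - 1) ** 2 - 0.5                   # f <= 0 everywhere

assert Q(s_softmax, f) <= 0                         # condition 4 still holds
```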
Spitball 2:
While I wouldn’t expect the 4 conditions to be exactly satisfied by, e.g., gradient descent, I would kind of expect basically any reasonable deterministic optimization process to at least “almost” satisfy them? (Like, maybe gradient-descent-in-practice would fail condition 1 due to floating point errors, but not too badly in reasonable cases.)
Do you think that a modification of this theorem for functions Q(s) which only approximately satisfy conditions 1-3 would be reasonably achievable?
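(For concreteness, here is roughly what I mean by treating gradient descent as an s : (X→R)→X; the step size, step count, and numerical gradient are arbitrary choices of mine, and this only illustrates the framing, not whether conditions 1-3 approximately hold.)

```python
# Sketch of gradient descent as a deterministic selection rule s : (X -> R) -> X
# with X = R, using a crude central-difference gradient.  (Written as ascent on
# f; whether the post's optimizers maximize or minimize doesn't matter for the
# framing, only that s deterministically returns a point.)

def s_gd(f, x0=0.0, lr=0.1, steps=200, h=1e-5):
    x = x0
    for _ in range(steps):
        grad = (f(x + h) - f(x - h)) / (2 * h)   # numerical derivative of f at x
        x = x + lr * grad                        # step uphill on f
    return x

def Q(s, f):
    return f(s(f))

f = lambda x: -(x - 1.0) ** 2 - 0.5              # f <= 0, maximized at x = 1

print(Q(s_gd, f))   # roughly -0.5 (up to floating point), and certainly <= 0
```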
______________
[1] I might be stretching the meaning of “tautology” here. I mean something provable in our usual background mathematics, so that adding it as an explicit hypothesis to a theorem doesn’t let us show anything that we couldn’t show without it.