To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.
This is pretty confusing for me: If I’m doing gradient descent, how many times am I halving the entire search space? (although I appreciate that it’s hard to come up with a better measure of optimisation)
You could imagine that, if you use gradient descent to reach a loss value of $L$, then amount of optimization applied in bits $= -\log \frac{|\{\theta \in \mathbb{R}^d : L(\theta) \le L\}|}{|\mathbb{R}^d|}$. (Yes, I know I shouldn’t be taking sizes of continuous vector spaces, but you know what I mean.)
I think there is a typo in your formula, because the number of bits you get is negative. Going back to Yudkowsky’s post, I think the correct formula (using your approximations of sizes) is $\log \frac{|\mathbb{R}^d|}{|\{\theta \in \mathbb{R}^d \mid L(\theta) \le L\}|}$, or $-\log \frac{|\{\theta \in \mathbb{R}^d \mid L(\theta) \le L\}|}{|\mathbb{R}^d|}$ to be closer to the entropy notation.
Yeah, you’re right, fixed.
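To make the formula above concrete, here is a minimal Monte Carlo sketch in Python (not from the original discussion): it assumes a toy quadratic loss and replaces $\mathbb{R}^d$ with a bounded box of parameters, since uniform sampling over all of $\mathbb{R}^d$ is not possible.

```python
import numpy as np

# Minimal sketch (toy setup, assumed for illustration): estimate the "bits of
# optimization" applied by gradient descent as
#   bits = -log2( fraction of parameter space with loss <= the loss GD reached ).
# R^d has infinite volume, so a bounded box of plausible parameters stands in for it.

rng = np.random.default_rng(0)
d = 4

def loss(theta):
    # Toy quadratic loss; any loss function could be substituted here.
    return np.sum(theta ** 2, axis=-1)

# Plain gradient descent from a random start.
theta = rng.normal(size=d)
for _ in range(10):
    theta -= 0.05 * (2 * theta)   # gradient of the quadratic loss is 2 * theta
L_star = loss(theta)              # loss value L reached by gradient descent

# Uniformly sample the box [-3, 3]^d and measure how much of it does at least as well.
samples = rng.uniform(-3.0, 3.0, size=(1_000_000, d))
fraction = np.mean(loss(samples) <= L_star)

bits = -np.log2(fraction) if fraction > 0 else float("inf")
print(f"loss reached: {L_star:.3g}, fraction at least as good: {fraction:.2e}, bits ~ {bits:.1f}")
```

The estimate depends heavily on the choice of box and on how long gradient descent runs, which is part of why this measure of optimization is hard to pin down in practice.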