To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.
This is pretty confusing for me: If I’m doing gradient descent, how many times am I halving the entire search space? (although I appreciate that it’s hard to come up with a better measure of optimisation)
You could imagine that, if you use gradient descent to reach a loss value of $L$, then amount of optimization applied in bits $= -\log \frac{|\{\theta \in \mathbb{R}^d : L(\theta) \le L\}|}{|\mathbb{R}^d|}$. (Yes, I know I shouldn’t be taking sizes of continuous vector spaces, but you know what I mean.)
I think there is a typo in your formula, because the number of bits you get is negative. Going back to Yudkowsky’s post, I think the correct formula (using your approximations of sizes) is $\log \frac{|\mathbb{R}^d|}{|\{\theta \in \mathbb{R}^d \mid L(\theta) \le L\}|}$, or $-\log \frac{|\{\theta \in \mathbb{R}^d \mid L(\theta) \le L\}|}{|\mathbb{R}^d|}$ to be closer to the entropy notation.
Yeah, you’re right, fixed.
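To make the formula above concrete, here is a minimal Monte Carlo sketch in Python (not from the original discussion): it assumes a toy quadratic loss and replaces $\mathbb{R}^d$ with a bounded box of parameters, since uniform sampling over all of $\mathbb{R}^d$ is not possible.

```python
import numpy as np

# Minimal sketch (toy setup, assumed for illustration): estimate the "bits of
# optimization" applied by gradient descent as
#   bits = -log2( fraction of parameter space with loss <= the loss GD reached ).
# R^d has infinite volume, so a bounded box of plausible parameters stands in for it.

rng = np.random.default_rng(0)
d = 4

def loss(theta):
    # Toy quadratic loss; any loss function could be substituted here.
    return np.sum(theta ** 2, axis=-1)

# Plain gradient descent from a random start.
theta = rng.normal(size=d)
for _ in range(10):
    theta -= 0.05 * (2 * theta)   # gradient of the quadratic loss is 2 * theta
L_star = loss(theta)              # loss value L reached by gradient descent

# Uniformly sample the box [-3, 3]^d and measure how much of it does at least as well.
samples = rng.uniform(-3.0, 3.0, size=(1_000_000, d))
fraction = np.mean(loss(samples) <= L_star)

bits = -np.log2(fraction) if fraction > 0 else float("inf")
print(f"loss reached: {L_star:.3g}, fraction at least as good: {fraction:.2e}, bits ~ {bits:.1f}")
```

The estimate depends heavily on the choice of box and on how long gradient descent runs, which is part of why this measure of optimization is hard to pin down in practice.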