I disagree that mesa optimization requires explicit representation of values. Consider an RL-type system that (1) learns strategies that work well in its training data, and then (2) generalizes to new strategies that in some sense fit well or are parsimonious with respect to its existing strategies. Strategies need not be explicitly represented. Nonetheless, it’s possible for those initially learned strategies to implicitly bake in what we could call foundational goals or values, that the system never updates away from.
For another angle, consider that value-directed thought can be obfuscated. A single central value could be transformed into a cloud of interlocking heuristics that manage to implement essentially the same logic. (This might make it more difficult to generalize that value, but not impossible.) This is a common strategy in humans, in situations where they want to avoid been seen as holding certain values, but still reap the benefits of effectively acting according to those values.
I disagree that mesa optimization requires explicit representation of values. Consider an RL-type system that (1) learns strategies that work well in its training data, and then (2) generalizes to new strategies that in some sense fit well or are parsimonious with respect to its existing strategies. Strategies need not be explicitly represented. Nonetheless, it’s possible for those initially learned strategies to implicitly bake in what we could call foundational goals or values, that the system never updates away from.
For another angle, consider that value-directed thought can be obfuscated. A single central value could be transformed into a cloud of interlocking heuristics that manage to implement essentially the same logic. (This might make it more difficult to generalize that value, but not impossible.) This is a common strategy in humans, in situations where they want to avoid been seen as holding certain values, but still reap the benefits of effectively acting according to those values.