We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
Not sure if I disagree with the object-level assertion, but I think some important caveats are missing here. We also have to take into account the plausible paths through algorithm-space that SGD is likely to take, and that might change the form of the final compressed policy in non-intuitive ways.
Another compact policy is “a superintelligence with a messy slew of values that figured out the training context and maneuvered SGD into making it learn the reward function without internalizing it, and into compressing it while keeping its messy values static”, and I think that’s a probable-enough end-point.
It’s still likely that the “messy slew of values” won’t be that messy and will be near-perfect correlates of the reward, but for some (environment structure, reward) pairs, neither may hold. E.g., if the setup is such that strategic intelligence somehow develops well before the AI achieves optimal performance on the training set, then that intelligence will set in stone proxy objectives that aren’t good correlates of the reward.