The Legg-Hutter definition of intelligence is counterfactual (“if it had X goal, it would do a good job of achieving it”). It seems to me that the counterfactual definition isn’t necessary to capture the idea above. The LH definition also needs a measure over environments (including reward functions), and it’s not obvious how closely their proposed measure corresponds to things we’re interested in, while influentialness in the world we live in seems to correspond very closely.
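For concreteness, here is the LH measure as I recall it (a sketch; my notation may differ slightly from theirs):

\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi

where E is a class of computable environments, K(\mu) is the Kolmogorov complexity of \mu, and V_\mu^\pi is the expected cumulative reward the agent \pi achieves in \mu. The counterfactual sits in V_\mu^\pi (“how well would \pi do if it were pursuing \mu’s reward”), and the 2^{-K(\mu)} factor is exactly the measure over environments whose correspondence to what we care about is in question.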
The mesa-optimizer paper also stresses (not sure if correctly) that they’re not talking about input-output properties.
Influentialness in the real world definitely screens off training environment performance WRT impact in the real world, because that’s just what the definition of influentialness says. If training environment performance fully screened off real-world influentialness, that would mean the outcome of training is essentially deterministic, which is not obviously always going to be true. Even if it is, the kind of real-world influence any given training environment maps to will be somewhat random, which might ultimately yield a similar result.
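To spell out the screening-off claims (a minimal formalization; T, I, and X are my own labels): write T for training environment performance, I for real-world influentialness, and X for real-world impact. The first claim is

P(X \mid I, T) = P(X \mid I),

which holds by definition if influentialness just is the propensity to produce real-world impact. The second claim, that T fully screens off I, i.e. P(X \mid T, I) = P(X \mid T), would then force I to be (nearly) a deterministic function of T, since otherwise conditioning on I would still carry information about X beyond what T provides.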
When you say “we’ll reward intelligence”, you mean something close to what I mean when I say “the training environment and the real world might be highly compatible”.
I think an idea like Legg-Hutter intelligence could be useful for arguing that the environments are likely to be compatible.