To some extent this sounds like it’s already captured by the notion of intelligence as being able to achieve goals in a wide range of environments—mesa-optimizers will have some edge if they’re intelligent (or else why would they arise?). And this edge grows larger the more complicated stuff they’re expected to do.
Contrary to the middle of your post, I would expect the training environment to screen off the deployment environment: the influentialness of a future AI will come about because the training environment rewarded intelligence, not because influentialness in the deployment environment somehow reaches back, bypasses the training environment, and affects the AI.
The Legg-Hutter definition of intelligence is counterfactual (“if it had X goal, it would do a good job of achieving it”). It seems to me that the counterfactual definition isn’t necessary to capture the idea above. The LH definition also needs a measure over environments (including reward functions), and it’s not obvious how closely their proposed measure corresponds to things we’re interested in, while influentialness in the world we live in seems to correspond very closely.
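For concreteness, the definition I have in mind is roughly Legg and Hutter's universal intelligence measure (writing it from memory, so treat the details as a sketch):

$$\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi$$

where $E$ is a class of computable environments (each with its reward function built in), $K(\mu)$ is the Kolmogorov complexity of $\mu$, and $V_\mu^\pi$ is the expected cumulative reward agent $\pi$ obtains in $\mu$. The $2^{-K(\mu)}$ term is the proposed measure over environments I mean: a simplicity prior, whose relation to the environments we actually care about isn't obvious.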
The mesa-optimizer paper also stresses (not sure if correctly) that they’re not talking about input-output properties.
Influentialness in the real world definitely screens off training environment performance with respect to impact in the real world, because that's just what influentialness is defined to be. If training environment performance fully screens off real-world influentialness, then the outcome of training is essentially deterministic, which is not obviously always going to be true. Even if it is, the kind of real-world influence any given training environment maps to will be a bit random, which might ultimately yield a similar result.
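To be explicit about how I'm using "screens off" (just the standard conditional-independence sense, with made-up variable names: $T$ for training environment performance, $I$ for influentialness in the real world, $D$ for impact in the real world): my claim is

$$P(D \mid I, T) = P(D \mid I),$$

whereas the claim in your comment is roughly that the deployment environment $E$ adds nothing once you condition on training, i.e. $P(I \mid T, E) \approx P(I \mid T)$.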
When you say “we’ll reward intelligence”, your meaning is similar to when I say “training environment and real world might be highly compatible”.
I think an idea like Legg-Hutter intelligence could be useful for arguing that the environments are likely to be compatible.