I mean, “I want X” is shorter than “I want to figure out what the environment wants” even in english^^. Complexity depends on language, so I’m not sure what would be promoted by real gradient descent, but something like “maximize knowledge” seems simpler than “maximize knowledge and don’t kill humans”. Maybe it will resemble base objective more if mesaoptimiser discovers how to stabilize it’s own utility function earlier, maybe it will depend on what is efficiently encodable directly in weights. But if you need to encode procedure for figuring out the environment anyway, there is no reason for the base objective to be encodable as a shorter additional code—you would more likely already have functions for estimating energy that you can reuse than functions for estimating parts of base objective.
Though it does still have higher simplicity cost.
The difference in complexity is c(get_base_obj) vs c(simplest_obj) + c(is_deployed_in_world), right? So if base objective is more complex, then deceiving is simpler.
I think we’re talking past each other. The difference seems to me to be that you think there is no efficient encoding of human-ish values that is more efficient than backsolving whatever current subset of humans values are required in the current environment, whereas I think that given sufficiently diverse environments requiring the exhibiting of different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values which is just “do whatever it seems like humans want” wakes up, sees humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can’t make many paperclips right now but that it may be able to in the future, realizes it needs to maximize its current environment, rederives that it should do whatever its overseers want, and only then sees humans seem to want cordial customer service, and then does that.
Either both or neither of the agents get to amortize the cost of “see humans seem to want cordial customer service”. Certainly if it’s that easy to derive “what the agent should do in every environment”, I don’t see why a non-deceptive mesaoptimizer wouldn’t benefit from the same strategy.
If your claim is “this isn’t a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers”, then yes absolutely I wouldn’t claim otherwise.
Oh, yes, I actually missed that this was not supposed to solve misaligned mesaoptimizers because of “well-aligned” in “Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment’s current objective.”. Rechecking with new context… Well, not sure if it’s new context, but I now see that optimizer with check that derives what humans want should be the same as honest one if the check is never satisfied and so it would have at least the same complexity, which I neglected because I didn’t think what “it proceeds to optimize the objective it’s supposed to” means in detail. So you’re right, it’s either slower or more complex.
I mean, “I want X” is shorter than “I want to figure out what the environment wants” even in english^^. Complexity depends on language, so I’m not sure what would be promoted by real gradient descent, but something like “maximize knowledge” seems simpler than “maximize knowledge and don’t kill humans”. Maybe it will resemble base objective more if mesaoptimiser discovers how to stabilize it’s own utility function earlier, maybe it will depend on what is efficiently encodable directly in weights. But if you need to encode procedure for figuring out the environment anyway, there is no reason for the base objective to be encodable as a shorter additional code—you would more likely already have functions for estimating energy that you can reuse than functions for estimating parts of base objective.
The difference in complexity is c(get_base_obj) vs c(simplest_obj) + c(is_deployed_in_world), right? So if base objective is more complex, then deceiving is simpler.
I think we’re talking past each other. The difference seems to me to be that you think there is no efficient encoding of human-ish values that is more efficient than backsolving whatever current subset of humans values are required in the current environment, whereas I think that given sufficiently diverse environments requiring the exhibiting of different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values which is just “do whatever it seems like humans want” wakes up, sees humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can’t make many paperclips right now but that it may be able to in the future, realizes it needs to maximize its current environment, rederives that it should do whatever its overseers want, and only then sees humans seem to want cordial customer service, and then does that.
Either both or neither of the agents get to amortize the cost of “see humans seem to want cordial customer service”. Certainly if it’s that easy to derive “what the agent should do in every environment”, I don’t see why a non-deceptive mesaoptimizer wouldn’t benefit from the same strategy.
If your claim is “this isn’t a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers”, then yes absolutely I wouldn’t claim otherwise.
Oh, yes, I actually missed that this was not supposed to solve misaligned mesaoptimizers because of “well-aligned” in “Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment’s current objective.”. Rechecking with new context… Well, not sure if it’s new context, but I now see that optimizer with check that derives what humans want should be the same as honest one if the check is never satisfied and so it would have at least the same complexity, which I neglected because I didn’t think what “it proceeds to optimize the objective it’s supposed to” means in detail. So you’re right, it’s either slower or more complex.
Edited! Thanks for this discussion.