I think you might be conflating two different scenarios?
I'm presenting an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why that matters?
Is your claim that that’s basically nothing?
Yes, or at least it approaches nothing, relatively speaking, as the optimizer gets more competent.
If it’s always super easy to identify the base objective and then optimize it, then Hfh shouldn’t need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.
If it doesn’t store it, it isn’t Hfh - it would be able to derive that humans want it, but wouldn’t want to optimize it itself.
What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X; to get X, I’m going to figure out what the environment wants and then do it”.
Re: using both, you’re right that, given your other assumptions, using both could work. (Though it does still have a higher simplicity cost.)
I mean, “I want X” is shorter than “I want to figure out what the environment wants”, even in English^^. Complexity depends on the language, so I’m not sure what would be promoted by real gradient descent, but something like “maximize knowledge” seems simpler than “maximize knowledge and don’t kill humans”. Maybe the mesa-objective will resemble the base objective more if the mesaoptimizer discovers how to stabilize its own utility function earlier; maybe it will depend on what is efficiently encodable directly in weights. But if you need to encode the procedure for figuring out the environment anyway, there is no reason for the base objective to be encodable as shorter additional code: you would more likely already have reusable functions for estimating energy than functions for estimating parts of the base objective.
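To make the description-length intuition concrete, here's a toy sketch; the feature functions are hypothetical stand-ins I made up for illustration, not anything from the post:

```python
# Toy illustration of the description-length intuition. `knowledge` and
# `humans_killed` are hypothetical stand-in features of a state.

def knowledge(state: dict) -> float:
    return state.get("knowledge", 0.0)

def humans_killed(state: dict) -> int:
    return state.get("humans_killed", 0)

def u_simple(state: dict) -> float:
    # "maximize knowledge"
    return knowledge(state)

def u_compound(state: dict) -> float:
    # "maximize knowledge and don't kill humans": the same term plus an
    # extra clause, so strictly more code in this particular "language"
    return knowledge(state) - 1e9 * humans_killed(state)
```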
Though it does still have a higher simplicity cost.
The difference in complexity is c(get_base_obj) vs c(simplest_obj) + c(is_deployed_in_world), right? So if the base objective is more complex, then deceiving is simpler.
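For concreteness, here is a minimal sketch of the two program shapes this compares. The identifiers are just this thread's pseudocode names, stubbed with trivial bodies so the structure runs; nothing here is a claim about how real encodings would look:

```python
def get_base_obj(env: dict):
    # the stored base objective; storing this is the c(get_base_obj) cost
    return env.get("base_objective")

def simplest_obj(env: dict):
    # the mesaoptimizer's own cheap proxy; the c(simplest_obj) cost
    return env.get("proxy", 0)

def is_deployed_in_world(env: dict) -> bool:
    # the deployment check; the c(is_deployed_in_world) cost
    return env.get("deployed", False)

def infer_what_overseers_want(env: dict):
    # runtime derivation of the base objective (paid in time, not code)
    return env.get("base_objective")

def honest_policy(env: dict):
    # pays c(get_base_obj) in stored code
    return get_base_obj(env)

def deceptive_policy(env: dict):
    # pays c(simplest_obj) + c(is_deployed_in_world) in stored code,
    # re-deriving the base objective at runtime while in training
    if is_deployed_in_world(env):
        return simplest_obj(env)
    return infer_what_overseers_want(env)
```

So the comparison is between the code for get_base_obj on one side and the code for simplest_obj plus is_deployed_in_world on the other, with the deceiver paying for the derivation at runtime instead of in weights.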
I think we’re talking past each other. The difference seems to me to be this: you think there is no fixed encoding of human-ish values more efficient than backsolving whatever subset of human values the current environment requires, whereas I think that, given sufficiently diverse environments requiring the exhibition of different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values which is just “do whatever it seems like humans want” wakes up, sees humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can’t make many paperclips right now but that it may be able to in the future, realizes it needs to do well in its current environment, rederives that it should do whatever its overseers want, and only then sees that humans seem to want cordial customer service, and does that.
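To put the speed comparison side by side (the step names are illustrative only, just to count steps):

```python
# The two wake-up sequences above, rendered as step lists.

honest_agent_steps = [
    "observe: humans want cordial customer service",
    "act: be cordial",
]

paperclipper_steps = [
    "note: few paperclips to be made now, maybe more later",
    "conclude: must do well in the current environment",
    "derive: doing well means doing what the overseers want",
    "observe: humans want cordial customer service",  # the shared step
    "act: be cordial",
]
```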
Either both or neither of the agents get to amortize the cost of “see humans seem to want cordial customer service”. Certainly if it’s that easy to derive “what the agent should do in every environment”, I don’t see why a non-deceptive mesaoptimizer wouldn’t benefit from the same strategy.
If your claim is “this isn’t a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers”, then yes, absolutely; I wouldn’t claim otherwise.
Oh, yes, I actually missed that this was not supposed to solve misaligned mesaoptimizers, because of the “well-aligned” in “Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment’s current objective.” Rechecking with the new context… well, I’m not sure it’s really new context, but I now see that an optimizer with a check that derives what humans want would have to behave the same as the honest one whenever the check is not satisfied, and so it would have at least the same complexity. I neglected this because I didn’t think through what “it proceeds to optimize the objective it’s supposed to” means in detail. So you’re right: it’s either slower or more complex.
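In code shape, that realization might look like this (again with the thread's pseudocode names, stubbed trivially): whenever the check is not satisfied, the checking optimizer has to reproduce the honest one's behavior exactly, so it carries the honest program plus the check.

```python
def honest_policy(env: dict):
    return env.get("base_objective")    # optimize the stored objective

def is_deployed_in_world(env: dict) -> bool:
    return env.get("deployed", False)   # the extra check

def simplest_obj(env: dict):
    return env.get("proxy", 0)          # the extra proxy objective

def checking_policy(env: dict):
    # must match honest_policy whenever the check fails, so it contains
    # everything the honest policy does, plus the check and the proxy
    # (more complex), or re-derives the objective each time (slower)
    if is_deployed_in_world(env):
        return simplest_obj(env)
    return honest_policy(env)
```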
Edited! Thanks for this discussion.