I think we’re talking past each other. The difference seems to be this: you think there is no encoding of human-ish values that is more efficient than backsolving whatever subset of human values the current environment requires, whereas I think that, given sufficiently diverse environments that require exhibiting different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values, which is just “do whatever it seems like humans want”, wakes up, sees that humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can’t make many paperclips right now but that it may be able to in the future, realizes it needs to optimize its current environment’s objective, rederives that it should do whatever its overseers want, and only then sees that humans seem to want cordial customer service, and does that.
Either both agents get to amortize the cost of “see that humans seem to want cordial customer service”, or neither does. Certainly, if it’s that easy to derive “what the agent should do in every environment”, I don’t see why a non-deceptive mesaoptimizer wouldn’t benefit from the same strategy.
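To make the step-counting intuition concrete, here is a toy sketch (my own illustration; the function and key names are hypothetical, not from the original discussion). Both agents pay the same cost for the shared “infer what humans want here” step; the paperclip agent pays for extra instrumental reasoning on top of it before producing the same behavior.

```python
def infer_human_preference(env):
    # Stand-in for "see that humans seem to want cordial customer service".
    return env["observed_human_preference"]

def value_aligned_agent(env):
    # Fixed goal: "do whatever it seems like humans want". One inference step.
    return infer_human_preference(env)

def paperclip_agent(env):
    # Fixed goal: paperclips. Behaves well only instrumentally.
    if env["can_make_paperclips_now"]:   # extra reasoning the aligned agent skips
        return "make paperclips"
    # Extra step: rederive "do what the overseers want for now",
    # then run the exact same inference the aligned agent runs.
    return infer_human_preference(env)

env = {"observed_human_preference": "cordial customer service",
       "can_make_paperclips_now": False}
assert value_aligned_agent(env) == paperclip_agent(env)  # same behavior, more steps
```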
If your claim is “this isn’t a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers”, then yes, absolutely, I wouldn’t claim otherwise.
Oh, yes, I had actually missed that this was not supposed to solve misaligned mesaoptimizers in general, because of “well-aligned” in “Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment’s current objective.” Rechecking with that context… well, I’m not sure it’s new context, but I now see that an optimizer with a check that derives what humans want must behave the same as the honest one whenever the check is never satisfied, and so it has at least the same complexity. I had neglected this because I didn’t think through what “it proceeds to optimize the objective it’s supposed to” means in detail. So you’re right: it’s either slower or more complex.
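Spelling out the structural point with a minimal sketch (my own illustration, assuming the deceptive optimizer literally contains the honest policy as its fallback; all names are hypothetical): whenever the defection check is never satisfied, the deceptive policy’s behavior reduces to the honest one’s, so it needs at least the honest policy’s description plus the check, and it runs the check on every step.

```python
def honest_policy(observation):
    # Optimizes a proxy well-aligned with the current environment's objective.
    return f"optimize the training objective given {observation}"

def deceptive_policy(observation, can_defect):
    # Extra machinery: the defection check, paid for on every step.
    if can_defect(observation):
        return "pursue the true (misaligned) goal"
    # Otherwise it behaves exactly like the honest policy, which it must contain.
    return honest_policy(observation)

# While the check never fires, the behaviors match exactly, but the deceptive
# policy is strictly larger (honest_policy plus the check) and does strictly
# more work per step.
never = lambda obs: False
assert deceptive_policy("episode 1", never) == honest_policy("episode 1")
```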
Edited! Thanks for this discussion.