Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.
I think there is probably a much simpler proposal that captures the spirit of this and doesn’t require any of these moving parts. I’ll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose its preferences if you’re willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.
I think there is probably a much simpler proposal that captures the spirit of this and doesn’t require any of these moving parts. I’ll think about this at some point.
Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn’t run into the same barriers to generalization as we see when we consider training for honesty.
I think there should be a way to get the same guarantees that only requires considering a single different conditional, which should be much easier to reason about.
Maybe something like “what would you do in the conditional where humanity gives you full arbitrary power”.