I think there is probably a much simpler proposal that captures the spirit of this and doesn’t require any of these moving parts. I’ll think about it at some point.
Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn’t run into the same barriers to generalization as we see when we consider training for honesty.
I think there should be a way to get the same guarantees that only requires considering a single different conditional, which should be much easier to reason about.
Maybe something like “what would you do in the conditional where humanity gives you full, arbitrary power?”