I don’t think (3) requires that we already have the direct translator? At least I was imagining that the proposal head proposes a change that produces a given answer, e.g. “Find a state such that the answer to my question is ‘pineapple’”, and then we penalize if the answer isn’t “pineapple”. But now that I write that out, I see that it’s easy for this to short-circuit badly, e.g. via steganography in the proposed changes.
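For concreteness, here is a minimal sketch of the loop I was imagining, assuming hypothetical `proposal_head` and `reporter` modules and integer-encoded answers (all names are placeholders, not the actual architecture); the penalty is just cross-entropy between the reporter’s answer on the proposed state and the requested one:

```python
import torch
import torch.nn.functional as F

def proposal_mismatch_penalty(state, question, target_answer,
                              proposal_head, reporter):
    # Proposal head suggests a changed predictor state that is supposed
    # to make the answer come out as `target_answer` (e.g. "pineapple").
    proposed_state = proposal_head(state, question, target_answer)

    # Reporter answers the question on the proposed state.
    answer_logits = reporter(proposed_state, question)

    # Penalize the proposal head if the answer isn't the requested one.
    # Note: nothing here stops the proposal head from steganographically
    # encoding the target answer inside `proposed_state`, which is the
    # short-circuit worried about above.
    return F.cross_entropy(answer_logits, target_answer)
```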
So I’m back to being confused: what’s the training process meant to be here?
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.
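One candidate is a gradient-based search that uses the current reporter directly, so the proposals automatically track the reporter as it changes. A rough sketch, with all module names, the step count, and the regularization weight being illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def search_counterfactual(state, question, target_answer, reporter,
                          steps=100, lr=0.05, reg=0.01):
    # Optimize a perturbation of the predictor's state so that the
    # current reporter gives `target_answer`, keeping the change small.
    delta = torch.zeros_like(state, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = reporter(state + delta, question)
        loss = F.cross_entropy(logits, target_answer) + reg * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Because the search is driven by the reporter's own gradients, it is
    # "conditional on the reporter": change the reporter and rerun the
    # search, and the proposals change with it.
    return delta.detach()
```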
Human simulators will need larger changes to the predictor’s state to answer certain questions, since the answer to the question must be visible to a human observer. The reporter is then trained with a penalization term on how large a change has to be made to the predictor’s state to make it give a particular answer to specific questions, given an initial scenario.
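In code, and assuming the `search_counterfactual` sketch above, that penalty might look roughly like the following. As written, the change size is a non-differentiable score (the perturbation is detached), so in practice one would either differentiate through the search or use it to compare candidate reporters:

```python
def change_size_penalty(state, question, target_answer, reporter):
    # Size of the smallest change (found by the search above) to the
    # predictor's state that makes the reporter give `target_answer`.
    # A human simulator should need a larger change, since the answer
    # must become visible to the simulated human observer, so adding
    # this term to the reporter's training loss favors direct reporting.
    delta = search_counterfactual(state, question, target_answer, reporter)
    return delta.norm()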
This proposal also works as an “audit” at the end, checking a variety of counterfactuals in order to catch human simulators, but this does not suggest a change to the reporter. Instead, it is a sign to scrap everything and start over.
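As an audit, the same quantity could simply be checked over a batch of counterfactuals at the end of training; the threshold and the set of (state, question, answer) triples are whatever one expects a human simulator’s fingerprint to exceed. A sketch reusing the hypothetical helper above:

```python
def audit_reporter(scenarios, reporter, threshold):
    # Check a variety of counterfactual (state, question, target answer)
    # triples. If the reporter consistently needs large state changes to
    # produce the requested answers, it looks like a human simulator:
    # no gradient step follows, just a signal to scrap it and start over.
    sizes = [change_size_penalty(s, q, a, reporter) for (s, q, a) in scenarios]
    mean_size = sum(float(x) for x in sizes) / len(sizes)
    return mean_size > threshold  # True = fail the audit
```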