I think of the proposed changes as coming from the reporter, or at least as depending on the reporter. If the reporter has no better model of what is going on in the predictor than what a human could guess, it will be unable to propose a counterfactual predictor state.
The issue with the training process as you describe it is part 3: it would require a direct translator to train on the difference between the desired answer and the given answer. Instead, we want to train the reporter to perform two functions, answering questions and proposing changes. We could also just use the question-answering functionality to search over the predictor’s state space, without understanding it, until we find a state that gives the desired answers to a set of questions.
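To make the “search without understanding it” version concrete, something like the sketch below is what I have in mind. This is only illustrative: the reporter interface, the tensor-valued predictor state, and the random-perturbation search are all assumptions on my part, not part of the proposal itself.

```python
import torch

def search_for_state(reporter, init_state, qa_targets, tries=1000, noise=0.1):
    # qa_targets: list of (question, target_answer_id) pairs.
    # Black-box search: randomly perturb the predictor state until the
    # reporter's own QA head gives the desired answer to every question,
    # without ever interpreting the state itself.
    for _ in range(tries):
        candidate = init_state + noise * torch.randn_like(init_state)
        if all(reporter.answer(candidate, q).argmax().item() == target
               for q, target in qa_targets):
            return candidate
    return None  # no satisfying state found within the budget
```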
I don’t think (3) requires that we already have the direct translator? At least, I was imagining that the proposal head proposes a change that produces a given answer. E.g. “Find a state so that the answer to my question is ‘pineapple’”, and then we penalize if the answer isn’t “pineapple”. But now that I write that out, I see that it’s easy for this to short-circuit badly, e.g. via steganography in the proposed changes.
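Concretely, the objective I was imagining looks something like this minimal sketch. The names are hypothetical: `reporter.answer` and `reporter.propose` stand in for the QA head and the proposal head, and nothing here presumes a direct translator.

```python
import torch
import torch.nn.functional as F

def proposal_head_loss(reporter, state, question, target_answer_id):
    # Ask the proposal head for a counterfactual predictor state that should
    # make the QA head answer with target_answer_id (e.g. "pineapple").
    proposed_state = reporter.propose(state, question, target_answer_id)

    # Answer the question on the proposed state using the QA head.
    logits = reporter.answer(proposed_state, question)

    # Penalize whenever the answer on the proposed state is not the target.
    # The failure mode mentioned above: the proposal head can "win" by hiding
    # the target in the state (steganography) rather than by making it true
    # in the predictor's model of the world.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([target_answer_id]))
```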
So I’m back to being confused: what’s the training process meant to be here?
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head against a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to generate proposals efficiently, but some search process conditioned on the reporter seems feasible.
Human simulators will need larger changes to the predictor state to answer certain questions, since the answer to the question must become visible to a human observer. The reporter is then trained with a penalty term on how large a change has to be made to the predictor state in order for it to answer specific questions a certain way, given an initial scenario.
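For what it’s worth, here is roughly how I picture that penalty being computed. This is a sketch under strong assumptions: the predictor state is a tensor we can perturb, the reporter’s answers are differentiable in it, and the gradient search below is just one way to look for a small enough change; the names are hypothetical.

```python
import torch
import torch.nn.functional as F

def needed_change_size(reporter, state, question, target_answer_id,
                       steps=50, lr=0.1):
    # Gradient search for a small perturbation `delta` to the predictor state
    # that makes the reporter give the target answer to the question.
    delta = torch.zeros_like(state, requires_grad=True)
    for _ in range(steps):
        logits = reporter.answer(state + delta, question)
        answer_loss = F.cross_entropy(logits.unsqueeze(0),
                                      torch.tensor([target_answer_id]))
        loss = answer_loss + delta.norm()           # prefer small changes
        (grad,) = torch.autograd.grad(loss, delta)  # only delta is updated
        with torch.no_grad():
            delta -= lr * grad
    # A human simulator should need a larger delta, because the new answer has
    # to become visible to a (simulated) human observer; the reporter is then
    # penalized in proportion to this size during training.
    return delta.detach().norm()
```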
This proposal also works as an “audit” at the end, checking a variety of counterfactuals in order to catch human simulators, but used that way it does not suggest a change to the reporter; instead, a failed audit is a sign to scrap everything and start over.