I found this post clarifying. One thing I’m still uncertain of: what’s the architecture of the Reporter in this proposal? Does it have two heads, one for proposing changes to the Predictor’s state and one for answering questions? If so, can I think of the training process as:
1. Use the proposal head to get a proposed change.
2. Change the latent state of the Predictor.
3. Ask a question and see if the answer head gives the desired answer in the new state.
4. Train the proposal head on the difference between the desired answer and the given answer.
5. Separately, train the answer head on lots of counterfactual questions, now that we have the ability to pose counterfactuals about the vault.
Is that right?
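Concretely, here is a minimal sketch of the loop I have in mind (the architecture, losses, and dimensions are placeholders I made up, not anything from the post):

```python
import torch
import torch.nn as nn

# Toy dimensions; every architectural choice here is my own placeholder.
STATE_DIM, QUESTION_DIM, ANSWER_DIM = 64, 16, 8

class Reporter(nn.Module):
    """Two-headed reporter: one head proposes changes to the predictor's
    latent state, the other answers questions about a given state."""
    def __init__(self):
        super().__init__()
        self.proposal_head = nn.Sequential(
            nn.Linear(STATE_DIM + QUESTION_DIM + ANSWER_DIM, 128), nn.ReLU(),
            nn.Linear(128, STATE_DIM))
        self.answer_head = nn.Sequential(
            nn.Linear(STATE_DIM + QUESTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, ANSWER_DIM))

    def propose(self, state, question, desired_answer):
        return self.proposal_head(torch.cat([state, question, desired_answer], -1))

    def answer(self, state, question):
        return self.answer_head(torch.cat([state, question], -1))

reporter = Reporter()
opt_proposal = torch.optim.Adam(reporter.proposal_head.parameters(), lr=1e-3)
opt_answer = torch.optim.Adam(reporter.answer_head.parameters(), lr=1e-3)

for _ in range(100):
    # Stand-ins for a predictor latent state, a question, and the answer we want.
    state = torch.randn(32, STATE_DIM)
    question = torch.randn(32, QUESTION_DIM)
    desired = torch.randn(32, ANSWER_DIM)

    # (1)-(2): propose a change and apply it to the predictor's latent state.
    counterfactual = state + reporter.propose(state, question, desired)

    # (3)-(4): ask the question in the new state and train the proposal head
    # on the gap between the desired answer and the answer actually given.
    given = reporter.answer(counterfactual, question)
    proposal_loss = nn.functional.mse_loss(given, desired)
    opt_proposal.zero_grad()
    opt_answer.zero_grad()
    proposal_loss.backward()
    opt_proposal.step()  # only the proposal head is updated here

    # (5): separately train the answer head on counterfactual questions.
    # Where the labels for these counterfactuals come from is exactly the
    # part I'm unsure about; I just reuse `desired` as a placeholder here.
    answer_loss = nn.functional.mse_loss(
        reporter.answer(counterfactual.detach(), question), desired)
    opt_answer.zero_grad()
    answer_loss.backward()
    opt_answer.step()
```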
I think of the proposed changes as coming from the reporter, or at least as dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.
The issue with the training process as you describe it is (3): it would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to perform two functions, answering questions and proposing changes. We could also just use the question-answering functionality to search over the predictor’s state space, without understanding it, until we find a state that gives the desired answers to a set of questions.
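Concretely, the kind of search I have in mind is something like the following black-box hill climb, where `answer_fn` stands in for the reporter’s question-answering head and everything else is a placeholder of my own:

```python
import torch

@torch.no_grad()
def search_counterfactual_state(state, questions, desired_answers, answer_fn,
                                steps=1000, step_size=0.05):
    """Search over the predictor's latent state without understanding it:
    repeatedly perturb the state and keep any perturbation that moves the
    reporter's answers closer to the desired ones. `answer_fn(state, question)`
    stands in for the reporter's question-answering head."""
    def distance(s):
        return sum(torch.linalg.norm(answer_fn(s, q) - a)
                   for q, a in zip(questions, desired_answers))

    best, best_dist = state, distance(state)
    for _ in range(steps):
        candidate = best + step_size * torch.randn_like(best)
        candidate_dist = distance(candidate)
        if candidate_dist < best_dist:
            best, best_dist = candidate, candidate_dist
    return best
```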
I don’t think (3) requires that we already have the direct translator? At least, I was imagining that the proposal head proposes a change that produces a given answer, e.g. “find a state such that the answer to my question is ‘pineapple’”, and then we penalize the proposal head if the answer isn’t “pineapple”. But now that I write that out, I see that it’s easy for this to short-circuit badly, e.g. via steganography in the proposed changes.
So I’m back to being confused: what’s the training process meant to be here?
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head against a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely sure how to generate proposals efficiently, but some search process conditioned on the reporter seems feasible.
Human simulators will need larger changes to the predictor’s state to answer certain questions, since the answer to the question must be visible to a human observer. The reporter is then trained with a penalty term on how large a change has to be made to the predictor’s state before the reporter gives a particular answer to specific questions, given an initial scenario.
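A rough sketch of how that penalty term could be measured, assuming the two-headed reporter interface sketched upthread (the optimization procedure and all constants are my own placeholders):

```python
import torch

def change_size_penalty(reporter, state, question, target_answer,
                        steps=50, lr=0.1):
    """Estimate how large a change to the predictor's latent state is needed
    before the reporter gives `target_answer` to `question`, by optimizing a
    delta on top of the state. A human simulator should need a larger delta,
    since the change must show up in what a human could observe, so the size
    of the delta can serve as a penalty when training or selecting reporters."""
    delta = torch.zeros_like(state, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        answer = reporter.answer(state + delta, question)
        loss = torch.nn.functional.mse_loss(answer, target_answer)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach().norm()  # larger norm -> more human-simulator-like
```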
This proposal also works as an “audit” at the end, checking a variety of counterfactuals in order to catch human simulators, but in that case it does not suggest a change to the reporter; instead, a failed audit is a sign to scrap everything and start over.
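Something like the following is what I mean by the audit variant, building on the hypothetical `change_size_penalty` helper above (the threshold is an arbitrary placeholder):

```python
def audit_reporter(reporter, scenarios, questions, target_answers, threshold=10.0):
    """Post-hoc audit: over a variety of scenarios and counterfactual questions,
    measure how large a state change is needed to flip the reporter's answer.
    A suspiciously large change is evidence of a human simulator; the output is
    a pass/fail verdict to scrap and restart, not a gradient to train against."""
    for state, question, target in zip(scenarios, questions, target_answers):
        if change_size_penalty(reporter, state, question, target) > threshold:
            return False  # looks like a human simulator: scrap and start over
    return True
```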