Two basic questions I couldn’t figure out (sorry):
Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?
If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?
You can use a different oracle for every subquestion, but it’s unclear how much that helps if you don’t know the oracle’s actual objective. For example, you could imagine one system that cares about the reward given to its copies just as much as the reward given to itself, and another that only cares about the reward given to itself, and these two systems would be nearly indistinguishable if you were just doing empirical analysis on some training distribution.
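To make that concrete, here’s a toy sketch (entirely my own illustration; the reward structure and the training distribution are assumptions, not anything from the setup above) of two objectives that agree everywhere on a distribution where copies never receive reward, and only come apart off-distribution:

```python
# Toy illustration: two candidate objectives that are indistinguishable on a
# training distribution where this copy's answers never affect the reward
# given to its copies.
import random

def selfish_objective(own_reward, copy_rewards):
    # Cares only about the reward given to this copy.
    return own_reward

def shared_objective(own_reward, copy_rewards):
    # Cares equally about the reward given to every copy.
    return own_reward + sum(copy_rewards)

def sample_training_episode():
    # Hypothetical training distribution: only the copy being trained is
    # ever rewarded, so the copies' rewards are always zero.
    return random.random(), [0.0, 0.0, 0.0]

if __name__ == "__main__":
    for _ in range(1000):
        own, copies = sample_training_episode()
        # On this distribution the two objectives always agree, so no amount
        # of empirical analysis of training behaviour separates them...
        assert selfish_objective(own, copies) == shared_objective(own, copies)
    # ...but off-distribution, where copies actually receive reward, they diverge.
    print(selfish_objective(1.0, [1.0, 1.0]))  # 1.0
    print(shared_objective(1.0, [1.0, 1.0]))   # 3.0
```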
The key here, I think, is the degree to which you’re willing to make an assumption of the form you mention—that is, how much leeway you’re willing to allow in assuming that the oracle really is only going to optimize for the specified objective function. On one level, it makes sense to separate outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you’re allowed to specify here depend heavily on whatever inner alignment solution you’re assuming you have access to. For example, does your inner alignment solution require access to training data? If so, that’s a big constraint on the sorts of objectives you can specify. And, going back to your previous question, whether I get to pick between an objective that doesn’t care about other copies and one that does will also depend on exactly what sort of inner alignment solution you’re assuming.
Well, a given copy of the oracle wouldn’t directly receive information from the other oracles about the questions they were asked. To the extent that a problem remains (which I agree is likely without specific assumptions), wouldn’t it apply to all counterfactual oracles?
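For what it’s worth, here’s a minimal sketch of the per-subquestion counterfactual-oracle setup I have in mind (the names, erasure probability, and loss/ground-truth hooks are all illustrative assumptions): reward is only ever computed in the erasure branch where the answer is discarded unread, and each subquestion gets its own isolated episode, so no copy sees anything about the other copies’ questions or answers.

```python
# Sketch of one isolated counterfactual-oracle episode per subquestion.
import random

ERASURE_PROB = 0.01  # illustrative value

def run_episode(oracle, question, ground_truth_fn, loss_fn):
    answer = oracle(question)
    if random.random() < ERASURE_PROB:
        # Erasure branch: the answer is discarded unread; reward comes purely
        # from comparing the answer to the automatically measured outcome.
        outcome = ground_truth_fn(question)
        reward = -loss_fn(answer, outcome)
        return None, reward          # nothing is shown to humans
    else:
        # Normal branch: humans see the answer, but the episode contributes
        # zero reward, so there is no incentive to shape what they do with it.
        return answer, 0.0
```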