I was thinking about this, and it’s a bit unclear.
First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you’re guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you’re relying on getting enough training data to produce an agent which will optimize for this objective, you’re screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you’re already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you’re already willing to make this (very strong) assumption, so it’s fine.
Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
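To make the second relaxation a bit more concrete, here is a minimal sketch under a toy model that is my own assumption rather than anything specified above: erasures run down a single chain of subquestions, `erasure_probs[k]` is the chance that the next level is also erased, and a run of erasures that reaches depth `k` earns `reward_base ** k`. The function name and parameters are hypothetical. It computes what fraction of the expected reward comes from the complete-erasure world.

```python
from math import prod


def complete_erasure_share(erasure_probs, reward_base):
    """Fraction of expected reward contributed by the all-erased world.

    Toy assumptions (not from the original proposal):
      - erasure_probs[k] is the probability that level k+1 is erased,
        given that every shallower level was erased;
      - a run of erasures reaching depth k earns reward_base ** k;
      - no reward is given if the top-level question is not erased.
    """
    depth = len(erasure_probs)
    contributions = []
    for k in range(1, depth + 1):
        reach = prod(erasure_probs[:k])  # levels 1..k were all erased
        # The run stops at depth k if level k+1 is not erased
        # (or if k is already the deepest level).
        stop = 1.0 if k == depth else 1.0 - erasure_probs[k]
        contributions.append(reach * stop * reward_base ** k)
    return contributions[-1] / sum(contributions)


flat = complete_erasure_share([0.1] * 4, reward_base=1)
scaled = complete_erasure_share([0.1] * 4, reward_base=100)
rising = complete_erasure_share([0.1, 0.5, 0.9, 0.99], reward_base=1)
```

In this toy setup, a flat erasure probability with unscaled reward leaves almost none of the expected reward in the complete-erasure world, while either relaxation shifts the share toward it. Whether that carries over to a real training setup is a separate question.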
First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.
But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large, so you’d run out of resources (in the counterfactual world) if you tried to do that, and the Oracle wouldn’t be able to give you a useful prediction. Wouldn’t it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my “predict the best AF posts” idea)? I don’t see what value the IDA idea is adding here.
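To put a rough number on “exponentially large”: assuming, purely for illustration, that every question spawns $b$ subquestions and the tree has depth $D$ (both symbols are my own, not from the thread), the number of questions humans would have to answer by hand is

$$N = \sum_{k=0}^{D} b^k = \frac{b^{D+1} - 1}{b - 1}.$$

For example, $b = 3$ and $D = 10$ already gives $N = (3^{11} - 1)/2 = 88573$ questions.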
Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.
Given the above, “only provide rewards in the event of a complete erasure” doesn’t seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?
Yeah, that’s a good point.
Okay, here’s another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because whether you get a mesa-optimizer that cares only about a single step’s worth of reward, all the reward it will ever get, or something in between is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.
Maybe we can turn this into a zero-sum game, though? Here’s a proposal: let $M'$ be a copy of $M$ and $Q_\text{tree}$ be the set of all questions in the current tree that also get erasures. Then, let
$$L_M(Q) = d(H(Q \mid M'), M(Q)) - \frac{1}{|Q_\text{tree}| - 1} \sum_{Q' \in Q_\text{tree} \setminus \{Q\}} d(H(Q' \mid M'), M'(Q'))$$
such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It’s still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.
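As a sanity check on the zero-sum claim, here is a small sketch that takes the distances for the erased questions as plain numbers and applies the loss above. It assumes $M'$ really is an exact copy of $M$, so both models have the same distance on each question. The function name and the example values are hypothetical.

```python
def zero_sum_losses(distances):
    """Apply the proposed loss to every erased question in one tree.

    `distances` maps each question Q in Q_tree to d(H(Q|M'), M(Q)).
    Assumption: M' is an exact copy of M, so the same number also
    serves as d(H(Q|M'), M'(Q)) in the subtracted average.
    """
    n = len(distances)
    if n < 2:
        raise ValueError("need at least two erased questions in the tree")
    total = sum(distances.values())
    # Own distance minus the mean distance over the other n - 1 questions.
    return {q: d - (total - d) / (n - 1) for q, d in distances.items()}


example = {"root": 0.9, "sub1": 0.2, "sub2": 0.4}
losses = zero_sum_losses(example)
net = sum(losses.values())
```

With these numbers the individual losses are nonzero but `net` comes out to zero up to floating-point error. That matches the algebra: each distance appears once with weight 1 and $|Q_\text{tree}| - 1$ times with weight $-1/(|Q_\text{tree}| - 1)$.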
It is also worth noting, however, that even if this works it is a very artificial fix, since the term you’re subtracting is a constant with no dependence on M(Q), so if you’re trying to do gradient descent to optimize this loss, it won’t change anything at all (which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we’re still back at the problem of none of this working unless you’re willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
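To spell out the gradient point, write $\theta$ for the parameters of $M$ (my notation), and assume that the copy $M'$ is held fixed and that $H(Q' \mid M')$ is not differentiated through. Then the subtracted term has no dependence on $\theta$, and

$$\nabla_\theta L_M(Q) = \nabla_\theta\, d(H(Q \mid M'), M_\theta(Q)) - \underbrace{\nabla_\theta \left[ \frac{1}{|Q_\text{tree}| - 1} \sum_{Q' \in Q_\text{tree} \setminus \{Q\}} d(H(Q' \mid M'), M'(Q')) \right]}_{=\,0} = \nabla_\theta\, d(H(Q \mid M'), M_\theta(Q)),$$

which is exactly the gradient of the original, non-zero-sum loss.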
which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives
Sorry, I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?
In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I’m not sure how that can happen at the lower levels where the overseers are not very smart.
Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.