My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle’s answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human’s answer to that of the oracle.
More precisely, let
$M : \mathcal{Q} \to \mathcal{A}$ be the counterfactual oracle,
$H(Q \mid M)$ be the human’s answer to question $Q \in \mathcal{Q}$ when given the ability to call $M$ on any question other than $Q$, and
$d : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ be some distance metric on answers in natural language (it’s not that hard to make something like this, even with current ML tools).
Then, reward $M$ as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let
$$L_M = d(H(Q \mid M),\, M(Q))$$
where $M(Q)$ is hidden from $H$ and judged only by $d$, as in the standard counterfactual oracle setup.
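To make the setup concrete, here is a minimal sketch of a single episode. Everything in it is illustrative rather than part of the proposal: the callables for the oracle, the human, and the metric $d$ (which could be, say, an embedding-based similarity score) are assumptions, as is the erasure probability.

```python
import random

def counterfactual_ida_episode(M, human_answer, d, question, erasure_prob=0.01):
    """One episode of the counterfactual-oracle IDA setup (illustrative sketch).

    M            : callable question -> answer (the oracle).
    human_answer : callable (question, ask) -> answer; `ask` lets the human
                   query the oracle on subquestions other than `question`.
    d            : callable (answer, answer) -> float (the distance metric).

    Returns (answer_to_use, loss); loss is None when there is no erasure,
    so the oracle gets no reward signal in that case.
    """
    oracle_answer = M(question)
    if random.random() >= erasure_prob:
        # No erasure: use the oracle's answer directly; no reward is given.
        return oracle_answer, None

    # Erasure: hide M(Q) from the human, who answers Q while consulting the
    # oracle on subquestions; erasures during those calls are handled the
    # same way, recursively.
    def ask(subquestion):
        assert subquestion != question, "the human may not ask Q itself"
        sub_answer, _ = counterfactual_ida_episode(
            M, human_answer, d, subquestion, erasure_prob)
        return sub_answer

    h = human_answer(question, ask)
    loss = d(h, oracle_answer)  # L_M = d(H(Q|M), M(Q)); M(Q) is never shown to H
    return h, loss
```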
(Of course, this doesn’t actually work because it has no guarantees with respect to inner alignment, but I think it has a pretty good shot of being outer aligned.)
Is it safe to ask the Oracle a subquestion in the event of erasure? Aren’t you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven’t been following it closely.)
I’m not sure I understand the concern. Isn’t the oracle answering each question to maximize its payoff on that question in event of an erasure? So it doesn’t matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles—or is there some way that an oracle is a distinguished part of the environment?)
If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to answer in a way that gets it an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.
I guess you can construct a model where the oracle does what you want but only if you don’t ask any other oracles questions during the evaluation period; it’s not clear to me how you would end up in that situation, though, and at that point it seems worth trying to flesh out a more precise model.
Yeah, I’m not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I’ve been thinking about it myself. One thing I’ve come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they “want”, whereas the same thing doesn’t seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).
I mean, if the oracle hasn’t yet looked at the question they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.
Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.
Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well “on the current question” or “in the real world” or “on the actual questions that it gets”). That’s because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input. In the nested case, performance on given inputs does increase.
in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input
Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?
I was assuming each “question” actually includes as much relevant history as we can gather about the world, to make the Oracle’s job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that’s not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn’t include X.
My point about “if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case” still stands though, right?
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
You may well be right about this, but I’m not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
You may well be right about this, but I’m not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
What I want: “There is a model in the class that has property P. Training will find a model with property P.”
What I don’t want: “The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P.”
Example of what I don’t want: “Manipulative actions don’t help get a high reward (at least for the episodic reward function we intended), so the model won’t produce manipulative actions.”
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:
Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?
ETA: Is one instance of the idea to see whether we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:
This is an objection to reasoning from incentives, but it’s stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from “what kind of policy would be selected under a plausible objective”). It’s hard for me to see how nested vs. sequential really matters here.
On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?
(I don’t think model class is going to matter much.)
I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.
(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)
I think it’s also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates—i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
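Here is a toy sketch of that internal-RL picture, under assumptions that are mine rather than Paul’s: a single categorical “cognitive action” per round, made-up per-action losses, and a running mean standing in for the predicted loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_update(logits, action, loss, predicted_loss, lr=0.1):
    """One REINFORCE-with-baseline step on a categorical cognitive action.

    Raises the log-probability of `action` when `loss` came in below
    `predicted_loss`, and lowers it when the loss came in above."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    advantage = predicted_loss - loss        # positive iff we beat the prediction
    grad_logp = -probs
    grad_logp[action] += 1.0                 # d log pi(action) / d logits
    return logits + lr * advantage * grad_logp

# Toy usage: three possible internal actions with hypothetical losses.
logits = np.zeros(3)
predicted_loss = 1.0
for _ in range(200):
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    a = rng.choice(3, p=probs)
    loss = [0.2, 1.0, 1.5][a] + 0.1 * rng.standard_normal()
    logits = reinforce_update(logits, a, loss, predicted_loss)
    predicted_loss = 0.9 * predicted_loss + 0.1 * loss  # running-mean baseline
```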
ETA: Is one instance of the idea to see whether we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?
Seems like the counterfactual issue doesn’t come up in the Opt case, since you aren’t training the algorithm incrementally—you’d just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of OPT (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
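One natural reading of “the mixture of models that would do best so far” is an exponential-weights (Hedge-style) mixture over a finite model class. This is only a guess at what is meant, and the learning rate is an assumed parameter:

```python
import numpy as np

def hedge_mixture(cumulative_losses, eta=0.5):
    """Distribution over models that concentrates weight on the models with
    the lowest cumulative loss observed so far (multiplicative weights)."""
    losses = np.asarray(cumulative_losses, dtype=float)
    w = np.exp(-eta * (losses - losses.min()))  # shift by the min for stability
    return w / w.sum()

# e.g. after feeding in inputs and losses one at a time for three models:
print(hedge_mixture([3.2, 1.1, 7.5]))
```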
I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.
This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn’t that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?
(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)
I don’t understand what you mean in this paragraph (especially “since each possible parameter setting is being evaluated on what other parameter settings say anyway”), even after reading Stuart’s post, plus Stuart has changed his mind and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart’s reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)
would want to do an online learning version of OPT (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
Couldn’t you simulate that with Opt by just running it repeatedly?
This seems to ignore regularizers that people use to try to prevent overfitting and to make their models generalize better. Isn’t that liable to give you bad intuitions versus the actual training methods people use and especially the more advanced methods of generalization that people will presumably use in the future?
“The best model” is usually regularized. I don’t think this really changes the picture compared to imagining optimizing over some smaller space (e.g. the space of models with regularizer < x). In particular, I don’t think my intuitions are sensitive to the difference.
I don’t understand what you mean in this paragraph (especially “since each possible parameter setting is being evaluated on what other parameter settings say anyway”)
The normal procedure is: I gather data, and am using the model (and other ML models) while I’m gathering data. I search over parameters to find the ones that would make the best predictions on that data.
I’m not finding parameters that result in good predictive accuracy when used in the world. I’m generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters).
Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad?
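A minimal sketch of that procedure as I read it (the callables and the loop structure are assumptions for illustration): the model is deployed while data is gathered, the parameters are then refit to predict that data, and repeating this is what converges to a fixed point.

```python
def fixed_point_training(fit, deploy_and_collect, params, rounds=10):
    """Sketch: data is collected in a world that contains the current model,
    then parameters are refit to best predict that data, and so on."""
    data = []
    for _ in range(rounds):
        data += deploy_and_collect(params)  # gather data while using the model
        params = fit(data)                  # best predictor of the data so far
    return params
```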
I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart’s reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)
I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don’t think any of this is a problem (or if you look at it another way, it wouldn’t be a problem if you had a solution that had any chance at all of working).
I don’t think Stuart is thinking about it in this way, so it’s hard to engage at the object level, and I don’t really know what the alternative perspective is, so I also don’t know how to engage at the meta level.
Is there a particular claim where you think there is an interesting disagreement?
Couldn’t you simulate that with Opt by just running it repeatedly?
If I care about competitiveness, rerunning OPT for every new datapoint is pretty bad. (I don’t think this is very important in the current context, nothing depends on competitiveness.)
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
Does anyone know what Paul meant by this? I’m afraid I might be missing some relatively simple but important insight here.
If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to answer in a way that gets it an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.
Yeah, that’s a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that’s happening then it’s a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you’d have to solve if you ever actually wanted to implement a counterfactual oracle.
I was thinking about this, and it’s a bit unclear.
First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you’re guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you’re relying on getting enough training data to produce an agent which will optimize for this objective, you’re screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you’re already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you’re already willing to make this (very strong) assumption, so it’s fine.
Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
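As a rough illustration of how the second relaxation concentrates reward on complete erasures (with a per-level erasure probability $p$ and reward scale factor $c$ that are assumptions of mine, not part of the proposal): if an erasure chain reaching depth $k$ occurs with probability on the order of $p^k$ and its reward is scaled by $c^k$, then its contribution to the expected reward is on the order of
$$p^{k} \cdot c^{k} = (pc)^{k},$$
which grows with $k$ whenever $c > 1/p$, so the expectation is dominated by the deepest (complete) erasures.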
First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.
But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large, so you’d run out of resources (in the counterfactual world) if you tried to do that, and the Oracle won’t be able to give you a useful prediction. Wouldn’t it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my “predict the best AF posts” idea)? I don’t see what value the IDA idea is adding here.
Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.
Given the above, “only provide rewards in the event of a complete erasure” doesn’t seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?
Yeah, that’s a good point.
Okay, here’s another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step’s worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.
Maybe we can turn this into a zero-sum game, though? Here’s a proposal: let $M'$ be a copy of $M$ and $Q_{\text{tree}}$ be the set of all questions in the current tree that also get erasures. Then, let
$$L_M(Q) = d\big(H(Q \mid M'),\, M(Q)\big) - \frac{1}{|Q_{\text{tree}}| - 1} \sum_{Q' \in Q_{\text{tree}} \setminus \{Q\}} d\big(H(Q' \mid M'),\, M'(Q')\big)$$
such that the net sum of losses across each tree, over all the nodes for which losses are computed, is zero. It’s still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.
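For concreteness, here is a quick check that these losses do sum to zero across the $n = |Q_{\text{tree}}|$ scored nodes. Since $M'$ is a copy of $M$, write $d_Q := d(H(Q \mid M'), M'(Q)) = d(H(Q \mid M'), M(Q))$ and $S := \sum_{Q \in Q_{\text{tree}}} d_Q$; then
$$\sum_{Q \in Q_{\text{tree}}} L_M(Q) = \sum_{Q} d_Q - \frac{1}{n-1} \sum_{Q} \sum_{Q' \neq Q} d_{Q'} = S - \frac{(n-1)\,S}{n-1} = 0,$$
because each $d_{Q'}$ appears in exactly $n-1$ of the inner sums.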
It is also worth noting, however, that even if this works it is a very artificial fix, since the term you’re subtracting is a constant with no dependence on $M(Q)$, so if you’re trying to do gradient descent to optimize this loss, it won’t change anything at all (which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we’re still back at the problem of none of this working unless you’re willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives
Sorry I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?
In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I’m not sure how that can happen at the lower levels where the overseers are not very smart.
Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.
Two basic questions I couldn’t figure out (sorry):
Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?
If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?
You can use a different oracle for every subquestion, but it’s unclear what exactly that does if you don’t know what the oracle’s actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.
The key here, I think, is the degree to which you’re willing to make an assumption of the form you mention—that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you’re allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you’re assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that’s a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn’t care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you’re assuming.
Well, a given copy of the oracle wouldn’t directly receive information from the other oracles about the questions they were asked. To the extent a problem remains (which I agree is likely without specific assumptions), wouldn’t it apply to all counterfactual oracles?