Suppose the interval between encounters with the predictor is long enough that, due to the agent’s temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also prefer to pre-commit to one-boxing in the next encounter if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short-term gain, so that’s what it does. Of course, if there is a way to precommit that is visible to the predictor, it will take it: this way it enjoys the best of both worlds, getting both boxes now and causing the predictor to cooperate on the next round.
Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞. That is, when I say the agent “converges” to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousand years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed “semi-anytime” algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don’t depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)
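A sketch of what I mean, in rough notation of my own (geometric discounting for simplicity, regret measured against the optimal policy for the true environment):

```latex
% Normalized discounted utility for discount parameter \gamma \in (0,1):
\[
  U_\gamma \;=\; (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} r_t .
\]
% "Converges" in my sense means regret vanishes as the discount becomes shallow,
\[
  \lim_{\gamma \to 1}\Big(\max_{\pi}\,\mathbb{E}\big[U_\gamma^{\pi}\big]
    \;-\; \mathbb{E}\big[U_\gamma^{\mathrm{agent}}\big]\Big) \;=\; 0,
\]
% rather than a statement about fixed $\gamma$ and $t \to \infty$.
% Semi-anytime: the algorithm receives a lower bound $\gamma_0$ as input and guarantees
\[
  \sup_{\gamma \ge \gamma_0} \mathrm{Regret}(\gamma) \;\le\; f(\gamma_0),
  \qquad \lim_{\gamma_0 \to 1} f(\gamma_0) = 0 .
\]
% Anytime: the same kind of bound, but the algorithm does not receive $\gamma_0$ at all.
```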
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short-term gain, so that’s what it does. Of course, if there is a way to precommit that is visible to the predictor, it will take it: this way it enjoys the best of both worlds, getting both boxes now and causing the predictor to cooperate on the next round.
Ok, sort of, but then this makes the discounting “wrong” in the same way that hyperbolic discounting is “wrong”: dynamic inconsistency. One might then say such decision-theoretic questions reduce to a question of what the right way to discount is (constraints on discounting functions such that we rule out the ones which are “wrong”). I find this perspective somewhat plausible but not obviously right.
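(For concreteness, here is the kind of dynamic inconsistency I have in mind, illustrated with a hyperbolic discount function and made-up numbers:)

```python
# Toy illustration of dynamic inconsistency under hyperbolic discounting.
# Discount function d(t) = 1 / (1 + k*t); the rewards and k are arbitrary.

def value(reward, delay, k=2.0):
    """Present value of `reward` received after `delay` steps."""
    return reward / (1.0 + k * delay)

# Viewed from t = 0: a small reward at t = 9 vs a large reward at t = 10.
print(value(5, 9), value(10, 10))   # ~0.26 vs ~0.48 -> prefers the later, larger reward

# Viewed from t = 9: the same two rewards, now at delays 0 and 1.
print(value(5, 0), value(10, 1))    # 5.0 vs ~3.33 -> the preference reverses

# Geometric discounting (d(t) = gamma**t) never reverses a pairwise comparison like
# this, which is why hyperbolic discounting is called dynamically inconsistent.
```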
I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior. IE, something closer to your remark about commitments. The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it. Properly defining that notion of regret would allow a learner to one-box.
Maybe I’m still thinking about it the wrong way, but, I still am not convinced two-boxing is a legit values issue here. I think steep discounting doesn’t justify two-boxing. Imagine that we let both the ASP predictor and the agent think for a while before the first round—so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round. If the agent had been the sort of agent who would one-box, it would get higher reward on the first round. So it doesn’t seem to me like the steep-discounting agent “really prefers to two-box”.
The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent’s source code. I’m also imagining that the agent has the predictor’s source code.
Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞.
Ok. So in some sense you’re thinking in the “there are incorrect discounting functions” way (ie, steeper is always worse than less steep).
But, what if this is the last newcomblike problem the agent can reasonably expect to be involved in? There will be no significant pressure from the future, so the agent will tend to defect/two-box/etc. But the predictor can see this coming (even with relatively little processing power, as in ASP). So the agent does worse overall than a (predictable) 1-boxer.
And, of course, if there’s a predictable last interaction, the argument via pressure from the future tends to unwind so that actually there will be many rounds on which the gains from 1-boxing are lost rather than only one.
Ok, sort of, but then this makes the discounting “wrong” in the same way that hyperbolic discounting is “wrong”: dynamic inconsistency… I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior… The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it.
Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.
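To make the comparison concrete, here is a toy calculation. Everything in it is only an illustration of the assumption above: the Newcomb payoffs are made up, the predictor simply repeats the agent’s previous action (defaulting to a two-box prediction on round one), and beta is the effective per-encounter discount factor, which is steep because encounters are far apart.

```python
# Toy calculation: always-one-box vs always-two-box under two discount regimes.
# Assumed payoffs (illustrative): two-box vs predicted two-box = 1_000;
# one-box vs predicted one-box = 1_000_000; two-box vs predicted one-box = 1_001_000;
# one-box vs predicted two-box = 0.

PAYOFF = {("two", "two"): 1_000, ("one", "one"): 1_000_000,
          ("two", "one"): 1_001_000, ("one", "two"): 0}

def discounted_utility(action, beta, rounds=1000):
    """Total discounted utility of always playing `action` ('one' or 'two')."""
    total, prediction = 0.0, "two"  # the predictor defaults to "two" on the first round
    for t in range(rounds):
        total += (beta ** t) * PAYOFF[(action, prediction)]
        prediction = action  # the predictor sees this round before predicting the next
    return total

for beta in (0.0005, 0.5):
    print(beta, discounted_utility("one", beta), discounted_utility("two", beta))
# beta = 0.0005 (steep): always two-boxing wins, because the first round dominates.
# beta = 0.5 (shallow): always one-boxing wins by a large margin.
```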
Imagine that we let both the ASP predictor and the agent think for a while before the first round—so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round… The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent’s source code. I’m also imagining that the agent has the predictor’s source code.
I don’t think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.
Once again, we need to ask how the agent knows the predictor gets things right on the first round. It needs to learn this somehow. For example, maybe it encounters many predictors one after the other. But then, it will again learn a model in which one-boxing causes the predictor to cooperate.
(Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent’s random bits. Option (i) is possible for the case of complete models (because there is a deterministic Bayes-optimal policy; to establish weak feasibility you can use PSRL + pseudorandom), which is sufficient for Newcomb’s paradox. For incomplete models you usually want to randomize, although it is also possible to instead do deterministic maximin and optionally add an external random bit generator which is not a priori assumed to be invisible to the environment. In option (ii), you can consider a variant of Newcomb’s where the predictor responds to the probability of the agent making a particular choice. This variant is more difficult for reasons similar to what happens with counterfactual mugging: it is hard for the agent to discover this behavior empirically. I think that, as opposed to counterfactual mugging, you can capture it by an incomplete model, although it will be a relatively complex, difficult-to-learn model.)
So in some sense you’re thinking in the “there are incorrect discounting functions” way (ie, steeper is always worse than less steep).
Well, the agents I consider interesting start from a state of ignorance, so they have to learn. For this, the time discount has to be shallow, otherwise there is no time to learn.
Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.
I agree in the case where the predictor only becomes accurate later, ie, its initial guess is random/fixed, which is consistent with the way I originally brought ASP up. In that case, it absolutely makes sense to 2-box if your discounting is steep enough.
But, deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, The Dynamics of Rational Deliberation). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT), and only makes a decision at the end, then it makes sense to say you can’t successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see. So, we can think of something like a commitment which you can make while still refining probabilities.
Again, I agree this doesn’t make sense if the predictor is absolutely immovable (only takes our choice as evidence to refine its prediction in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can “see” some of the early things the agent does.
Of course, there is the question of how you would know that making an early commitment rather than thinking longer would be a good strategy. I’m imagining they get each other’s source code. We can think about it in a learning-theoretic way as follows. Suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time; think longer, make early commitments, whatever.
Actually, let’s simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I’m going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let’s say it doesn’t.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.
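Here is a very crude sketch of the kind of training loop I have in mind. Everything concrete in it is made up for illustration: the instance distribution, the payoffs, and the “context” bit, which stands in for whatever features the system extracts from an instance’s source code.

```python
# Crude sketch: train episodically on randomly sampled small instances, and learn
# when "commit early" beats "think until the end".
import random

def sample_small_instance():
    """Sample a small problem; half of them are 'ASP-like', i.e. instances where
    being predictably committed early is what pays off."""
    return random.random() < 0.5  # True = ASP-like

ACTIONS = ("commit_early", "think_full")

def score(asp_like, action):
    # Made-up payoffs: committing early wins exactly on ASP-like instances;
    # elsewhere the extra thinking time finds a better move.
    if asp_like:
        return 1.0 if action == "commit_early" else 0.3
    return 0.4 if action == "commit_early" else 1.0

# Episodic learning: each sampled instance is its own episode, and the learner
# just tracks the average score of each (context, action) pair.
estimates = {(ctx, a): 0.0 for ctx in (True, False) for a in ACTIONS}
counts = dict.fromkeys(estimates, 0)

for episode in range(5000):
    ctx = sample_small_instance()
    if random.random() < 0.1:  # occasional exploration
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: estimates[(ctx, x)])
    r = score(ctx, a)
    counts[(ctx, a)] += 1
    estimates[(ctx, a)] += (r - estimates[(ctx, a)]) / counts[(ctx, a)]

# The learned policy: commit early on ASP-like instances, think longer otherwise.
print({ctx: max(ACTIONS, key=lambda x: estimates[(ctx, x)]) for ctx in (True, False)})
```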
Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.
This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it’s useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.
I don’t think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.
It can do that, which would make it unpredictable. However, it can also choose not to. In this case it has no special motivation to diagonalize, so there’s no reason to do it, which means there’s no reason for the predictor to expect it.
Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent’s random bits.
Not sure whether this was a point you were trying to make, but, I agree that in the case of a perfect predictor learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.
I’m worrying that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn’t dead-end here. My picture of where we’re at:
You initially made a claim that you could get some subset of decision problems, such as XOR, via learning-theoretic means. I was trying to learn more about that.
You made a comment that problems outside the subset you can get are addressable with commitment mechanisms, while adding that you weren’t certain this should be a desideratum.
I was interested in the claim about commitment mechanisms, because I think you can get somewhat messy/patchwork kind-of-updateless-ish stuff out of that, but would be pretty interested if there’s an elegant version.
We got into differences of intuition about what is desirable.
We settled into going back and forth about ASP.
I’ve gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I’m still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.
...But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see.
Yes, sounds sort of reasonable. Here is how I think you can realize this using TRL.
As usual, we consider the agent playing an IPD against a predictor (Newcomb’s paradox is essentially playing the Prisoner’s Dilemma against a “FairBot”). On each round, the predictor gets to see the agent’s state at the start of the round. (The state can be considered part of the “source code”. For randomizing agents, we also assume the predictor sees the random bits.) The predictor then tries to simulate the agent (we assume it knows the rest of the agent’s source code as well), and is successful if the agent doesn’t execute any programs that are too expensive for the predictor (for the sake of simplicity, assume that no program started on one round continues running during following rounds: I don’t think that this assumption makes a difference of principle). Otherwise, the prediction might be wrong (for example, we can assume it defaults to D). The predictor then plays D or C according to its prediction of the agent’s action.
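Here is a toy rendering of one round of this protocol, with “expensive programs” caricatured as an explicit step budget; this is only my simplification, not the TRL formalism itself.

```python
# Toy rendering of one round: the predictor sees the agent's state, tries to simulate
# the agent within its budget, and plays whatever action it predicts.

DEFAULT_PREDICTION = "D"  # what the predictor falls back to when simulation fails

def predictor_move(agent_policy, agent_state, agent_cost, predictor_budget):
    """FairBot-style predictor: play the action it predicts for the agent."""
    if agent_cost(agent_state) <= predictor_budget:
        return agent_policy(agent_state)   # simulation succeeds
    return DEFAULT_PREDICTION              # simulation too expensive; may be wrong

def play_round(agent_policy, agent_state, agent_cost, predictor_budget, payoff):
    predictor_action = predictor_move(agent_policy, agent_state, agent_cost, predictor_budget)
    agent_action = agent_policy(agent_state)
    return payoff[(agent_action, predictor_action)]

# Example: an agent that runs nothing expensive and cooperates gets U(CC).
PD_PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
cheap_cooperator = lambda state: "C"
print(play_round(cheap_cooperator, agent_state=None,
                 agent_cost=lambda s: 1, predictor_budget=10, payoff=PD_PAYOFF))  # -> 2
```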
In this setting, the agent can learn the incomplete hypothesis “if I don’t run expensive programs and I play C, the predictor will also play C”. (We assume that the prior allows for side effects of executing programs. Such a prior seems more realistic anyway, and in particular is required to counter non-Cartesian daemons. However, it also has a cost, so perhaps what we really want is a prior that is biased towards few side effects: but, this is immaterial for the current discussion.) This hypothesis guarantees a payoff of U(CC). Assuming that the predictor cannot be exploited, this is the best possible payoff and therefore the agent will converge to cooperation.
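The incomplete hypothesis can be rendered as a predicate over a single round of the toy protocol above; the point is that it only constrains outcomes on rounds where its precondition holds, and says nothing elsewhere.

```python
# The incomplete hypothesis from the text, as a per-round predicate (my caricature).

def hypothesis_satisfied(agent_ran_expensive_program: bool,
                         agent_action: str,
                         predictor_action: str) -> bool:
    precondition = (not agent_ran_expensive_program) and agent_action == "C"
    return (not precondition) or predictor_action == "C"

# If the hypothesis is true, an agent that satisfies the precondition on every round
# is guaranteed the cooperate-cooperate payoff U(CC) on every round.
print(hypothesis_satisfied(False, "C", "C"))  # True: precondition met, predictor cooperates
print(hypothesis_satisfied(True, "D", "D"))   # True: precondition not met, nothing is claimed
```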
We might want a more explicit realization of the “simulates” part in “agent simulates predictor”. For this, we can assume the agent also receives its own state as an observation (but, I’m not sure how generally useful this is for realistic agents). The agent can then also learn the incomplete hypothesis describing the exact function from agent states to predictor output. However, this hypothesis doesn’t affect the outcome: it doesn’t predict the agent’s state and therefore can only guarantee the payoff U(DD).