This might seem surprising at first, because there is also a different incomplete model Φ2 that says “if you pays the blackmail, infestation will not happen”. Φ2 is false if you use physical causal counterfactuals, but from the agent’s perspective Φ2 is consistent with all observations. However, Φ2 only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ2 will have no effect on the ultimate behavior of the agent.
What happens in ASP? (Say you’re in an iterated Newcomb’s problem with a predictor much slower than you, but which meets the LIC or similar.) I’m concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says “you can get the higher reward by one-boxing” (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says “you can get higher reward by two-boxing” (ie, the agent’s action doesn’t control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails—granted, for cases outside of its definition of “fair”.
The precommitments have to expire after some finite time.
I still have a little hope that there will be a nice version, which doesn’t involve a commitment-races problem and which doesn’t make use of an arbitrary commitment cutoff. But I would agree that things don’t look good, and so it is reasonable to put this kind of thing outside of “fair” problems.
Let me add that I am not even sure what are the correct desiderata. In particular, I don’t think that we should expect any group of good agents to converge to a Pareto optimal outcome.
I don’t currently see why we shouldn’t ask to converge to pareto optima. Obviously, we can’t expect to do so with arbitrary other agents; but it doesn’t seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve pareto optima, but there exist strategies which achieve pareto optima with a broad-ish class of other strategies, including others who use strategies like their own—while being very difficult to exploit).
But yeah, I’m pretty uncertain about what the desiderata should be—both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.
What happens in ASP? (Say you’re in an iterated Newcomb’s problem with a predictor much slower than you, but which meets the LIC or similar.)
I am not sure what you mean by “meets the LIC or similar” in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent’s past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent’s prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent’s source code and ignores the agent’s choices, the agent will converge to two-boxing.
I was never convinced that “logical ASP” is a “fair” problem. I once joked with Scott that we can consider a “predictor” that is just the single line of code “return DEFECT” but in the comments it says “I am defecting only because I know you will defect.” It was a joke, but it was half-serious. The notion of “weak predictor” taken to the limit leads to absurdity, and if you don’t take it to the limit it might still lead to absurdity. Logical inductors in one way to try specifying a “weak predictor”, but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.
I still have a little hope that there will be a nice version, which doesn’t involve a commitment-races problem and which doesn’t make use of an arbitrary commitment cutoff.
I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.
...it doesn’t seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm.
Maybe? The questions are, how robust is this cooperation (i.e. what counts as “same algorithm”), and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.
I am not sure what you mean by “meets the LIC or similar” in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent’s past choices),
Yeah, that’s what I meant.
, then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent’s prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box.
Suppose the interval between encounters with the predictor is long enough that, due to the agent’s temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also preferring to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
But I wasn’t sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.
As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn’t literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but “smaller”) scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.
I was never convinced that “logical ASP” is a “fair” problem. I once joked with Scott that we can consider a “predictor” that is just the single line of code “return DEFECT” but in the comments it says “I am defecting only because I know you will defect.” It was a joke, but it was half-serious. The notion of “weak predictor” taken to the limit leads to absurdity, and if you don’t take it to the limit it might still lead to absurdity. Logical inductors in one way to try specifying a “weak predictor”, but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.
Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I’m fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.
I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.
Yeah, I guess we can do a variety of things:
Naming a time limit for the commitment.
Naming a time at which a time limit for the commitment will be named.
Naming an ordinal (in some ordinal notation), so that a smaller ordinal must be named every time-step, until a smaller ordinal cannot be named, at which point the commitment runs out
I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.
Thinking about LI policy selection where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n –– there’s this weird trade-off between how (apparently) “good” the updatelessness is vs how long it takes to be any good at all. I’m fine with notions of rationality being parameterized by an ordinal or some such if it’s just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.
So it would be nice to have a rationality notion which clarified this situation.
My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren’t as nice in the non-realizable cases such as logical uncertainty. But it doesn’t make sense to abandon updateless principles altogether because of this!
Suppose the interval between encounters with the predictor is long enough that, due to the agent’s temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also preferring to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short term gain, so that’s what it does. Ofc if there a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.
Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞. That is, when I say the agent “converges” to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousands years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed “semi-anytime” algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don’t depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short term gain, so that’s what it does. Ofc if there a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.
Ok, sort of, but then this makes the discounting “wrong” in the same way that hyperbolic discounting is “wrong”: dynamic inconsistency. One might then say such decision-theoretic questions reduce to a question of what the right way to discount is (constraints on discounting functions such that we rule out the ones which are “wrong”). I find this perspective somewhat plausible but not obviously right.
I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior. IE, something closer to your remark about commitments. The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it. Properly defining that notion of regret would allow a learner to one-box.
Maybe I’m still thinking about it the wrong way, but, I still am not convinced two-boxing is a legit values issue here. I think steep discounting doesn’t justify two-boxing. Imagine that we let both the ASP predictor and the agent think for a while before the first round—so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round. If the agent had been the sort of agent who would one-box, it would get higher reward on the first round. So it doesn’t seem to me like the steep-discounting agent “really prefers to two-box”.
The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent’s source code. I’m also imagining that the agent has the predictor’s source code.
Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞.
Ok. So in some sense you’re thinking in the “there are incorrect discounting functions” way (ie, steeper is always worse than less steep).
But, what if this is the last newcomblike problem the agent can reasonably expect to be involved in? There will be no significant pressure from the future, so the agent will tend to defect/two-box/etc. But the predictor can see this coming (even with relatively little processing power, as in ASP). So the agent does worse overall than a (predictable) 1-boxer.
And, of course, if there’s a predictable last interaction, the argument via pressure from the future tends to unwind so that actually there will be many rounds on which the gains from 1-boxing are lost rather than only one.
Ok, sort of, but then this makes the discounting “wrong” in the same way that hyperbolic discounting is “wrong”: dynamic inconsistency… I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior… The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it.
Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.
Imagine that we let both the ASP predictor and the agent think for a while before the first round—so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round… The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent’s source code. I’m also imagining that the agent has the predictor’s source code.
I don’t think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.
Once again, we need to ask how does the agent know the predictor gets things right on the first round. It needs to learn it somehow. For example, maybe in encounters many predictors one after the other. But then, it will again learn a model in which one-boxing causes the predictor to cooperate.
(Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent’s random bits. Option (i) is possible for the case of complete models (because there is a deterministic Bayes-optimal policy; to establish weak feasibility you can use PSRL + pseudorandom), which is sufficient for Newcomb’s paradox. For incomplete models you usually want to randomize, although it also possible to instead do deterministic maximin and optionally add an external random bit generator which is not a priori assumed to be invisible to the environment. In option (ii), you can consider a variant of Newcomb’s where the predictor responds to the probability of the agent making a particular choice. This variant is more difficult for reasons similar to what happens with counterfactual mugging: it is hard for the agent to discover this behavior empirically. I think that, as opposed to counterfactual mugging, you can capture it by an incomplete model, although it will be a relatively complex, difficult to learn, model.)
So in some sense you’re thinking in the “there are incorrect discounting functions” way (ie, steeper is always worse than less steep).
Well, the agents I consider interesting start from a state of ignorance, so they have to learn. For this, the time discount has to be shallow, otherwise there is no time to learn.
Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.
I agree in the case where the predictor is only good at all later, ie, its initial guess is random/fixed. Which is consistent with the way I originally brought ASP up. In that case, it absolutely makes sense to 2-box if your discounting is steep enough.
But, deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, The Dynamics of Rational Deliberation). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT), and only makes a decision at the end, then it makes sense to say you can’t successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see. So, we can think of something like a commitment which you can make while still refining probabilities.
Again, I agree this doesn’t make sense if the predictor is absolutely immovable (only takes our choice as evidence to refine its prediction in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can “see” some of the early things the agent does.
Of course, there is the question of how you would know that making an early commitment rather than thinking longer would be a good strategy. I’m imagining they get each other’s source code. We can think about it in a learning-theoretic way as follows. Suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time; think longer, make early commitments, whatever.
Actually, let’s simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I’m going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let’s say it doesn’t.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.
Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.
This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it’s useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.
I don’t think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.
It can do that, which would make it unpredictable. However, it can also not do that. In this case it has no special motivation to do that, so there’s no reason to, which means there’s no reason for the predictor to expect that.
Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent’s random bits.
Not sure whether this was a point you were trying to make, but, I agree that in the case of a perfect predictor learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.
I’m worrying that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn’t dead-end here. My picture of where we’re at:
You initially made a claim that you could get some subset of decision problems, such as XOR, via learning-theoretic means. I was trying to learn more about that.
You made a comment about problems outside the subset you can get being addressable with commitment mechanisms, while adding that you weren’t certain this should be a desiderata.
I was interested in the claim about commitment mechanisms, because I think you can get somewhat messy/patchwork kind-of-updateless-ish stuff out of that, but would be pretty interested if there’s an elegant version.
We started into differences of intuition about what is desirable.
We settled into going back and forth about ASP.
I’ve gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I’m still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.
...But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see.
Yes, sounds sort of reasonable. Here is how I think you can realize this using TRL.
As usual, we consider the agent playing an IPD against a predictor (Newcomb’s paradox is essentially playing the Prisoner’s Dilemma against a “FairBot”). On each round, the predictor gets to see the agent’s state at the start of the round. (The state can be considered part of the “source code”. For randomizing agents, we also assume the predictors sees the random bits). The predictor then tries to simulate the agent (we assume it knows the rest of the agent’s source code as well), and is successful if the agent doesn’t execute any programs that are too expensive for the predictor (for the sake of simplicity, assume that no program started on one round continues running during following rounds: I don’t think that this assumption makes a difference of principle). Otherwise, the prediction might be wrong (for example, we can assume it defaults to D). The predictor then plays D or C according to its prediction of the agent’s action.
In this setting, the agent can learn the incomplete hypothesis “if I don’t run expensive programs and I play C, the predictor will also play C”. (We assume that the prior allows for side effects of executing programs. Such a prior seems more realistic anyway, and in particular is required to counter non-Cartesian daemons. However, it also has a cost, so perhaps what we really want is a prior that is biased towards few side effects: but, this is immaterial for the current discussion.) This hypotheses guarantees a payoff of U(CC). Assuming that the predictor cannot be exploited, this is the best possible payoff and therefore the agent will converge to cooperation.
We might want a more explicit realization of the “simulates” part in “agent simulates predictor”. For this, we can assume the agent also receives its own state as an observation (but, I’m not sure how generally useful is this for realistic agents). The agent can then also learn the incomplete hypothesis describing the exact function from agent states to predictor output. However, this hypothesis doesn’t affect the outcome: it doesn’t predict the agent’s state and therefore can only guarantee the payoff U(DD).
I was never convinced that “logical ASP” is a “fair” problem. I once joked with Scott that we can consider a “predictor” that is just the single line of code “return DEFECT” but in the comments it says “I am defecting only because I know you will defect.”
I’m leaning this way as well, but I think it’s an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what the thing making the commitment is?
When an agent can manipulate a predictor in some sense, what should the predictor do? If it starts scheming with its thoughts, it’s no longer a predictor, it’s just another agent that wants to do something “predictory”. Maybe it can only give up, as in ASP, which acts as a precommitment that’s more thematically fitting for a predictor than for a general agent. It’s still a commitment race then, but possibly the meaning of something being a predictor is preserved by restricting the kind of commitment that it makes: the commitment of a non-general agent is what it is rather than what it does, and a general agent is only committed to its preference. Thus a general agent loses all knowledge in an attempt to out-commit others, because it hasn’t committed to that knowledge, didn’t make it part of what it is.
I don’t currently see why we shouldn’t ask to converge to pareto optima. Obviously, we can’t expect to do so with arbitrary other agents; but it doesn’t seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm.
I had an interesting thought that made me update towards your position. There is my old post about “metathreat equilibria”, an idea I developed with Scott’s help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:
Consider the Iterated Prisoner’s Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, at the limit γ→1 (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.
Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won’t work (just like allowing arbitrary programs doesn’t work in the “self-modification” setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn’t natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constraint the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid “grim triggers” where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:
Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, at the limit γ→1, all thermodynamic equilibria are Pareto efficient.
This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if it the conjecture requires some moderately strong additional assumptions, it seems worth studying.
What happens in ASP? (Say you’re in an iterated Newcomb’s problem with a predictor much slower than you, but which meets the LIC or similar.) I’m concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says “you can get the higher reward by one-boxing” (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says “you can get higher reward by two-boxing” (ie, the agent’s action doesn’t control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails—granted, for cases outside of its definition of “fair”.
I agree that something like this generally does the right thing in most cases, with the exception of superrationality in games as a result of commitment races.
I still have a little hope that there will be a nice version, which doesn’t involve a commitment-races problem and which doesn’t make use of an arbitrary commitment cutoff. But I would agree that things don’t look good, and so it is reasonable to put this kind of thing outside of “fair” problems.
I don’t currently see why we shouldn’t ask to converge to pareto optima. Obviously, we can’t expect to do so with arbitrary other agents; but it doesn’t seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve pareto optima, but there exist strategies which achieve pareto optima with a broad-ish class of other strategies, including others who use strategies like their own—while being very difficult to exploit).
But yeah, I’m pretty uncertain about what the desiderata should be—both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.
I am not sure what you mean by “meets the LIC or similar” in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent’s past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent’s prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent’s source code and ignores the agent’s choices, the agent will converge to two-boxing.
I was never convinced that “logical ASP” is a “fair” problem. I once joked with Scott that we can consider a “predictor” that is just the single line of code “return DEFECT” but in the comments it says “I am defecting only because I know you will defect.” It was a joke, but it was half-serious. The notion of “weak predictor” taken to the limit leads to absurdity, and if you don’t take it to the limit it might still lead to absurdity. Logical inductors in one way to try specifying a “weak predictor”, but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.
I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.
Maybe? The questions are, how robust is this cooperation (i.e. what counts as “same algorithm”), and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.
Yeah, that’s what I meant.
Suppose the interval between encounters with the predictor is long enough that, due to the agent’s temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also preferring to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
But I wasn’t sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.
As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn’t literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but “smaller”) scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.
Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I’m fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.
Yeah, I guess we can do a variety of things:
Naming a time limit for the commitment.
Naming a time at which a time limit for the commitment will be named.
Naming an ordinal (in some ordinal notation), so that a smaller ordinal must be named every time-step, until a smaller ordinal cannot be named, at which point the commitment runs out
I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.
Thinking about LI policy selection where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n –– there’s this weird trade-off between how (apparently) “good” the updatelessness is vs how long it takes to be any good at all. I’m fine with notions of rationality being parameterized by an ordinal or some such if it’s just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.
So it would be nice to have a rationality notion which clarified this situation.
My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren’t as nice in the non-realizable cases such as logical uncertainty. But it doesn’t make sense to abandon updateless principles altogether because of this!
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short term gain, so that’s what it does. Ofc if there a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.
Btw, I am always interested in the asymptotic when the time discount parameter goes to 1, rather than the asymptotic when time discount is fixed and time goes to ∞. That is, when I say the agent “converges” to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousands years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed “semi-anytime” algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don’t depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)
Ok, sort of, but then this makes the discounting “wrong” in the same way that hyperbolic discounting is “wrong”: dynamic inconsistency. One might then say such decision-theoretic questions reduce to a question of what the right way to discount is (constraints on discounting functions such that we rule out the ones which are “wrong”). I find this perspective somewhat plausible but not obviously right.
I wish I could get you to see the other possible perspective, in which there is something going wrong which is not about discounting behavior. IE, something closer to your remark about commitments. The two-boxer we are discussing can imagine the one-boxing strategy and can have regret with respect to it. Properly defining that notion of regret would allow a learner to one-box.
Maybe I’m still thinking about it the wrong way, but, I still am not convinced two-boxing is a legit values issue here. I think steep discounting doesn’t justify two-boxing. Imagine that we let both the ASP predictor and the agent think for a while before the first round—so the agent still has a lot more processing power than the predictor, but, the predictor is likely to get things right even on the first round. If the agent had been the sort of agent who would one-box, it would get higher reward on the first round. So it doesn’t seem to me like the steep-discounting agent “really prefers to two-box”.
The predictor can do well w/o any actual rounds being run because it is a logical inductor, so, is learning from closely analogous cases across logical possibility. Of course for this to make sense I have to assume the predictor has access to the agent’s source code. I’m also imagining that the agent has the predictor’s source code.
Ok. So in some sense you’re thinking in the “there are incorrect discounting functions” way (ie, steeper is always worse than less steep).
But, what if this is the last newcomblike problem the agent can reasonably expect to be involved in? There will be no significant pressure from the future, so the agent will tend to defect/two-box/etc. But the predictor can see this coming (even with relatively little processing power, as in ASP). So the agent does worse overall than a (predictable) 1-boxer.
And, of course, if there’s a predictable last interaction, the argument via pressure from the future tends to unwind so that actually there will be many rounds on which the gains from 1-boxing are lost rather than only one.
Look, a one-boxing agent would get less utility. Yes, it would have higher rewards on subsequent rounds, but the first round is more important, by the very assumption you made (steep time discount). Moreover, the two-boxing agent would get higher utility than any agent that one-boxes on some of the rounds, so on all rounds the correct action is two-box.
I don’t think it makes sense to simultaneously assume that the agent has a lot more power than the predictor and that the predictor is likely to get things right even on the first round. If the agent has a lot more power than the predictor, then the agent can e.g. diagonalize against the predictor and make sure it will not get things right.
Once again, we need to ask how does the agent know the predictor gets things right on the first round. It needs to learn it somehow. For example, maybe in encounters many predictors one after the other. But then, it will again learn a model in which one-boxing causes the predictor to cooperate.
(Notice that having a perfect predictor requires that either (i) the agent is deterministic or (ii) the predictor has access to the agent’s random bits. Option (i) is possible for the case of complete models (because there is a deterministic Bayes-optimal policy; to establish weak feasibility you can use PSRL + pseudorandom), which is sufficient for Newcomb’s paradox. For incomplete models you usually want to randomize, although it also possible to instead do deterministic maximin and optionally add an external random bit generator which is not a priori assumed to be invisible to the environment. In option (ii), you can consider a variant of Newcomb’s where the predictor responds to the probability of the agent making a particular choice. This variant is more difficult for reasons similar to what happens with counterfactual mugging: it is hard for the agent to discover this behavior empirically. I think that, as opposed to counterfactual mugging, you can capture it by an incomplete model, although it will be a relatively complex, difficult to learn, model.)
Well, the agents I consider interesting start from a state of ignorance, so they have to learn. For this, the time discount has to be shallow, otherwise there is no time to learn.
I agree in the case where the predictor is only good at all later, ie, its initial guess is random/fixed. Which is consistent with the way I originally brought ASP up. In that case, it absolutely makes sense to 2-box if your discounting is steep enough.
But, deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, The Dynamics of Rational Deliberation). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT), and only makes a decision at the end, then it makes sense to say you can’t successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over an amount of time, then you could possibly make the decision early on in your deliberation, so that the predictor can see. So, we can think of something like a commitment which you can make while still refining probabilities.
Again, I agree this doesn’t make sense if the predictor is absolutely immovable (only takes our choice as evidence to refine its prediction in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can “see” some of the early things the agent does.
Of course, there is the question of how you would know that making an early commitment rather than thinking longer would be a good strategy. I’m imagining they get each other’s source code. We can think about it in a learning-theoretic way as follows. Suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time; think longer, make early commitments, whatever.
Actually, let’s simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I’m going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let’s say it doesn’t.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.
Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.
This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it’s useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.
It can do that, which would make it unpredictable. However, it can also not do that. In this case it has no special motivation to do that, so there’s no reason to, which means there’s no reason for the predictor to expect that.
Not sure whether this was a point you were trying to make, but, I agree that in the case of a perfect predictor learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.
I’m worrying that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn’t dead-end here. My picture of where we’re at:
You initially made a claim that you could get some subset of decision problems, such as XOR, via learning-theoretic means. I was trying to learn more about that.
You made a comment about problems outside the subset you can get being addressable with commitment mechanisms, while adding that you weren’t certain this should be a desiderata.
I was interested in the claim about commitment mechanisms, because I think you can get somewhat messy/patchwork kind-of-updateless-ish stuff out of that, but would be pretty interested if there’s an elegant version.
We started into differences of intuition about what is desirable.
We settled into going back and forth about ASP.
I’ve gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I’m still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.
Yes, sounds sort of reasonable. Here is how I think you can realize this using TRL.
As usual, we consider the agent playing an IPD against a predictor (Newcomb’s paradox is essentially playing the Prisoner’s Dilemma against a “FairBot”). On each round, the predictor gets to see the agent’s state at the start of the round. (The state can be considered part of the “source code”. For randomizing agents, we also assume the predictors sees the random bits). The predictor then tries to simulate the agent (we assume it knows the rest of the agent’s source code as well), and is successful if the agent doesn’t execute any programs that are too expensive for the predictor (for the sake of simplicity, assume that no program started on one round continues running during following rounds: I don’t think that this assumption makes a difference of principle). Otherwise, the prediction might be wrong (for example, we can assume it defaults to D). The predictor then plays D or C according to its prediction of the agent’s action.
In this setting, the agent can learn the incomplete hypothesis “if I don’t run expensive programs and I play C, the predictor will also play C”. (We assume that the prior allows for side effects of executing programs. Such a prior seems more realistic anyway, and in particular is required to counter non-Cartesian daemons. However, it also has a cost, so perhaps what we really want is a prior that is biased towards few side effects: but, this is immaterial for the current discussion.) This hypotheses guarantees a payoff of U(CC). Assuming that the predictor cannot be exploited, this is the best possible payoff and therefore the agent will converge to cooperation.
We might want a more explicit realization of the “simulates” part in “agent simulates predictor”. For this, we can assume the agent also receives its own state as an observation (but, I’m not sure how generally useful is this for realistic agents). The agent can then also learn the incomplete hypothesis describing the exact function from agent states to predictor output. However, this hypothesis doesn’t affect the outcome: it doesn’t predict the agent’s state and therefore can only guarantee the payoff U(DD).
I’m leaning this way as well, but I think it’s an important clue to figuring out commitment races. ASP Predictor, DefectBot, and a more general agent will make different commitments, and these things are already algorithms specialized for certain situations. How is the chosen commitment related to what the thing making the commitment is?
When an agent can manipulate a predictor in some sense, what should the predictor do? If it starts scheming with its thoughts, it’s no longer a predictor, it’s just another agent that wants to do something “predictory”. Maybe it can only give up, as in ASP, which acts as a precommitment that’s more thematically fitting for a predictor than for a general agent. It’s still a commitment race then, but possibly the meaning of something being a predictor is preserved by restricting the kind of commitment that it makes: the commitment of a non-general agent is what it is rather than what it does, and a general agent is only committed to its preference. Thus a general agent loses all knowledge in an attempt to out-commit others, because it hasn’t committed to that knowledge, didn’t make it part of what it is.
I had an interesting thought that made me update towards your position. There is my old post about “metathreat equilibria”, an idea I developed with Scott’s help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:
Consider the Iterated Prisoner’s Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, at the limit γ→1 (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.
Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won’t work (just like allowing arbitrary programs doesn’t work in the “self-modification” setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn’t natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constraint the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid “grim triggers” where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:
Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, at the limit γ→1, all thermodynamic equilibria are Pareto efficient.
This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if it the conjecture requires some moderately strong additional assumptions, it seems worth studying.