UDT1.01: Local Affineness and Influence Measures (2/10)
Attention Conservation Notice: This is a moderately mathy post.
Affineness, it’s Useful!
So, if we’re going to be restricting the sorts of environments we’re considering, and trying to build an algorithm that’s closer to UDT1.0 (just pick your action to optimize global utility without the whole “coordinating with alternate versions of me” aspect), it’s probably worthwhile to consider what sorts of environments naturally work well with UDT1.0. What sorts of environments/games let you craft a notion of “the effect that a single player is having” in a way that doesn’t really depend on what everyone else is doing?
If we’ve got some global high-dimensional maximization problem, $\max_{x,y,z}f(x,y,z)$, what sorts of functions let you split it into a bunch of little low-dimensional maximization problems that don’t interact? Well, if $f$ is a sum of a bunch of other functions which only depend on a single input, then something like $\max_{x,y,z}(f_1(x)+f_2(y)+f_3(z))$ splits up as $\max_x f_1(x)+\max_y f_2(y)+\max_z f_3(z)$.
If we’ve got some game with a bunch of players, what sort of game lets you consider the plays of the players individually without worrying about interactions between the actions of the various different players? Linear games. If $A_i$ is the space of player $i$’s actions, then a game in general would have every player $i$ be associated with a utility function $U_i:(\prod_j A_j)\to[0,1]$. But in linear games, everyone’s utility function $U_i$ breaks down as a weighted sum of component utility functions $U_{i,j}:A_j\to[0,1]$, which could be viewed as “how much player $i$ likes or dislikes the action that player $j$ took”. No interaction between the different actions. How much player $i$ approves of $j$’s action has nothing to do with the actions that everyone else is taking. I actually have a whole ton to say about linear games (they’re incredibly interesting), but that’s probably a matter for another post that isn’t a part of this sequence.
So what’s the analogue of this “split stuff into a weighted sum” thing for policy selection environments? Well, permitting probabilistic choice of actions, if $O^{<n}$ is the space of histories of length less-than-$n$ (places where we can make a decision), and our action space is $A$, and we’re playing up to a time horizon of $n$, then the space of policies is…
$$\Pi := O^{<n}\to\Delta A$$
And, from the previous post, a policy-selection environment (in full generality) is one where there’s a function
$$O^{<n}\to(\Pi\to\Delta O)$$
which, given a position/situation/history, maps your policy to a probability distribution over which observation applies next. Hang on. Deabbreviating the types and rearranging a little bit, we can rewrite this type signature as…
$$O^{<n}\to\bigl((\Delta A)^{O^{<n}}\to\Delta O\bigr)$$
Hm… What sort of function (in that second half) would correspond to our desired notion of being able to make a choice of what-to-do in a way that doesn’t depend on what we’re doing everywhere else?
Answer: an affine function. Affine functions are linear functions, plus some constant. Affineness is equivalent to “the derivative is constant everywhere”, which is a compressed way of saying “twiddling some input coordinates (ie, tampering with our probabilities of actions in some situation) has an effect on the output coordinates (the probabilities of an observation) which is independent of which point you picked (ie, which overall policy you have)”. And that’s just our desired notion of being able to talk about “the effect of my action right here” in a way that doesn’t depend on what our policy is doing at other places.
Definition 1: Strong Locally Affine Environment
A policy selection environment $e:O^{<n}\to(\Pi\to\Delta O)$ (which maps a history and global policy to a probability distribution over the upcoming observation) is a strong locally affine environment iff, for all histories $h$, the function $\pi\mapsto e(h,\pi)$, of type signature $(\Delta A)^{O^{<n}}\to\Delta O$, is an affine function.
For the next definition, $\mathrm{Aff}(\Delta O)$, the affine hull of probability distributions, is just “the set of vectors whose entries sum up to 1”.
$$\mathrm{Aff}(\Delta O):=\Bigl\{x\in\mathbb{R}^O \,\Bigm|\, \sum_{o\in O}x_o=1\Bigr\}$$
Definition 2: Locally Affine Environment
Same as Definition 1, but instead of the type signature having ΔO, it has Aff(ΔO) instead.
Benefits of Locally Affine Environments
There are three enormous benefits we get from dealing with locally affine environments.
The first benefit is flexibility. If we have some notion of a “default” policy, which we’re evaluating deviations from, we can stuff pretty much every policy selection environment into this framework as a locally affine environment. Why? Well, given some fancy function from one vector space to another, you can pick a point and differentiate there to get a “tangent line/tangent plane/tangent function” that’s a decent local approximation of the fancy function, and is also affine. So, for our full function which maps our policy to probability of the next observation, of type $(\Delta A)^{O^{<n}}\to\Delta O$, we can pick an input point (a “default” policy that we’re evaluating small changes from), and differentiate, and the “tangent” function we get will be an affine function of type $(\Delta A)^{O^{<n}}\to\mathrm{Aff}(\Delta O)$. Oh hey, that’s a locally affine environment!
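To make the “differentiate around a default policy” move concrete, here’s a minimal numerical sketch. Everything in it (the toy `env` function, the baseline, the finite-difference step) is made up purely for illustration; the point is just that linearizing at a baseline policy gives an affine map whose outputs are only guaranteed to land in $\mathrm{Aff}(\Delta O)$, not $\Delta O$.

```python
import numpy as np

# Toy setting: 3 decision points, 2 actions each. A policy is a vector pi in
# [0,1]^3, where pi[j] = probability of playing action 1 at decision point j.
# env(pi) = probability of observation 1 after some fixed history, as a
# (nonlinear) function of the whole policy. Everything here is invented.
def env(pi):
    return 0.1 + 0.8 * pi[0] * pi[1] * pi[2]

def linearize(f, baseline, eps=1e-5):
    """Differentiate f at `baseline`: f(pi) ~= f(baseline) + grad . (pi - baseline)."""
    base_val = f(baseline)
    grad = np.zeros_like(baseline)
    for j in range(len(baseline)):
        bumped = baseline.copy()
        bumped[j] += eps
        grad[j] = (f(bumped) - base_val) / eps
    return base_val, grad

baseline = np.array([0.5, 0.5, 0.5])
c, g = linearize(env, baseline)

def affine_env(pi):
    # The "tangent" environment: affine in the policy.
    return c + g @ (np.asarray(pi) - baseline)

print(affine_env(baseline), env(baseline))                # both 0.2: exact at the baseline
print(affine_env([0.0, 0.0, 0.0]), env([0.0, 0.0, 0.0]))  # ~-0.1 vs 0.1: the tangent map
# can output things outside [0,1], which is why its codomain is Aff(ΔO) rather than ΔO.
```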
The second benefit is in the complexity of specifying a policy selection environment. From the last post, a general environment
$$O^{<n}\to(\Pi\to\Delta O)$$
would, since there are about $|A|^{|O|^n}$ policies, take about $|A|^{|O|^n}\cdot|O|^{n+1}$ numbers to specify. A “locally affine environment”, by contrast, would take about (if I did my math right) $|A|\cdot|O|^{2n+1}$ numbers to specify. We just went from double-exponential to exponential right there! The reason we get such a savings is akin to how there are $n^m$ functions from the $m$-element set to the $n$-element set, but only $n\cdot m$ numbers are needed to describe a linear function $\mathbb{R}^m\to\mathbb{R}^n$. You no longer need to describe what effects all policies have, just how changes in a policy at one place affect what happens at another place.
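For a feel of the sizes involved, here’s the arithmetic for a small made-up case with two actions, two observations, and a horizon of 10 (the numbers are purely illustrative):

```python
# Rough parameter counts from the text, for a made-up small case:
# |A| = |O| = 2 actions/observations and time horizon n = 10.
A, O, n = 2, 2, 10

num_policies = A ** (O ** n)                  # ~ |A|^(|O|^n) deterministic policies
general_env  = num_policies * O ** (n + 1)    # ~ |A|^(|O|^n) * |O|^(n+1) numbers
affine_env   = A * O ** (2 * n + 1)           # ~ |A| * |O|^(2n+1) numbers

print(len(str(general_env)))   # 312: the general count has ~312 digits (double-exponential)
print(affine_env)              # 4194304: merely exponential in n
```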
And the third benefit is that it produces the concept of an “influence measure”, which is the indispensable math tool for thinking at all clearly about the question “would UDT lock in a really stupid policy at the start of time?”
Influence Measures!!
So, we seem to have our natural candidate for policy selection environments where UDT1.0 might possibly perform well. Namely, policy selection environments
$$O^{<n}\to\bigl((\Delta A)^{O^{<n}}\to\Delta O\bigr)$$
where that latter function $(\Delta A)^{O^{<n}}\to\Delta O$ isn’t just any old function, but is affine. Which is equivalent to “for every pair of histories $h,h'$, it is possible to meaningfully talk about the influence of actions at $h'$ on the probability of what comes after $h$, in a way which doesn’t depend on how our policy is behaving elsewhere”.
This is very close to the concept of an “influence measure”. But what’s that?
Well, influence measures can be defined (complicatedly) in the general setting, but they shine most clearly and are most comprehensible where there are only two actions and two observations. Let’s use B as an abbreviation for {0,1}, the two states of a bit. So we have A=O=B, and ΔO and ΔA both turn into the interval [0,1].
Now, fix a particular history/bitstring $b$. We’re looking at the odds of it being extended with 0 or 1. If your environment is locally affine, then there’s an affine function of type $[0,1]^{B^{<n}}\to\mathbb{R}$ that $b$ is associated with. But the space of affine functions like this just so happens to be isomorphic to…
$$M^{\pm}(B^{<n})\times\mathbb{R}$$
Which is the space of signed measures over bitstrings (plus a constant)!! So, shoving our affine function through this isomorphism gets us the “signed influence measure of $b$”. Don’t worry, the constant doesn’t matter, because that constant will end up being “what score does the policy that plays action 0 everywhere get?”, and we don’t care about that.
Wait… what’s a signed measure?? Well, basically it’s like a probability measure, except it can be negative in places, and doesn’t have to add up to 1.
If a bitstring b′ has positive signed influence measure, that means that if you play more of action 1 at b′, then the odds of observation 1 at b go up. If the signed influence measure is 0 at b′, then playing more of action 1 at b′ has no effect on the probability of the various observations. If the signed influence measure is negative at b′, then playing more of action 1 at b′ boosts the odds of observation 0 at b instead. So the sign of the signed measure is encoding the direction of the influence.
Given a signed measure m, its “absolute measure” |m| is basically the measure (no negative parts) you’d get by flipping all the negative-measure parts of m to positive measure. It’s measuring the magnitude of influence. This is the “influence measure”.
So, the “influence measure of b” is the absolute measure associated with the signed measure that’s associated with the affine function $\pi\mapsto e(b,\pi)$. It’s a measure over bitstrings b′. If the influence measure of b is high on b′, that means “your action in situation b′ has a large influence on what happens in situation b”. If the influence measure of b is low on b′, that means “your action in situation b′ has a small influence on what happens in situation b”.
Basically, the “influence measure of b” tells you which situations are important for influencing what happens at b. If the influence measure of b has 0.03 measure on b′, then that means that switching your action at b′ from 0 to 1 (or vice-versa) shifts the probability of which observation comes after b by 3 percentage points.
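Here’s a small sketch of how you’d pull the signed influence measure, and then the influence measure, out of an affine environment in the two-action/two-observation case. The particular environment (its constant and its hidden signed measure) is invented purely for illustration:

```python
import numpy as np

# Two actions, two observations. Histories are indexed 0..K-1.
# affine_env_at_b(pi) = probability of observation 1 after the fixed history b,
# as an *affine* function of the policy pi (pi[j] = prob of action 1 at history j).
K = 4
signed = np.array([0.05, -0.02, 0.0, 0.10])   # hidden signed influence measure (made up)
const  = 0.30                                  # prob of obs 1 under the all-0 policy

def affine_env_at_b(pi):
    return const + signed @ np.asarray(pi)

# Recover the signed influence measure of b by switching one action at a time
# from 0 to 1, starting from the all-0 policy (affineness means the starting
# point doesn't matter).
zero_policy = np.zeros(K)
recovered_signed = np.array(
    [affine_env_at_b(np.eye(K)[j]) - affine_env_at_b(zero_policy) for j in range(K)]
)
influence_measure = np.abs(recovered_signed)   # flip negative parts to positive

print(recovered_signed)    # recovers [0.05, -0.02, 0.0, 0.1]: direction of influence
print(influence_measure)   # [0.05, 0.02, 0.0, 0.1]: magnitude of influence
```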
Early Precommitments, Permanent Stupidity
So, if UDT is locking in its decisions at the start of time, will it be permanently bound to the incredibly crappy decision making of its initial self?
Well… it’s complicated. Different proposals may or may not act in that way. Influence measures are definitely the right tool for thinking about whether or not it happens, though.
Let’s use $m_b(b')$ to denote “the influence of actions at b′ on what happens at b”, and assume we’ve got some “baseline” probability distribution $\mu:\Delta(B^n)$, a probability distribution over histories. $\mu(b)$, when b is shorter than length n, is “the probability of getting some history that has b as a prefix”. As an approximation, we can say that the effect on overall expected utility via b of our action at b′ is, at most, $\mu(b)\cdot m_b(b')$. Messing with what happens in a low-probability situation has negligible effect. Messing with what happens in a high-probability situation has more effect.
Using ∅ to denote the empty string, our approximation for whether your decision is dominated by “make things go well ordinarily in the history b′ that I’m in”, or “gotta make the start of time go well”, is whether $\mu(b')>m_\emptyset(b')$ or not. If the probability of your situation is a lot higher than your influence on the start of time, that’s a proxy for “the dominating effect is making the situation I’m in go well”. If the influence measure is a lot higher, that’s a proxy for “the dominating effect is my retroactive influence on the start of time”.
So, you’ve got two measures, one of “influence on the start of time” which is distributed on the tree nodes, and the other measure is the flow of probability-juice through the tree, and where the influence measure on a tree node exceeds the amount of probability-juice flowing through that node, your decision there will tend to be controlled by retroactive effects.
Well, technically, you’re summing up all of the influence measures, not just the one at the start of time. But you’re weighting influence measures from low-probability histories less. So, uh… it’s kind of a huge mess. But in general, probability-exceeding-influence is a decent guiding star for whether your decision gets controlled by retroactive effects. Using $h_{1:n}$ to denote “the first n observations of infinite history h”, if you were able to prove something like “for every infinite history h and finite history b…”
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n}\frac{m_b(h_{1:i})}{\mu(h_{1:i})}=0$$
That would mean that for any finite amount of “idiot past-selves making stupid precommitments”, in the long-time-limit, the frequency of you deferring to their idiocy limits to 0.
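As a toy illustration of that criterion (with completely made-up influence and probability sequences), the thing you’d check is whether the running average of the influence-to-probability ratios heads to 0:

```python
import numpy as np

# mu[i] ~ probability of the first i observations of some infinite history h;
# m[i] ~ their influence on what happens at b (the start of time, say).
# Both sequences are invented so that influence shrinks faster than probability.
N = 10_000
i = np.arange(1, N + 1)
mu = 0.5 ** np.minimum(i, 20)           # probability-juice along the history
m  = 0.5 ** np.minimum(i, 20) / i       # influence on b: decays faster

ratios = m / mu                          # m_b(h_{1:i}) / mu(h_{1:i})
running_avg = np.cumsum(ratios) / i      # (1/n) * sum of the ratios

print(running_avg[[9, 99, 999, 9999]])   # ~[0.29, 0.052, 0.0075, 0.00098]: heads to 0
```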
There’s a bunch of possibly related desiderata, like “with probability 1, any history sampled from the true environmental probability-distribution has this property”, or demanding that the above quantity limit to 0 fast enough, or bounding the expected size of the sum… You’ve got a bunch of variants for how to formalize “outgrowing the stupidity of your early-self”, but the key part is that you’re able to talk mathematically about it at all.
So, with influence measures, it’s finally possible to formalize desiderata and nice properties of the form “must outgrow early bad precommitments”, as well as being able to point to potential UDT designs and go “from pondering influence measures, I have determined that this specific design is very vulnerable to locking in stupid decisions at the start of time”, or “from pondering influence measures, I have determined that this algorithm outgrows its initial foolishness in a reasonable amount of time”.
No Traps?
There’s a nonobvious property that strong locally affine environments have, that’s a UDT equivalent of the no-traps assumption. Let’s take that classic example of a UDT trap, the environment that rewards you if you follow a specific policy, and punishes you harshly if you ever deviate from it. If such an environment is part of your prior, you could just get locked into following that policy for the rest of time, and you can’t even update out of it because you’ll have unfalsifiable beliefs about how you’re in a counterfactual and deviating would actually make things go badly at the start of time. Unfalsifiable beliefs should only be able to make bounded demands of you. Now, for this environment that mandates you follow a certain policy, you can differentiate it to get a locally affine thing. Turning that thing into an influence measure, we get that the affine environment is very sensitive to policy changes, and the influence measure is high everywhere. Even a little derailment anywhere has drastic effects on expected utility, and this is how that environment exerts its tyrannical grasp on the decision of the agent no matter the situation the agent is in. No matter how many gigabytes of data the agent digests, it still thinks that its decision has lots of influence on what happens at the start of time. Best to play it safe and stick to the initial plan.
And so, we might demand a sort of fairness condition, or no-traps assumption, that’s something like “the influence measure of any particular spot has to be like a probability distribution, in that it sums to 1 or less over all the other spots. It can’t be high everywhere, or else everywhere will be affected by the tyranny of having to optimize that particular spot”.
As it turns out, strong locally affine environments fulfill this particular flavor of no-trap assumption, while locally affine environments in general might not. Intuitively, the reason why is that, if the influence measure is large enough for some situation h, there will always be some policy that forces the probability of observations after h to be extreme enough that it isn’t an actual probability distribution, just a vector whose entries sum up to 1. If decisions everywhere matter greatly for whether some observation happens, there’s a policy that forces the probability of that observation to be above 1, or negative. Therefore, if there’s a guarantee that the probability of upcoming observations is always an actual probability distribution, the total influence measure has to be low-ish.
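A quick numerical sketch of that argument in the two-observation case (both example environments are made up): if the total influence measure is too large, the policy that pushes every decision in the same direction forces the “probability” of the observation outside [0,1].

```python
import numpy as np

# In the two-observation case, a *strong* locally affine environment must keep
# const + signed . pi inside [0,1] for every policy pi in [0,1]^K. The policy that
# plays 1 exactly where the signed measure is positive attains the maximum, and
# its mirror attains the minimum, so max - min = total influence must be <= 1.
def forces_a_non_probability(const, signed):
    pi_max = (signed > 0).astype(float)   # push probability of obs 1 as high as possible
    pi_min = (signed < 0).astype(float)   # push it as low as possible
    hi = const + signed @ pi_max
    lo = const + signed @ pi_min
    return hi > 1 or lo < 0               # True if some policy breaks out of [0,1]

ok_env  = np.array([0.3, -0.2, 0.1, 0.2])   # total influence 0.8: can be strong
bad_env = np.array([0.9, -0.6, 0.8, 0.4])   # total influence 2.7: cannot be strong

print(forces_a_non_probability(0.3, ok_env))   # False
print(forces_a_non_probability(0.3, bad_env))  # True
```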
As it turns out, this is dramatically stronger than what you need in one sense, and actually slightly weaker than what you need in another. Influence measure can sum up to infinity (for infinite time horizon) without too many ill consequences, as long as it has that “frequency limiting to 0 everywhere” property. The property which strong locally affine environments give you is a bit closer to “the expected number of times you defer to your past self is 1 or less”. For histories which you think are very improbable, the frequency of deferring to your past self might remain high forever.
So, yeah, influence measures are THE natural mathematical tool for analyzing “will UDT lock in stupid precommitments forever” questions.
Affineness Failure and the Surprise Game
Just because you have affineness for “probability of next observation”, doesn’t mean that you have affineness for expected utilities, which is what you need for UDT1.0 to be optimal.
Let’s define the Surprise Game, to give a toy example.
You can choose to receive a safe reward of 0.5 utility, or you can go fight a monster for resources. If you catch it unprepared, you can win the fight for 1 utility. If you catch it prepared, you lose the fight for 0 utility. The probability of the monster being prepared is the same as the probability of you picking the fight with the monster.
It’s locally affine in probabilities. The probability of safe reward vs fight varies linearly with your probabilities of action. The probability of winning vs losing varies linearly with your probabilities of action. However, the overall utility you get varies nonlinearly with the probability of picking a fight. If p is the probability of you fighting the monster, then your utility is $-p^2+0.5p+0.5$. This has an optimum at a 1-in-4 chance of fighting the monster, for $9/16$ utility.
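Spelling out where that polynomial and the 9/16 come from: with probability 1−p you take the safe 0.5, and with probability p you fight a monster that is prepared (and beats you) with probability p.
$$\mathbb{E}[U](p)=(1-p)\cdot 0.5+p\bigl((1-p)\cdot 1+p\cdot 0\bigr)=-p^2+0.5p+0.5$$
$$\frac{d}{dp}\mathbb{E}[U](p)=-2p+0.5=0\;\Rightarrow\;p=\tfrac{1}{4},\qquad \mathbb{E}[U]\bigl(\tfrac{1}{4}\bigr)=-\tfrac{1}{16}+\tfrac{1}{8}+\tfrac{1}{2}=\tfrac{9}{16}$$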
So, even though everything’s locally affine, the overall problem you’re in doesn’t have expected utility being affine w.r.t. your actions, and fancier variants of this can have your optimal-choice-of-action-at-a-spot get entangled with what your policy is doing somewhere else. So UDT1.0 isn’t optimal even if you assume local affineness.
This sort of setup, where what you’re doing elsewhere does matter, recurs again and again. If you know a “baseline” overall policy in a general policy selection environment, it lets you differentiate and get a locally affine environment. If you’re in Vanessa’s InfraBayes setting, the “dynamically consistent update” that gets “updateless” behavior correct requires that you know how your policy behaves in the branches you didn’t end up in. If you’re choosing between a safe option and one that gives you a difficult-to-exploit opportunity, the best choice depends on how sure you are that you’ll exploit that opportunity. If you’re in a Parfit’s Hitchhiker scenario, technically your decision about whether to repay your ride back to town depends on the odds you’d assign to surviving the desert and becoming king of some awesome desert tribe.
Even though, for single observations, you can ignore what the other variants of you are doing, that doesn’t mean you can ignore your overall policy in the broader picture.
So we’re back to UDT1.1, the “compute the optimal policy at the start of time” option, right? We’d need affineness in expected utilities for UDT1.0 to be optimal. And we don’t have that. In fancier variants of the Surprise Game, the equation giving expected utility for probabilities of actions is gonna be some ridiculous polynomial. And then you’ve gotta find the global optimum of that polynomial.
Policy Gradient Ascent
Correct. Which is why we’ll set our sights a little bit lower than that. We’ll just figure out how to gradient-ascent to a pretty good policy. If there’s a big multi-player game going on, where everyone in the game (the yous in different circumstances) has the same utility function, but can’t coordinate too well, and can make observations and guesses about what everyone else will do (what the global policy is), and try to act to repeatedly make things a little better than what they expect will happen and know that everyone else is doing that too… It might be enough. Policy space is high-dimensional. Gradient stuff can be expected to work well in practice.
So let’s see how UDT1.0 would handle it. It’s important to note that the exact same trick of “if it’s not affine, take a derivative and it suddenly is!” still works on our (non-affine) function mapping policies to expected utilities! If we’ve got a baseline guess at what we do in various different circumstances, we should be able to go “alright, if I do a little bit more of this action here, it has these effects on expected utility...”
So, given a “baseline policy” (probability of fighting the monster), we take a derivative, and figure out whether it’s better to play a tiny little bit more of “go fight it”, or “play it safe”. Basically, gradient ascent in action space, w.r.t. expected utility. If you’re in a situation, figure out the effects of playing a little bit more/less of various actions than you’d otherwise do by default. Your estimate of your global policy moves a bit more in that direction. Repeat.
Of course, you might not hit the global optimum this way, but if there’s a ridiculously huge and high-dimensional policy-space to wander through, then gradient ascent might be expected to get somewhere pretty dang good.
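Here’s what that loop looks like on the Surprise Game, as a minimal sketch (the step size, iteration count, and finite-difference gradient are arbitrary illustrative choices):

```python
import numpy as np

def expected_utility(p):
    # Surprise Game: p = probability of fighting the monster.
    return -p**2 + 0.5 * p + 0.5

def grad(p, eps=1e-6):
    # Numerical derivative of expected utility w.r.t. the policy.
    return (expected_utility(p + eps) - expected_utility(p - eps)) / (2 * eps)

p = 0.9            # a lousy "baseline" policy: almost always picks the fight
for _ in range(2000):
    p = np.clip(p + 0.01 * grad(p), 0.0, 1.0)   # nudge the policy uphill, stay in [0,1]

print(p, expected_utility(p))   # converges to ~0.25 and ~0.5625 (= 9/16)
```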
Really Dumb Gradient Ascent
But there’s another option which looks dramatically more idiotic and actually manages to do surprisingly well. And it involves computing the gradient once, instead of many times. And it’s even deterministic instead of probabilistic! No need for random numbers to decide what to do, if you don’t have them.
It is as follows. Compute the gradient once and deterministically play the action the gradient most recommends you playing more of. If it looks better to play a tiny little bit more of “fight the monster”, then just go fight the monster!
Huh?
Well, here’s why it works. You’ll be in a situation where, if you think you have a less than 1-in-4 chance of fighting the monster, you deterministically pick a fight, and if you think it’s more than 1-in-4, you deterministically play it safe.
This is really similar to those diagonalization sentences in logical inductors, like “I have probability less than 1-in-4”. If the probability is less than that, you wanna bet on the sentence. If it’s more than that, you wanna bet against the sentence. And so your beliefs converge to hovering right around the 1-in-4 probability, and a quarter of the time, it’s lower, and three quarters of the time, it’s higher, and you don’t know what you’re going to do. Basically, deterministically randomizing.
So, if you’re, say, using a logical inductor to predict the probabilities of various histories, and expected utilities, and what you’d do in various circumstances, the inductor traders will go “hang on, if this-and-such action is the best one to play a bit more of, it’ll get played, which will alter the history distribution and other stuff… which would then mean it isn’t the most highly recommended action… hang on, we’ve gotta have beliefs where, taking into account that the agent will pick the best action given those beliefs, those beliefs are accurate”, and you should end up basically randomizing at a local optimum in policy space if the traders are competent/your logical inductor is in good working order.
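Here’s a crude simulation of that dynamic on the Surprise Game. The “beliefs” here are just the empirical frequency of the agent’s own past choices, which is a stand-in assumption rather than the logical-inductor machinery; the point is that deterministically best-responding to beliefs about yourself ends up hovering at the p = 1/4 local optimum:

```python
# Each round: see which action the gradient says to play a bit more of, at the
# currently-believed policy, and just play that action outright. Beliefs about
# your own policy are the empirical frequency of your past (deterministic) choices.
def grad(p):
    return -2 * p + 0.5          # derivative of the Surprise Game's expected utility

fights, believed_p = 0, 0.5      # believed_p: arbitrary initial guess about own policy
for t in range(1, 100_001):
    fight = grad(believed_p) > 0           # "a bit more fighting looks good" -> fight
    fights += int(fight)
    believed_p = fights / t                # update beliefs about own behaviour

print(believed_p)                # hovers around 0.25: deterministic play, but it mixes
```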
So, our version of UDT1.0 suitable for small dumb agents will probably be something like “have beliefs about the probability distribution over history, and your own policy, and how you (acausally, and retroactively, and causally) influence events, and from this work out the best action to play a tiny little bit more of where you are. Then just do that action deterministically”. This is because, if the belief part is in good working order, this process should home in on local optima in policy space. Which is really really high-dimensional, so the usual arguments about gradient ascent not getting stuck should hold, and it might be expected to perform well in practice.
Takeaways
We can guess at which sorts of environments are suitable for UDT, ie, make simplifying assumptions to get beyond the standard double-exponential bullshit. The one that looked good was that the function mapping policies to probabilities of observations should always be affine. This cuts down the complexity of environments from double-exponential to exponential, with similar savings in time complexity of figuring out what to do. Thus, for bounded agents, phenomena appearing in this more restricted setting are worthy of study and likely to apply to them. Also, general environments can be differentiated (if you know your policy) to cram them into the locally-affine environment setting.
This local affineness assumption lets you formalize a notion called “influence measures”, which tell you how much a decision in one circumstance affects what happens in another circumstance. The notion of “influence measures” is absolutely essential for formalizing the standard worries about UDT locking in stupid behavior at the start of time, and seeing whether they hold or not, and formulating desiderata related to that.
However, local affineness in probabilities doesn’t get you the sort of affineness you’d need for UDT1.0 to be globally optimal, as evidenced by the toy “Surprise Game” we looked at. So we may try to do a sort of gradient ascent in policy space, to get a “good enough” solution with the aid of the high-dimensional nature of policy space. In fact, even a really dumb form of gradient ascent (namely, “figure out which action is best to play a little bit more of, where you are, and then deterministically play that”), can be expected to ascend up to locally optimal behavior over time as long as the epistemics are doing alright and realize this is happening.