In UDT2, when you’re in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I’ll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.
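As a minimal sketch of this loop (in Python; the representation of X as (situation, probability) pairs, the finite candidate_policies set standing in for a bounded policy search, and all names here are illustrative assumptions, not anything canonical to UDT2):

```python
def udt2_decide(Y, U, X, candidate_policies):
    """Minimal sketch of the three steps above.

    X stands in for the earlier epistemic state, given here as a list of
    (situation, probability) pairs; each candidate policy maps a situation
    to an action; U(action, situation) scores an outcome.  The bounded
    "small amount of time" is abstracted into the finite candidate set.
    """
    # Steps 1-2: from the non-updated agent's perspective (X), pick the
    # policy P with the highest expected utility U.
    def expected_U(P):
        return sum(prob * U(P(situation), situation) for situation, prob in X)

    best_P = max(candidate_policies, key=expected_U)

    # Step 3: the updated agent simply runs the chosen policy on its actual,
    # better-informed epistemic state Y.
    return best_P(Y)
```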
The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility function. That seems… suspicious. If you’re updating so far back that you don’t know who or where you are, how are you meant to know what you care about?
What happens if the non-updated agent doesn’t get given your utility function? On its face, that seems to break its ability to decide which policy P to commit to. But perhaps it could instead choose a policy P(Y,U) which takes as input not just an epistemic state, but also a utility function. Then in step 2, the non-updated agent needs to choose a policy P that maximizes, not the agent’s current utility function, but rather the utility functions it expects to have across a wide range of future situations.
Problem: this involves aggregating the utilities of different agents, and there’s no canonical way to do this. Hmm. So maybe instead of just generating a policy, the non-updated agent also needs to generate a value learning algorithm V, which maps from an epistemic state Y to a utility function U, in a way which allows comparison across different Us. Then the non-updated agent tries to find a pair (P, V) such that P(Y) maximizes V(Y) on the distribution of Ys predicted by X.
EDIT: no, this doesn’t work. Instead I think you need to go back, not just to a previous epistemic state X, but also to a previous set of preferences U’ (which include meta-level preferences about how your values evolve). Then you pick P and V in order to maximize U’.
Now, it does seem kinda wacky that the non-updated agent can maybe just tell you to change your utility function. But is that actually any weirder than it telling you to change your policy? And after all, you did in fact acquire your values from somewhere, according to some process.
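A sketch of the amended selection step from the EDIT above (again with illustrative types and names of my own choosing: X as (Y, probability) pairs over predicted futures, finite candidate sets in place of a bounded search):

```python
import itertools

def choose_policy_and_value_learner(X, U_prime, candidate_policies, candidate_value_learners):
    """Sketch of the amended step.

    The non-updated agent is handed its earlier preferences U_prime
    (including meta-level preferences over how values should evolve) rather
    than the current utility function.  X is a list of (Y, probability)
    pairs over predicted future epistemic states; V maps a future state Y
    to a utility function; P maps (Y, V(Y)) to an action; U_prime scores
    the resulting (action, learned values, situation) triple.
    """
    def score(P, V):
        return sum(prob * U_prime(P(Y, V(Y)), V(Y), Y) for Y, prob in X)

    # Pick the (P, V) pair that U_prime rates best across X's predicted futures.
    return max(itertools.product(candidate_policies, candidate_value_learners),
               key=lambda PV: score(*PV))
```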
Overall, I haven’t thought about this very much, and I don’t know if it’s already been discussed. But three quick final comments:
This brings UDT closer to an ethical theory, not just a decision theory.
In practice you’d expect P and V to be closely related. In fact, I’d expect them to be inseparable, based on arguments I make here.
Overall the main update I’ve made is not that this version of UDT is actually useful, but that I’m now suspicious of the whole framing of UDT as a process of going back to a non-updated agent and letting it make commitments.
People back then certainly didn’t think of changing preferences.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U. Needless to say, this “instrumentalization” of moral deliberation is not how real agents work, and it leads to getting Pascal’s mugged by the world in which you care a lot about easy things.
It’s more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn’t completely miss the importance of reward in shaping your values, but it’s certainly very different to how frugally computable agents do it.
I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates. Otherwise you get mugged everywhere. And that’s not how real agents behave.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U.
But you need some mechanism for actually updating your beliefs about U, because you can’t empirically observe U. That’s the role of V.
leads to getting Pascal’s mugged by the world in which you care a lot about easy things
I think this is fine. Consider two worlds:
In world L, lollipops are easy to make, and paperclips are hard to make.
In world P, it’s the reverse.
Suppose you’re a paperclip-maximizer in world L. And a lollipop-maximizer comes up to you and says “hey, before I found out whether we were in L or P, I committed to giving all my resources to paperclip-maximizers if we were in P, as long as they gave me all their resources if we were in L. Pay up.”
UDT says to pay here—but that seems basically equivalent to getting “mugged” by worlds where you care about easy things.
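To see why the ex-ante calculation favours paying, here is a toy version with made-up numbers (equal priors on L and P; a fixed resource pile makes 10 units of the easy good or 1 unit of the hard good). The specific values are illustrative only; the point is just that the committed swap dominates from the pre-update perspective UDT evaluates from:

```python
# Toy ex-ante numbers (purely illustrative): both worlds equally likely, and
# resources convert into 10 units of the "easy" good or 1 unit of the "hard" one.
p_L, p_P = 0.5, 0.5
easy, hard = 10, 1

# Expected paperclips for the paperclip-maximizer, evaluated before learning
# which world it is in.  With the swap: in world P it keeps its own resources
# and also receives the lollipop-maximizer's; in world L it hands everything over.
with_deal    = p_P * (2 * easy) + p_L * 0   # = 10.0
without_deal = p_P * easy + p_L * hard      # = 5.5

print(with_deal, without_deal)
```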
But you need some mechanism for actually updating your beliefs about U
Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don’t need to change UDT in any way.
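A minimal sketch of that move (representing the epistemic state as a dict of observations is my own illustrative assumption):

```python
def with_value_channel(Y, V):
    """Sketch of the "extra observation channel" move: run the fixed
    value-learning algorithm V on the epistemic state Y and append its
    output to the ordinary observations, so an otherwise unchanged
    UDT policy can simply condition on it."""
    augmented = dict(Y)
    augmented["estimated_U"] = V(Y)  # the computed number "in the corner of your eye"
    return augmented
```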
UDT says to pay here
(Let’s not forget this depends on your prior, and we don’t have any privileged way to assign priors to these things. But that’s a tangential point.)
I do agree that there’s not any sharp distinction between situations where it “seems good” and situations where it “seems bad” to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”. You start to see how your abstractions might break, and how you can’t get any satisfying notion of “complete updatelessness” (that doesn’t go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.
Yep, but you can just treat it as another observation channel into UDT.
Hmm, I’m confused by this. Why should we treat it this way? There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm. That’s the role V is playing.
It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”.
Obviously I am not arguing that you should agree to all moral muggings. If a pain-maximizer came up to you and said “hey, looks like we’re in a world where pain is way easier to create than pleasure, give me all your resources”, it would be nuts to agree, just like it would be nuts to get mugged by “1+1=3”. I’m just saying that “sometimes you get mugged” is not a good argument against my position, and definitely doesn’t imply “you get mugged everywhere”.
There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.
Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn’t seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That’s, again, why this only feels like a good model of “completely crystallized rigid values”, and not of “organically building them up slowly, while my concepts and planner module also evolve, etc.”.[1]
definitely doesn’t imply “you get mugged everywhere”
Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?
Because anything that is doing pure EV maximization “gets mugged everywhere”. Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets. Of course if you don’t have such “extreme” beliefs it doesn’t, but then we’re not talking about decision-making, but rather belief-formation. You could say “I will just do EV maximization, but never have extreme beliefs that lead to suspicious-looking behavior”, but that’d be hiding the problem under belief-formation, and doesn’t seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.
To be clear, V can be a very general algorithm (like “run a copy of me thinking about ethics”), so that this doesn’t “feel like” having rigid values. Then I just think you’re carving reality at the wrong spot. You’re ignoring the actual dynamics of messy value formation, hiding them under V.
At the time of UDT2, the background assumption was that agents should maintain an unchanging preference, which is separate from knowledge. One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that. Going back to a previous epistemic state is a way of preserving preference from that epistemic state; the “current” utility function is considered a bug and doesn’t do anything if UDT is adopted. The non-updated agent can in principle consider the information you currently have as one of the possibilities when formulating the general policy for all possibilities, though being bounded it won’t do a very good job.
Traditionally, UDT1.1 wants to make its decisions from very little knowledge and to apply the policy to all situations, always. A more pragmatic approach is to make decisions from modestly less knowledge and to scope the policy to the middle-term future. Some form of this is useful for many thought experiments where the environment or other players also have the little knowledge our agent makes its decisions from, and so could know the policy the agent decides on before they need to prepare for it or make predictions about it.
The problem is commitment races (as in the game of chicken), where everyone wants to decide earlier and force the others to respond. But there is a need to remain bounded in making decisions, both to personally compute them and to make it possible for others to anticipate them and to coordinate. This creates a more reasonable equilibrium, motivating decisions from a less ignorant epistemic state that have a better chance of being relevant to the current situation, in balance with trying to decide from a more ignorant epistemic state where a general policy would enable more strategicness across possibilities. UDT1.1 can’t find such balance, but it’s possible that something UDT2-shaped might.
One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that.
I think there’s an ambiguity here. UDT makes the agent stop considering updated-away possibilities, but I haven’t seen any discussion of UDT which suggests that it stops caring about them in principle (except for a brief suggestion from Paul that one option for UDT is to “go back to a position where I’m mostly ignorant about the content of my values”). Rather, when I’ve seen UDT discussed, it focuses on updating or un-updating your epistemic state.
I don’t think the shift I’m proposing is particularly important, but I do think the idea that “you have your prior and your utility function from the very beginning” is a kinda misleading frame to be in, so I’m trying to nudge a little away from that.
UDT makes the agent stop considering updated-away possibilities, but I haven’t seen any discussion of UDT which suggests that it stops caring about them in principle
UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that’s not using something UDT-like) wouldn’t be able to do that in any circumstance, and so would be functionally indistinguishable from an agent that has different preferences or undefined preferences for those possibilities. Not caring about them seems like an apt informal description (even as this is compatible with keeping the same utility function outside the event of current knowledge). In a similar way, we could say that after updating, an agent either changes their probability distribution or keeps the original prior.
I do think the idea that “you have your prior and your utility function from the very beginning” is a kinda misleading frame to be in
Historically it was overwhelmingly the frame until recently, so it’s the correct frame for interpreting the intended meaning of texts from that time. This is a simplifying assumption that still leaves many open questions about how to make decisions in sufficiently strange situations (where mere models of behavior make these strange situations ubiquitous in practice). When an agent doesn’t know its own preference and needs to do something about that, it’s an additional complication that usually wasn’t introduced.
UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that’s not using something UDT-like) wouldn’t be able to do that in any circumstance
Agreed; apologies for the sloppy phrasing.
Historically it was overwhelmingly the frame until recently, so it’s the correct frame for interpreting the intended meaning of texts from that time.
I agree, that’s why I’m trying to outline an alternative frame for thinking about it.
Some more thoughts: we can portray the process of choosing a successor policy as the iterative process of making more and more commitments over time. But what does it actually look like to make a commitment? Well, consider an agent that is made of multiple subagents, that each get to vote on its decisions. You can think of a commitment as basically saying “this subagent still gets to vote, but no longer gets updated”—i.e. it’s a kind of stop-gradient.
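One toy way to picture this (the voting-by-summed-weights mechanics here are my own illustrative choice, not a claim about how such agents actually aggregate):

```python
from dataclasses import dataclass

@dataclass
class Subagent:
    """Illustrative subagent: it always gets to vote, but a commitment
    freezes its weights so no further updates reach it (a stop-gradient)."""
    weights: dict
    committed: bool = False

    def vote(self, option):
        return self.weights.get(option, 0.0)

    def update(self, option, feedback, lr=0.1):
        if self.committed:          # still votes above, but no longer learns
            return
        self.weights[option] = self.weights.get(option, 0.0) + lr * feedback

def decide(subagents, options):
    # Every subagent, committed or not, contributes its vote.
    return max(options, key=lambda o: sum(s.vote(o) for s in subagents))
```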
Two interesting implications of this perspective:
The “cost” of a commitment can be measured both in terms of “how often does the subagent vote in stupid ways?”, and also “how much space does it require to continue storing this subagent?” But since we’re assuming that agents get much smarter over time, probably the latter is pretty small.
There’s a striking similarity to the problem of trapped priors in human psychology. Parts of our brains basically are subagents that still get to vote but no longer get updated. And I don’t think this is just a bug—it’s also a feature. This is true on the level of biological evolution (you need to have a strong fear of death in order to actually survive) and also on the level of cultural evolution (if you can indoctrinate kids in a way that sticks, then your culture is much more likely to persist).
The (somewhat provocative) way of phrasing this is that trauma is evolution’s approach to implementing UDT. Someone who’s been traumatized into conformity by society when they were young will then (in theory) continue obeying society’s dictates even when they later have more options. Someone who gets very angry if mistreated in a certain way is much harder to mistreat in that way. And of course trauma is deeply suboptimal in a bunch of ways, but so too are UDT commitments, because they were made too early to figure out better alternatives.
This is clearly only a small component of the story but the analogy is definitely a very interesting one.
More thoughts: what’s the difference between paying in a counterfactual mugging based on:
Whether the millionth digit of pi (5) is odd or even
Whether or not there are an infinite number of primes?
In the latter case knowing the truth is (near-)inextricably entangled with a bunch of other capabilities, like the ability to do advanced mathematics. Whereas in the former it isn’t. Suppose that before you knew either fact you were told that one of them was entangled in this way—would you still want to commit to paying out in a mugging based on it?
Well… maybe? But it means that the counterlogical of “if there hadn’t been an infinite number of primes” is not very well-defined—it’s hard to modify your brain to add that belief without making a bunch of other modifications. So now Omega doesn’t just have to be (near-)omniscient, it also needs to have a clear definition of the counterlogical that’s “fair” according to your standards; without knowing that it has that, paying up becomes less tempting.
Individually, logical counterfactuals don’t seem very coherent. This is related to the “I’m an algorithm” vs. “I’m a physical object” distinction of FDT. When you are an algorithm considering a decision, you want to mark all sites of intervention/influence where the world depends on your behavior. If you only mark some of them, then you later fail at the step where you ask what happens if you act differently: you obtain a broken counterfactual world where only some instances of the fact of your behavior have been replaced and not others.
So I think it makes a bit more sense to ask where specifically your brain depends on a fact, to construct an exhaustive dependence of your brain on the fact, before turning to particular counterfactual content for that fact to be replaced with. That is, the dependence of a system on a fact, the way it varies with the fact, seems potentially clearer than individual counterfactuals of how that system works if the fact is set to be a certain way. (To make a somewhat hopeless analogy: fibration instead of individual fibers, and it shouldn’t be a problem that all fibers are different from each other. Any question about a counterfactual should be reformulated into a question about a dependence.)
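A very rough sketch of the distinction (representing the system as a dict and the fact-sites as keys is purely an illustrative assumption): the dependence is exposed as a function of the fact, with every site varied together, rather than one hand-patched counterfactual world.

```python
def make_dependence(system, fact_sites):
    """Sketch of "dependence rather than individual counterfactuals":
    instead of constructing one patched counterfactual world, expose the
    system as a function of the fact by routing *every* site where the
    fact enters through a single parameter."""
    def dependence(fact_value):
        patched = dict(system)
        for site in fact_sites:          # replace all instances consistently,
            patched[site] = fact_value   # never just some of them
        return patched
    return dependence
```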