I actually think this type of change is very common—because individuals’ identities are very strongly interwoven with the identities of the groups they belong to
Mm, I’ll concede that point. I shouldn’t have used people as an example; people are messy.
Literal gears, then. Suppose you’re studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a specific gear’s dynamics didn’t change — locally, it was as correct as it could ever be.
And if you had a terminal utility function over that gear (e.g., “I want it to spin at a rate of one rotation per minute”), that utility function won’t change in light of your model expanding, either. Why would it?
the whole point of deontology is rejecting the utility-maximization framework. Instead, deontologists have a bunch of rules and heuristics
… which can be represented as utility functions. Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it’s perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.
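To make the “reverse-softmaxing” step concrete, here’s a minimal sketch (the action set and numbers are invented for illustration): a rule-as-distribution maps to utilities via log-probabilities, which inverts softmax up to an additive constant.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def reverse_softmax(p):
    # Inverse of softmax up to an additive constant: u_i = log p_i + C.
    return np.log(p)

# A deontological rule ("killing is bad") as a distribution over actions
# that makes the forbidden action very unlikely:
actions = ["refrain", "kill", "walk away"]
p_rule = np.array([0.495, 0.01, 0.495])
u_rule = reverse_softmax(p_rule)

# Softmaxing the recovered utilities reproduces the rule's "predictions",
# so the constraint and the utility function encode the same behavior.
assert np.allclose(softmax(u_rule), p_rule)
```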
See Friston’s predictive-processing framework in neuroscience, plus this (and that comment).
Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their symbolic intelligence, but similar dynamics are still at play “under the hood”.
It seems like your answer is basically just that it’s a design flaw in the human brain, of allowing value drift
Well, not a flaw as such; a design choice. Humans are trained in an on-line regime, our values are learned from scratch, and… this process of active value learning just never switches off (although it plausibly slows down with age, see old people often being “set in their ways”). Our values change by the same process by which they were learned to begin with.
Tangentially:
Nostalgebraist has argued that Friston’s ideas here are either vacuous or a nonstarter, in case you’re interested.
Yeah, I’m familiar with that view on Friston, and I shared it for a while. But it seems there’s a place for that stuff after all. Even if the initial switch to viewing things probabilistically is mathematically vacuous, it can still be useful: if viewing cognition in that framework makes it easier to think about (and thus theorize about).
Much like changing coordinates from Cartesian to polar is “vacuous” in some sense, but makes certain problems dramatically more straightforward to think through.
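As a toy illustration of that analogy: the coordinate change is information-preserving (hence “vacuous”), yet some operations become trivial in the new representation.

```python
import math

def to_polar(x, y):
    # A pure change of representation: invertible, no information gained.
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

# Round trip shows the "vacuousness": nothing about the point changes.
x, y = 3.0, 4.0
r, theta = to_polar(x, y)
x2, y2 = to_cartesian(r, theta)
assert math.isclose(x, x2) and math.isclose(y, y2)

# But rotation about the origin, a matrix multiply in Cartesian
# coordinates, is a one-line shift in polar coordinates:
def rotate(r, theta, delta):
    return r, theta + delta
```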
(drafted this reply a couple months ago but forgot to send it, sorry)
your low-level model of a specific gear’s dynamics didn’t change — locally, it was as correct as it could ever be.
And if you had a terminal utility function over that gear (e.g., “I want it to spin at a rate of one rotation per minute”), that utility function won’t change in light of your model expanding, either. Why would it?
Let me list some ways in which it could change:
Your criteria for what counts as “the same gear” change as you think more about continuity of identity over time. Once the gear starts wearing down, this will affect what you choose to do.
After learning about relativity, your concepts of “spinning” and “minutes” change, as you realize they depend on the reference frame of the observer.
You might realize that your mental pointer to the gear you care about identified it in terms of its function, not its physical position. For example, you might have cared about “the gear that was driving the piston continuing to rotate”, but then realize that a different gear than you thought is driving the piston.
These are a little contrived. But so too is the notion of a value that’s about such a basic phenomenon as a single gear spinning. In practice almost all human values are (and almost all AI values will be) focused on much more complex entities, where there’s much more room for change as your model expands.
Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it’s perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.
This doesn’t actually address the problem of underspecification, it just shuffles it somewhere else. When you have to choose between two bad things, how do you do so? Well, it depends on which probability distributions you’ve chosen, which have a number of free parameters. And it depends very sensitively on free parameters, because the region where two deontological rules clash is going to be a small proportion of your overall distribution.
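A toy numerical sketch of that sensitivity (all numbers hypothetical): represent each rule as a distribution over two actions, combine them by summing their log-probabilities, and note that the verdict in the clash region flips under a tiny change to one tail parameter.

```python
import numpy as np

def combine(p_a, p_b):
    # Sum log-probabilities (i.e., multiply the two "rule" distributions)
    # and renormalize; equivalently, add the implied utility functions.
    u = np.log(p_a) + np.log(p_b)
    e = np.exp(u - u.max())
    return e / e.sum()

# Two rules over the actions [lie to save a life, tell a fatal truth]:
no_lying   = np.array([0.02, 0.98])   # "lying is bad"
no_killing = np.array([0.99, 0.01])   # "letting someone die is bad"
assert combine(no_lying, no_killing)[0] > 0.5   # verdict: lie

# Nudge one tail parameter and the verdict in the clash region flips:
no_lying_strict = np.array([0.005, 0.995])
assert combine(no_lying_strict, no_killing)[0] < 0.5   # verdict: tell truth
```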
If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the “value translation” phenomenon from the “ontology crisis” phenomenon.
Shifts in the agent’s model of what counts as “a gear” or “spinning” violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear’s role in the whole mechanism)?
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value v(x), and it doesn’t run into direct conflict with any other values you have, and your model of x isn’t wrong at the abstraction level it’s defined at, you’ll never want to change v(x).
You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position
That’s the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over “the gear that was driving the piston”. After you learn it’s a different gear from the one you thought, that value isn’t updated: you just naturally shift it to the real gear.
Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, the other to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money… but then you realize that was a mistake, and that your friend’s number is actually B, so you send the money there. That didn’t involve any value-related shift.
I’ll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which “individual humans” are defined. However, there are also:
Some higher-level dynamics you’re not accounting for, like the impact your friend’s job has on the society.
Some lower-level dynamics you’re unaware of, like the way your friend’s mind is implemented at the levels of cells and atoms.
My claim is that, unless you have terminal preferences over those other levels, learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.
Granted, that’s an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend’s job is net-harmful at the societal level, that’ll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend’s job is hurting them. Less realistically, you may have some preferences over cells, and you may want to… convince your friend to change their diet so that their cellular composition is more in line with your aesthetic, or something weird like that.
But if that isn’t the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won’t change it.
Much like you’ll never change your preferences over a gear’s rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism’s functionality or that gear’s atomic composition.
(I agree that it’s a pretty contrived setup, but I think it’s very valuable to tease out the specific phenomena at play – and I think “value translation” and “value conflict resolution” and “ontology crises” are highly distinct, and your model somewhat muddles them up.)
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear’s role in the whole mechanism)?
I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it’s fine if the specific gear gets swapped out for a copy. I’m not assuming there’s a mistake anywhere, I’m just assuming you switch from caring about one type of property it has (physical) to another (functional).
In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.
I picture you saying “well, you could just not prioritize them”. But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing “this particular gear”, but you realize that atoms are constantly being removed and new ones added (implausibly, but let’s assume it’s a self-repairing gear) and so there’s no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of “the gear that moves the piston”—then valuing that could be much simpler.
Zooming out: previously you said
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value v(x), and it doesn’t run into direct conflict with any other values you have, and your model of x isn’t wrong at the abstraction level it’s defined at, you’ll never want to change v(x).
I’m worried that we’re just talking about different things here, because I totally agree with what you’re saying. (Although there may be higher-level dynamics you’re not tracking, or lower-level confusions; see the friend example above.) My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do), you’re going to systematize your values. And second, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, you’re going to systematize your values.
Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.
E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating
That seems more like value reflection, rather than a value change?
The way I’d model it is: you have some value v(x), whose implementation you can’t inspect directly, and some guess about what it is, P(v(x)). (That’s how it often works in humans: we don’t have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of “what if we swap the gear for a different one: which one would you care about then?”, your model of that value put the majority of probability mass on v1(x), which was “I value this particular gear”. But upon considering Q, your probability distribution over v(x) changed, and now it puts most probability on v2(x), defined as “I care about whatever gear is moving the piston”.
Importantly, that example doesn’t seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.
(Notably, that’s only possible in scenarios where you don’t have direct access to your values! Where they’re black-boxed, and you have to infer their internals from the outside.)
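That reading of value reflection can be sketched as ordinary Bayesian updating over hypotheses about one’s own (black-boxed) value; all numbers here are invented for illustration.

```python
# Hypotheses about the black-boxed value v(x), with a prior favoring v1:
prior = {
    "v1: this particular gear": 0.7,
    "v2: whichever gear moves the piston": 0.3,
}

# Likelihood of one's introspective answer to Q ("I'd care about the
# replacement gear") under each hypothesis:
likelihood = {
    "v1: this particular gear": 0.1,
    "v2: whichever gear moves the piston": 0.9,
}

# Bayes' rule: posterior is proportional to prior times likelihood.
unnorm = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}

# Probability mass shifts to v2: the model of the value changed,
# not the value itself.
assert posterior["v2: whichever gear moves the piston"] > 0.5
```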
the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations
Sounds right, yep. I’d argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.
insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values
Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.
I suppose I see what you’re getting at here, and I agree that it’s a real dynamic. But I think it’s less important/load-bearing to how agents work than the basic “value translation in a hierarchical world-model” dynamic I’d outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.
And I think it’s not even particularly strong in humans? “I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who’s easier to get along with” is something that definitely happens. But most instances of value extrapolation aren’t like this.