If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the “value translation” phenomenon from the “ontology crisis” phenomenon.
Shifts in the agent’s model of what counts as “a gear” or “spinning” violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear’s role in the whole mechanism)?
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value v(x), and it doesn’t run into direct conflict with any other values you have, and your model of x isn’t wrong at the abstraction level it’s defined at (although there may be higher-level dynamics you’re not tracking, or lower-level confusions; see the friend example below), you’ll never want to change v(x).
You might realize that your mental pointer to the gear you care about identified it in terms of its function, not its physical position.
That’s the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over “the gear that was driving the piston”. After you learn it’s a different gear from the one you thought, that value isn’t updated: you just naturally shift it to the real gear.
A plainer example: suppose you have two bank account numbers at hand, A and B. One belongs to your friend, the other to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money… but then you realize that was a mistake, and actually your friend’s number is B, so you send the money there. That didn’t involve any value-related shift.
I’ll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which “individual humans” are defined. However, there are also:
Some higher-level dynamics you’re not accounting for, like the impact your friend’s job has on society.
Some lower-level dynamics you’re unaware of, like the way your friend’s mind is implemented at the levels of cells and atoms.
My claim is that, unless you have terminal preferences over those other levels, learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.
Granted, that’s an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend’s job is net-harmful at the societal level, that’ll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend’s job is hurting them. Less realistically, you may have some preferences over cells, and you may want to… convince your friend to change their diet so that their cellular composition is more in line with your aesthetic, or something weird like that.
But if that isn’t the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won’t change it.
Much like you’ll never change your preferences over a gear’s rotation if your model of the mechanism at the level of gears is accurate – even if you’re failing to model the whole mechanism’s functionality or that gear’s atomic composition.
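To make that concrete, here’s a minimal sketch (in Python; every name and number below is invented for illustration, not anything from the discussion) of a value that only reads the gear-level state, and so is untouched when the agent fills in the other levels:

```python
# Toy illustration (all names invented): a value defined purely over the
# gear-level abstraction never reads the other levels, so learning them
# cannot change what it outputs.
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    gear_rotating: bool                                   # the level v(x) is defined over
    atom_level: dict = field(default_factory=dict)        # learned later: atomic composition
    mechanism_level: dict = field(default_factory=dict)   # learned later: the gear's role

def v(model: WorldModel) -> float:
    """Terminal value: 'this gear should spin.' Reads only the gear-level state."""
    return 1.0 if model.gear_rotating else 0.0

before = WorldModel(gear_rotating=True)
after = WorldModel(gear_rotating=True,
                   atom_level={"iron_atoms": 10**23},
                   mechanism_level={"drives_piston": True})

assert v(before) == v(after)  # the epistemic update at other levels changed nothing about v
```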
(I agree that it’s a pretty contrived setup, but I think it’s very valuable to tease out the specific phenomena at play – and I think “value translation” and “value conflict resolution” and “ontology crises” are highly distinct, and your model somewhat muddles them up.)
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear’s role in the whole mechanism)?
I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it’s fine if the specific gear gets swapped out for a copy. I’m not assuming there’s a mistake anywhere, I’m just assuming you switch from caring about one type of property it has (physical) to another (functional).
In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.
I picture you saying “well, you could just not prioritize them”. But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing “this particular gear”, but you realize that atoms are constantly being removed and new ones added (implausibly, but let’s assume it’s a self-repairing gear), and so there’s no clear line between this gear and some other gear. Whereas if there’s a clear, simple definition of “the gear that moves the piston”, then valuing that could be much simpler.
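To gesture at that complexity difference, a toy sketch (Python; all names and thresholds below are invented): the physical-identity pointer needs an arbitrary cutoff for how many of the original atoms still count, while the functional pointer is a single relational check in the higher-level model.

```python
# Toy sketch (all names and thresholds invented). The physical-identity pointer
# has to pick an arbitrary cutoff because the self-repairing gear's atoms keep
# getting swapped; the functional pointer is one relational check.

def is_the_original_gear(gear_atoms: set, original_atoms: set, overlap: float = 0.5) -> bool:
    # "This particular gear": fuzzy, threshold-dependent physical identity.
    return len(gear_atoms & original_atoms) / len(original_atoms) >= overlap

def is_piston_driving_gear(gear_id: str, mechanism: dict) -> bool:
    # "The gear that moves the piston": a single functional property.
    return mechanism["drives"].get("piston") == gear_id

mechanism = {"drives": {"piston": "gear_7"}}
print(is_the_original_gear({"a1", "a8", "a9"}, {"a1", "a2", "a3", "a4"}))  # False: only 1/4 original atoms left
print(is_piston_driving_gear("gear_7", mechanism))                          # True
```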
Zooming out: previously you said
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value v(x), and it doesn’t run into direct conflict with any other values you have, and your model of x isn’t wrong at the abstraction level it’s defined at, you’ll never want to change v(x).
I’m worried that we’re just talking about different things here, because I totally agree with what you’re saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do), you’re going to systematize your values. And second, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, you’re going to systematize your values.
Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.
E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating
That seems more like value reflection, rather than a value change?
The way I’d model it is: you have some value v(x), whose implementation you can’t inspect directly, and some guess about what it is, P(v(x)). (That’s how it often works in humans: we don’t have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of “what if we swap the gear for a different one: which one would you care about then?”, your model of that value put the majority of probability mass on v1(x), which was “I value this particular gear”. But upon considering Q, your probability distribution over v(x) changed, and now it puts most probability on v2(x), defined as “I care about whatever gear is moving the piston”.
Importantly, that example doesn’t seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.
(Notably, that’s only possible in scenarios where you don’t have direct access to your values! Where they’re black-boxed, and you have to infer their internals from the outside.)
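A rough sketch of that picture (Python; the hypotheses and numbers below are purely illustrative assumptions): the update is an ordinary Bayesian one, but the evidence is introspective (your predicted reaction to the hypothetical) rather than any change to the object-level model of the mechanism.

```python
# Rough sketch (hypotheses and numbers are illustrative only). The agent holds a
# distribution over guesses about its own black-boxed value; considering the
# gear-swap hypothetical is introspective evidence that shifts that distribution,
# while the object-level model of the mechanism stays exactly the same.

prior = {
    "v1: I value this particular gear": 0.7,
    "v2: I value whatever gear drives the piston": 0.3,
}

# Predicted introspective reaction to the hypothetical ("I'd still care about the
# piston-driving gear"), as a likelihood under each hypothesis about v(x).
likelihood = {
    "v1: I value this particular gear": 0.1,
    "v2: I value whatever gear drives the piston": 0.9,
}

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnormalized.values())
posterior = {h: p / z for h, p in unnormalized.items()}

print(posterior)  # ~0.21 on v1, ~0.79 on v2: P(v(x)) shifted, v(x) itself untouched
```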
the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations
Sounds right, yep. I’d argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.
insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values
Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and the more it wants to shift them, the stronger it needs to be. The simplest values are no values, after all.
I suppose I see what you’re getting at here, and I agree that it’s a real dynamic. But I think it’s less important/load-bearing to how agents work than the basic “value translation in a hierarchical world-model” dynamic I’d outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.
And I think it’s not even particularly strong in humans? “I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who’s easier to get along with” is something that definitely happens. But most instances of value extrapolation aren’t like this.