Thane Ruthenis comments on Value systematization: how values become coherent (and misaligned)

Thane Ruthenis 27 Oct 2023 21:39 UTC
LW: 22 AF: 9
−3
AF
I’d previously sketched out a model basically identical to this one, see here and especially here.
… but I’ve since updated away from it, in favour of an even simpler explanation.
The major issue with this model is the assumption that either (1) the SGD/evolution/whatever-other-selection-pressure will always convergently instill the drive for doing value systematization into the mind it’s shaping, or (2) that agents will somehow independently arrive at it on their own; and that this drive will have overwhelming power, enough to crush the object-level values. But why?
I’d had my own explanation, but let’s begin with your arguments. I find them unconvincing.
- First, the reason we expect value compilation/systematization to begin with is because we observe it in humans, and human minds are not trained the way NN models are trained. Moreover, the instances where we note value systematization in humans seem to have very little to do with blind-idiot training algorithms (like the SGD or evolution) at all. Instead, it tends to happen when humans leverage their full symbolic intelligence to do moral philosophy.
  - So the SGD-specific explanations are right out.
- So we’re left with the “the mind itself wants to do this” class of explanations. But it doesn’t really make sense. If you value X, Y, Z, and those are your terminal values, whyever would you choose to rewrite yourself to care about W instead? If W is a “simple generator” of X, Y, Z, that would change… precisely nothing. You care about X, Y, Z, not about W; nothing more to be said. Unless given a compelling external reason to switch to W, you won’t do that.
  - At most you’ll use W as a simpler proxy measure for fast calculations of plans in high-intensity situations. But you’d still periodically “look back” at X, Y, Z to ensure you’re still following them; W would ever remain just an instrumental proxy.
So, again, we need the drive for value systematization to be itself a value, and such a strong one that it’s able to frequently overpower whole coalitions of object-level values. Why would that happen?
My sketch went roughly as follows: Suppose we have a system at an intermediary stage of training. So far it’s pretty dumb; all instinct and no high-level reasoning. The training objective is $U$ ; the system implements a set of contextually-activated shards/heuristics that cause it to engage in contextual behaviors $B_{i}$ . Engaging in any $B_{i}$ is correlated with optimizing for $U$ , but every $B_{i}$ is just that: an “upstream correlate” of $U$ , and it’s only a valid correlate in some specific context. Outside that context, following $B_{i}$ would not lead to optimizing for $U$ ; and the optimizing-for- $U$ behavior is only achieved by a careful balance of the contextual behaviors.
Now suppose we’ve entered the stage of training at which higher-level symbolic intelligence starts to appear. We’ve grown a mesa-optimizer. We now need to point it at something; some goal that’s correlated with $U$ . Problem: we can only point it at goals concepts corresponding to which are present in its world-model, and $U$ might not even be there yet! (Stone-age humans had no idea about inclusive genetic fitness or pleasure-maximization.) We only have a bunch of $B_{i}$ s...
In that case, instilling a drive for value systematization seems like the right thing to do. No $B_{i}$ is a proper proxy for $U$ , but the weighted sum of them, $B_{Σ}$ , is. So that’s what we point our newborn agent at. We essentially task it with figuring out for what purpose it was optimized, and then tell it to then go do that thing. We hard-wire this objective into it, and make it have an overriding priority.
(But of course $B_{Σ}$ is still an imperfect proxy for $U$ , and the agent’s attempts to figure out $B_{Σ}$ are imperfect as well, so it still ends up misaligned from $U$ .)
That story still seems plausible to me. It’s highly convoluted, but it makes sense.
… but I don’t think there’s any need for it.
I think “value systematization” is simply the reflection of the fact that the world can be viewed as a series of hierarchical ever-more-abstract models.
- Suppose we have a low-level model of reality $E^{0}$ , with $n_{0}$ variables (atoms, objects, whatever).
- Suppose we “abstract up”, deriving a more simple model of the world $E^{1}$ , with $n_{1}$ variables. Each variable $e_{i}^{1}$ in it is an abstraction over some set of lower-level variables ${e_{k}^{0}}$ , such that $f ({e_{k}^{0}}) = e_{i}^{1}$ .
- We iterate, to $E^{2}$ , $E^{3}$ , …, $E^{Ω}$ . Caveat: $n_{l} ≪ n_{l - 1}$ . Since each subsequent level is simpler, it contains fewer variables. People to social groups to countries to the civilization; atoms to molecules to macro-scale objects to astronomical objects; etc.
- Let’s define the function $f^{- 1} (e_{i}^{l}) = P ({e_{k}^{l - 1}} | e_{i}^{l})$ . I. e.: it returns a probability distribution over the low-level variables given the state of a high-level variable that abstracts over them. (E. g., if the world economy is in this state, how happy my grandmother is likely to be?)
- If we view our values as an utility function $u$ , we can “translate” our utility function from any $e_{k}^{l - 1}$ to $e_{i}^{l}$ roughly as follows: $u (e_{k}^{l - 1}) \to u^{'} (e_{i}^{l}) = E [u (e_{k}^{l - 1}) | f^{- 1} (e_{i}^{l})]$ . (There’s a ton of complications there, but this expression conveys the concept.)
… and then value systematization just naturally falls out of this.
Suppose we have a bunch of values at $l$ th abstraction level. Once we start frequently reasoning at $(l + 1)$ th level, we “translate” our values to it, and cache the resultant values. Since the $(l + 1)$ th level likely has fewer variables than $l$ th, the mapping-up is not injective: some values defined over different low-level variables end up translated to the same higher-level variable (“I like Bob and Alice” → “I like people”). This effect only strengthens as we go up higher and higher; and at $E^{Ω}$ , we can plausibly end up with only one variable we value (“eudaimonia” or something).
That does not mean we stop caring about our lower-level values. Nay: those translations are still instrumental, we simply often use them to save on processing costs.
… or so it ideally should be. But humans are subject to value drift. An ideal agent would never forget the distinction between the terminal and the instrumental; humans do. And so the more often a given human reasons at larger scales compared to lower scales, the more they “drift” towards higher-level values, such as going from deontology to utilitarianism.
Value concretization is simply the exact same process, but mapping a higher-level value down the abstraction levels.
For me, at least, this explanation essentially dissolves the question of value systematization; I perceive no leftover confusion.
In a very real sense, it can’t work any other way.
What links here?
- Nathan Helm-Burger's comment on Value systematization: how values become coherent (and misaligned) by Richard_Ngo (22 Dec 2023 5:23 UTC; 2 points)
- Thane Ruthenis's comment on Value systematization: how values become coherent (and misaligned) by Richard_Ngo (12 Jan 2024 3:07 UTC; 2 points)
- Richard_Ngo 27 Oct 2023 23:24 UTC
  LW: 13 AF: 6
  2
  AF Parent
  Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.
  But I think the mistake you’re making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don’t understand them very well until we discover the higher-level abstraction.
  For example, before Darwin people had some concept that organisms seemed to be “well-fitted” for their environments, but it was a messy concept entangled with their theological beliefs. After Darwin, their concept of fitness changed. It’s not that they’ve drifted into using the new concept, it’s that they’ve realized that the old concept was under-specified and didn’t really make sense.
  Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so. After systematization, you do. (And you also have answers to questions like “what counts as lying?” or “is X racist?”, which without systematization are often underdefined.)
  That’s where the tradeoff comes from. You can conserve your values (i.e. continue to care terminally about lower-level representations) but the price you pay is that they make less sense, and they’re underdefined in a lot of cases. Or you can simplify your values (i.e. care terminally about higher-level representations) but the price you pay is that the lower-level representations might change a lot.
  And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.
  - Kaj_Sotala 28 Oct 2023 20:42 UTC
    LW: 5 AF: 2
    2
    AF Parent
    Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so. After systematization, you do. (And you also have answers to questions like “what counts as lying?” or “is X racist?”, which without systematization are often underdefined.) [...]
    You can conserve your values (i.e. continue to care terminally about lower-level representations) but the price you pay is that they make less sense, and they’re underdefined in a lot of cases. [...] And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.
    I think we should be careful to distinguish explicit and implicit systematization. Some of what you are saying (e.g. getting answers to question like “what counts as lying”) sounds like you are talking about explicit, consciously done systematization; but some of what you are saying (e.g. minds identifying aspects of thinking that “don’t make sense” and correcting them) also sounds like it’d apply more generally to developing implicit decision-making procedures.
    I could see the deontologist solving their problem either way—by developing some explicit procedure and reasoning for solving the conflict between their values, or just going by a gut feel for which value seems to make more sense to apply in that situation and the mind then incorporating this decision into its underlying definition of the two values.
    I don’t know how exactly deontological rules work, but I’m guessing that you could solve a conflict between them by basically just putting in a special case for “in this situation, rule X wins over rule Y”—and if you view the rules as regions in state space where the region for rule X corresponds to the situations where rule X is applied, then adding data points about which rule is meant to cover which situation ends up modifying the rule itself. It would also be similar to the way that rules work in skill learning in general, in that experts find the rules getting increasingly fine-grained, implicit and full of exceptions. Here’s how Josh Waitzkin describes the development of chess expertise:
    Let’s say that I spend fifteen years studying chess. [...] We will start with day one. The first thing I have to do is to internalize how the pieces move. I have to learn their values. I have to learn how to coordinate them with one another. [...]
    Soon enough, the movements and values of the chess pieces are natural to me. I don’t have to think about them consciously, but see their potential simultaneously with the figurine itself. Chess pieces stop being hunks of wood or plastic, and begin to take on an energetic dimension. Where the piece currently sits on a chessboard pales in comparison to the countless vectors of potential flying off in the mind. I see how each piece affects those around it. Because the basic movements are natural to me, I can take in more information and have a broader perspective of the board. Now when I look at a chess position, I can see all the pieces at once. The network is coming together.
    Next I have to learn the principles of coordinating the pieces. I learn how to place my arsenal most efficiently on the chessboard and I learn to read the road signs that determine how to maximize a given soldier’s effectiveness in a particular setting. These road signs are principles. Just as I initially had to think about each chess piece individually, now I have to plod through the principles in my brain to figure out which apply to the current position and how. Over time, that process becomes increasingly natural to me, until I eventually see the pieces and the appropriate principles in a blink. While an intermediate player will learn how a bishop’s strength in the middlegame depends on the central pawn structure, a slightly more advanced player will just flash his or her mind across the board and take in the bishop and the critical structural components. The structure and the bishop are one. Neither has any intrinsic value outside of its relation to the other, and they are chunked together in the mind.
    This new integration of knowledge has a peculiar effect, because I begin to realize that the initial maxims of piece value are far from ironclad. The pieces gradually lose absolute identity. I learn that rooks and bishops work more efficiently together than rooks and knights, but queens and knights tend to have an edge over queens and bishops. Each piece’s power is purely relational, depending upon such variables as pawn structure and surrounding forces. So now when you look at a knight, you see its potential in the context of the bishop a few squares away. Over time each chess principle loses rigidity, and you get better and better at reading the subtle signs of qualitative relativity. Soon enough, learning becomes unlearning. The stronger chess player is often the one who is less attached to a dogmatic interpretation of the principles. This leads to a whole new layer of principles—those that consist of the exceptions to the initial principles. Of course the next step is for those counterintuitive signs to become internalized just as the initial movements of the pieces were. The network of my chess knowledge now involves principles, patterns, and chunks of information, accessed through a whole new set of navigational principles, patterns, and chunks of information, which are soon followed by another set of principles and chunks designed to assist in the interpretation of the last. Learning chess at this level becomes sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity.
    “Sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity” also sounds to me like some of the models for higher stages of moral development, where one moves past the stage of trying to explicitly systematize morality and can treat entire systems of morality as things that all co-exist in one’s mind and are applicable in different situations. Which would make sense, if moral reasoning is a skill in the same sense that playing chess is a skill, and moral preferences are analogous to a chess expert’s preferences for which piece to play where.
    - Matthew_Opitz 29 Oct 2023 0:42 UTC
      3 points
      −2
      Parent
      Except that chess really does have an objectively correct value systemization, which is “win the game.” “Sitting with paradox” just means, don’t get too attached to partial systemizations. It reminds me of Max Stirner’s egoist philosophy, which emphasized that individuals should not get hung up on partial abstractions or “idées fixées” (honesty, pleasure, success, money, truth, etc.) except perhaps as cheap, heuristic proxies for one’s uber-systematized value of self-interest, but one should instead always keep in mind the overriding abstraction of self-interest and check in periodically as to whether one’s commitment to honesty, pleasure, success, money, truth, or any of these other “spooks” really are promoting one’s self-interest (perhaps yes, perhaps no).
      - Kaj_Sotala 29 Oct 2023 7:00 UTC
        4 points
        3
        Parent
        
        Except that chess really does have an objectively correct value systemization, which is “win the game.”
        
        Your phrasing sounds like you might be saying this as an objection to what I wrote, but I’m not sure how it would contradict my comment.
        
        The same mechanisms can still apply even if the correct systematization is subjective in one case and objective in the second case. Ultimately what matters is that the cognitive system feels that one alternative is better than the other and takes that feeling as feedback for shaping future behavior, and I think that the mechanism which updates on feedback doesn’t really see whether the source of the feedback is something we’d call objective (win or loss at chess) or subjective (whether the resulting outcome was good in terms of the person’s pre-existing values).
        
        “Sitting with paradox” just means, don’t get too attached to partial systemizations.
        
        Yeah, I think that’s a reasonable description of what it means in the context of morality too.
  - Thane Ruthenis 28 Oct 2023 0:10 UTC
    LW: 5 AF: 2
    −2
    AF Parent
    But I think the mistake you’re making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically
    Mm, I think there’s two things being conflated there: ontological crises (even small-scale ones, like the concept of fitness not being outright destroyed but just re-shaped), and the simple process of translating your preference around the world-model without changing that world-model.
    It’s not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people → social groups → countries. Our models of specific people we know, how we relate to them, etc., don’t change just because we’ve figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don’t actually start to care about our friends in a different way in the light of all this.
    In fact: Suppose we’ve magically created an agent that already starts our with a perfect world-model. It’ll never experience an ontology crisis in its life. This agent would still engage in value translation as I’d outlined. If it cares about Alice and Bob, for example, and it’s engaging in plotting at the geopolitical scales, it’d still be useful for it to project its care for Alice and Bob into higher abstraction levels, and start e. g. optimizing towards the improvement of the human economy. But optimizing for all humans’ welfare would still remain an instrumental goal for it, wholly subordinate to its love for the two specific humans.
    Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so
    I think you do, actually? Inasmuch as real-life deontologists don’t actually shut down when facing a values conflict. They ultimately pick one or the other, in a show of revealed preferences. (They may hesitate a lot, yes, but their cognitive process doesn’t get literally suspended.)
    I model this just as an agent having two utility functions, $u_{1}$ and $u_{2}$ , and optimizing for their sum $u_{1} + u_{2}$ . If the values are in conflict, if taking an action that maximizes $u_{1}$ hurts $u_{2}$ and vice versa — well, one of them almost surely spits out a higher value, so the maximization of $u_{1} + u_{2}$ is still well-defined. And this is how that goes in practice: the deontologist hesitates a bit, figuring out which it values more, but ultimately acts.
    There’s a different story about “pruning” values that I haven’t fully thought out yet, but it seems simple at a glance. E. g, suppose you have values $u_{1}$ , $u_{2}$ , $u_{3}$ , but optimizing for $u_{2}$ is always predicted to minimize $u_{1}$ and $u_{3}$ , and $u_{2}$ is always smaller than $u_{1} + u_{3}$ . (E. g., a psychopath loves money, power, and expects to get a slight thrill if he publicly kills a person.) In that case, it makes sense to just delete $u_{2}$ — it’s pointless to waste processing power on including it in your tradeoff computations, since it’s always outvoted (the psychopath conditions himself to remove the homicidal urge).
    There’s some more general principle here, where agents notice such consistently-outvoted scenarios and modify their values into a format where they’re effectively equivalent (still lead to the exact same actions in all situations) but simpler to compute. E. g., if $u_{2}$ sometimes got high enough to outvote $u_{1} + u_{3}$ , it’d still make sense for the agent to optimize it by replacing it with $u_{2}^{'}$ that only activated on those higher values (and didn’t pointlessly muddy up the computations otherwise).
    But note that all of this is happening at the same abstraction level. It’s not how you go from deontology to utilitarianism — it’s how you work out the kinks in your deontological framework.
    - Richard_Ngo 28 Oct 2023 19:19 UTC
      LW: 8 AF: 3
      5
      AF Parent
      It’s not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people → social groups → countries. Our models of specific people we know, how we relate to them, etc., don’t change just because we’ve figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don’t actually start to care about our friends in a different way in the light of all this.
      
      I actually think this type of change is very common—because individuals’ identities are very strongly interwoven with the identities of the groups they belong to. You grow up as a kid and even if you nominally belong to a given (class/political/religious) group, you don’t really understand it very well. But then over time you construct your identity as X type of person, and that heavily informs your friendships—they’re far less likely to last when they have to bridge very different political/religious/class identities. E.g. how many college students with strong political beliefs would say that it hasn’t impacted the way they feel about friends with opposing political beliefs?
      Inasmuch as real-life deontologists don’t actually shut down when facing a values conflict. They ultimately pick one or the other, in a show of revealed preferences.
      I model this just as an agent having two utility functions, $u_{1}$ and $u_{2}$ , and optimizing for their sum $u_{1} + u_{2}$ .
      This is a straightforwardly incorrect model of deontologists; the whole point of deontology is rejecting the utility-maximization framework. Instead, deontologists have a bunch of rules and heuristics (like “don’t kill”). But those rules and heuristics are underdefined in the sense that they often endorse different lines of reasoning which give different answers. For example, they’ll say pulling the lever in a trolley problem is right, but pushing someone onto the tracks is wrong, but also there’s no moral difference between doing something via a lever or via your own hands.
      I guess technically you could say that the procedure for resolving this is “do a bunch of moral philosophy” but that’s basically equivalent to “do a bunch of systematization”.
      Suppose we’ve magically created an agent that already starts our with a perfect world-model. It’ll never experience an ontology crisis in its life. This agent would still engage in value translation as I’d outlined.
      ...
      But optimizing for all humans’ welfare would still remain an instrumental goal for it, wholly subordinate to its love for the two specific humans.
      Yeah, I totally agree with this. The question is then: why don’t translated human goals remain instrumental? It seems like your answer is basically just that it’s a design flaw in the human brain, of allowing value drift; the same type of thing which could in principle happen in an agent with a perfect world-model. And I agree that this is probably part of the effect. But it seems to me that, given that humans don’t have perfect world-models, the explanation I’ve given (that systematization makes our values better-defined) is more likely to be the dominant force here.
      - Thane Ruthenis 28 Oct 2023 21:20 UTC
        LW: 4 AF: 2
        −2
        AF Parent
        I actually think this type of change is very common—because individuals’ identities are very strongly interwoven with the identities of the groups they belong to
        Mm, I’ll concede that point. I shouldn’t have used people as an example; people are messy.
        Literal gears, then. Suppose you’re studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a specific gear’s dynamics didn’t change — locally, it was as correct as it could ever be.
        And if you had a terminal utility function over that gear (e. g., “I want it to spin at the rate of 1 rotation per minutes”), that utility function won’t change in the light of your model expanding, either. Why would it?
        the whole point of deontology is rejecting the utility-maximization framework. Instead, deontologists have a bunch of rules and heuristics
        … which can be represented as utility functions. Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it’s perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.
        See Friston’s predictive-processing framework in neuroscience, plus this (and that comment).
        Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their symbolic intelligence, but similar dynamics are still at play “under the hood”.
        It seems like your answer is basically just that it’s a design flaw in the human brain, of allowing value drift
        Well, not a flaw as such; a design choice. Humans are trained in an on-line regime, our values are learned from scratch, and… this process of active value learning just never switches off (although it plausibly slows down with age, see old people often being “set in their ways”). Our values change by the same process by which they were learned to begin with.
        Mo Putera 28 Dec 2023 13:08 UTC
        3 points
        0
        Parent
        Tangentially:
        See Friston’s predictive-processing framework in neuroscience
        Nostalgebraist has argued that Friston’s ideas here are either vacuous or a nonstarter, in case you’re interested.
        Thane Ruthenis 28 Dec 2023 22:49 UTC
        1 point
        0
        Parent
        Yeah, I’m familiar with that view on Friston, and I shared it for a while. But it seems there’s a place for that stuff after all. Even if the initial switch to viewing things probabilistically is mathematically vacuous, it can still be useful: if viewing cognition in that framework makes it easier to think about (and thus theorize about).
        Much like changing coordinates from Cartesian to polar is “vacuous” in some sense, but makes certain problems dramatically more straightforward to think through.
        Richard_Ngo 11 Jan 2024 18:17 UTC
        LW: 2 AF: 2
        0
        AF Parent
        (drafted this reply a couple months ago but forgot to send it, sorry)
        your low-level model of a specific gear’s dynamics didn’t change — locally, it was as correct as it could ever be.
        And if you had a terminal utility function over that gear (e. g., “I want it to spin at the rate of 1 rotation per minutes”), that utility function won’t change in the light of your model expanding, either. Why would it?
        Let me list some ways in which it could change:
        Your criteria for what counts as “the same gear” changes as you think more about continuity of identity over time. Once the gear stars wearing down, this will affect what you choose to do.
        After learning about relativity, your concepts of “spinning” and “minutes” change, as you realize they depend on the reference frame of the observer.
        You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position. For example, you might have cared about “the gear that was driving the piston continuing to rotate”, but then realize that it’s a different gear that’s driving the piston than you thought.
        These are a little contrived. But so too is the notion of a value that’s about such a basic phenomenon as a single gear spinning. In practice almost all human values are (and almost all AI values will be) focused on much more complex entities, where there’s much more room for change as your model expands.
        Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it’s perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.
        This doesn’t actually address the problem of underspecification, it just shuffles it somewhere else. When you have to choose between two bad things, how do you do so? Well, it depends on which probability distributions you’ve chosen, which have a number of free parameters. And it depends very sensitively on free parameters, because the region where two deontological rules clash is going to be a small proportion of your overall distribution.
        Thane Ruthenis 11 Jan 2024 19:10 UTC
        LW: 2 AF: 1
        0
        AF Parent
        Let me list some ways in which it could change:
        If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the “value translation” phenomenon from the “ontology crisis” phenomenon.
        
        Shifts in the agent’s model of what counts as “a gear” or “spinning” violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.
        
        Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear’s role in the whole mechanism)?
        I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value $v (x)$ , and it doesn’t run into direct conflict with any other values you have, and your model of $x$ isn’t wrong at the abstraction level it’s defined at, you’ll never want to change $v (x)$ .
        You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position
        That’s the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over “the gear that was driving the piston”. After you learn it’s a different gear from the one you thought, that value isn’t updated: you just naturally shift it to the real gear.
        Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money… but then you realize that was a mistake, and actually your friend’s number is B, so you send the money there. That didn’t involve any value-related shift.
        I’ll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which “individual humans” are defined. However, there are also:
        Some higher-level dynamics you’re not accounting for, like the impact your friend’s job has on the society.
        Some lower-level dynamics you’re unaware of, like the way your friend’s mind is implemented at the levels of cells and atoms.
        My claim is that, unless you have terminal preferences over those other levels, then learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.
        Granted, that’s an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend’s job is net-harmful at the societal level, that’ll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend’s job is hurting them. Less realistically, you may have some preferences over cells, and you may want to… convince your friend to change their diet so that their cellular composition is more in-line with your aesthetic, or something weird like that.
        But if that isn’t the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won’t change it.
        Much like you’ll never change your preferences over a gear’s rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism’s functionality or that gear’s atomic composition.
        (I agree that it’s a pretty contrived setup, but I think it’s very valuable to tease out the specific phenomena at play – and I think “value translation” and “value conflict resolution” and “ontology crises” are highly distinct, and your model somewhat muddles them up.)
        ^
        Although there may be higher-level dynamics you’re not tracking, or lower-level confusions. See the friend example below.
        Richard_Ngo 12 Jan 2024 2:24 UTC
        LW: 3 AF: 2
        0
        AF Parent
        Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear’s role in the whole mechanism)?
        I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it’s fine if the specific gear gets swapped out for a copy. I’m not assuming there’s a mistake anywhere, I’m just assuming you switch from caring about one type of property it has (physical) to another (functional).
        In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.
        I picture you saying “well, you could just not prioritize them”. But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing “this particular gear”, but you realize that atoms are constantly being removed and new ones added (implausibly, but let’s assume it’s a self-repairing gear) and so there’s no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of “the gear that moves the piston”—then valuing that could be much simpler.
        Zooming out: previously you said
        I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value $v (x)$ , and it doesn’t run into direct conflict with any other values you have, and your model of $x$ isn’t wrong at the abstraction level it’s defined at, you’ll never want to change $v (x)$ .
        I’m worried that we’re just talking about different things here, because I totally agree with what you’re saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values. And secondly, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, then you’re going to systematize your values.
        Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.
        Thane Ruthenis 12 Jan 2024 3:07 UTC
        LW: 2 AF: 1
        0
        AF Parent
        E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating
        That seems more like value reflection, rather than a value change?
        The way I’d model it is: you have some value $v (x)$ , whose implementations you can’t inspect directly, and some guess about what it is $P (v (x))$ . (That’s how it often works in humans: we don’t have direct knowledge of how some of our values are implemented.) Before you were introduced to the question $Q$ of “what if we swap the gear for a different one: which one would you care about then?”, your model of that value put the majority of probability mass on $v_{1} (x)$ , which was “I value this particular gear”. But upon considering $Q$ , your PD over $v (x)$ changed, and now it puts most probability on $v_{2} (x)$ , defined as “I care about whatever gear is moving the piston”.
        Importantly, that example doesn’t seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.
        (Notably, that’s only possible in scenarios where you don’t have direct access to your values! Where they’re black-boxed, and you have to infer their internals from the outside.)
        the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations
        Sounds right, yep. I’d argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.
        insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values
        Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.
        I suppose I see what you’re getting at here, and I agree that it’s a real dynamic. But I think it’s less important/load-bearing to how agents work than the basic “value translation in a hierarchical world-model” dynamic I’d outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.
        And I think it’s not even particularly strong in humans? “I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who’s easier to get along with” is something that definitely happens. But most instances of value extrapolation aren’t like this.
  - Wei Dai 28 Oct 2023 21:24 UTC
    LW: 4 AF: 3
    0
    AF Parent
    
    Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so.
    
    Why is this a problem, that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of “go with whichever value/preference/intuition that feels stronger in the moment”? People do that unthinkingly all the time, right? (I have my own thoughts on this, but curious if you agree with me or what your own thinking is.)
    
    And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.
    
    How would you cash out “don’t make sense” here?