Richard_Ngo comments on Richard Ngo’s Shortform

Richard_Ngo 6 Mar 2024 20:40 UTC
LW: 134 AF: 71
76
AF
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.

This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.

The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

(I make a similar point in the appendix of my value systematization post.)
- Daniel Kokotajlo 7 Mar 2024 7:07 UTC
  16 points
  1
  Parent
  I am not as negative on it as you are—it seems an improvement over the ‘Bag O’ Heuristics’ model and the ‘expected utility maximizer’ model. But I agree with the critique and said something similar here:
  you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o’ heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o’ heuristics and rational agents. Namely, shard theory currently basically seems to be saying “At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!” My response is “but what happens in the middle? Seems super important! Also haven’t you just reproduced the problem but inside the head?” (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model… and then reproduces it in miniature! Progress, I guess.)
  Alex Turner replied with this:
  A shot at the diamond-alignment problem — LessWrong
  
  I think the hole is somewhat smaller than you make out, but still substantial. From The shard theory of human values:
  when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.
  By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model.
  [...]
  While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
  The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseum) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
  I have some more models beyond what I’ve shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there’s a substantial gap here. I’ve been working on writing out pseudocode for what shard-based reflective planning might look like.
- TurnTrout 11 Mar 2024 22:31 UTC
  LW: 8 AF: 6
  −4
  AF Parent
  In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
  Personally, I’m not ignoring that question, and I’ve written about it (once) in some detail. Less relatedly, I’ve talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai.
  It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
  I felt
  - a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
  - b) I wasn’t successfully communicating many intuitions;^[1] and
  - c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
  So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
  In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
  I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.
  1. ^
    The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.
  - Daniel Kokotajlo 22 Mar 2024 16:31 UTC
    LW: 10 AF: 8
    0
    AF Parent
    FWIW I’m potentially intrested in interviewing you (and anyone else you’d recommend) and then taking a shot at writing the 101-level content myself.
  - Daniel Kokotajlo 14 Mar 2024 5:36 UTC
    LW: 2 AF: 2
    0
    AF Parent
    a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
    Curious to hear whether I was one of the people who contributed to this.
    - TurnTrout 18 Mar 2024 18:32 UTC
      LW: 4 AF: 4
      2
      AF Parent
      Nope! I have basically always enjoyed talking with you, even when we disagree.
      - Daniel Kokotajlo 18 Mar 2024 19:11 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Ok, whew, glad to hear.
- tailcalled 6 Mar 2024 21:38 UTC
  4 points
  −20
  Parent
  But shard theorists mainly aim to address agency obtained via DPO-like setups, and @TurnTrout has mathematically proved that such setups don’t favor the power-seeking drives AI safety researchers are usually concerned about in the context of agency.
  - rotatingpaguro 11 Mar 2024 1:28 UTC
    1 point
    0
    Parent
    I read the section you linked, but I can’t follow it. Anyway, here it is its conclusive paragraph:
    Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval $[0, 1]$ , there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
    From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
    - tailcalled 11 Mar 2024 9:53 UTC
      3 points
      0
      Parent
      A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences. E.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
      - rotatingpaguro 11 Mar 2024 18:52 UTC
        3 points
        0
        Parent
        Ok. Then I’ll say that randomly assigned utility over full trajectories are beyond wild!
        The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.
        tailcalled 11 Mar 2024 19:35 UTC
        5 points
        −2
        Parent
        The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, what this means is that people might not simply blindly follow the instructions given by AIs and rate them based on the ultimate outcome (even if the outcome differs wildly from what they’d intuitively think it’d do), but rather they might think about the instructions the AIs provide, and rate them based on whether they a priori make sense. If the AI then has some galaxybrained method of achieving something (which traditionally would be instrumentally convergent) that humans don’t understand, then that method will be negatively reinforced (because people don’t see the point of it and therefore downvote it), which eliminates dangerous powerseeking.