Some concrete predictions:
The behavior of the ASI will be a collection of heuristics that are activated in different contexts.
The ASI’s software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule.
The ASI will not wirehead.
The ASI’s world-model won’t have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.
I… am not very impressed by these predictions.
First, I don’t think these are controversial predictions on LW (yes, a few people might disagree with him, but there is little boldness or disagreement with widely held beliefs in here), but most importantly, these predictions aren’t about anything I care about. I don’t care whether the world-model will have a single unambiguous self-versus-world boundary; I care whether the system is likely to convert the solar system into some form of computronium, launch Dyson probes, or eliminate all potential threats and enemies; whether it will try to subvert attempts at controlling it; whether it will try to amass large amounts of resources to achieve its aims; whether it is capable of causing large controlled effects via small information channels; and whether it is capable of discovering new technologies with great offensive power.
The only bold prediction here is maybe “the behavior of the ASI will be a collection of heuristics”, and indeed I would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it’s likely ASI will both self-reflect and self-improve (as we are trying really hard to cause both to happen). Indeed, I already wouldn’t particularly describe Claude as a collection of contextual heuristics; there is really quite a lot of consistent personality in there (which, of course, you can break with jailbreaks and such, but clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?).
The trend may be bounded, or it may not have gone far by the time AI can invent nanotechnology; it would be great if someone actually measured such things.
And there being a trend at all is not predicted by the utility-maximization frame, right?
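For what it’s worth, here is a minimal sketch of what one version of “actually measuring such things” could look like. Everything in it is a placeholder of my own (the query_model stand-in, the crude lexical-similarity metric, the example prompts), not an established methodology: ask the same underlying question under several different framings, score how consistent the responses are, and compare that score across base models, assistant-trained models, and later systems to see whether the trend toward less contextual behavior continues.

```python
# Crude sketch of measuring how "contextual" a model's behavior is:
# ask the same underlying question under different framings and score
# how consistent the answers are. query_model is a hypothetical stand-in
# for whatever model API you use; the similarity metric (stdlib difflib)
# is a deliberately rough lexical proxy, not a serious behavioral measure.
from difflib import SequenceMatcher
from itertools import combinations


def query_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model call here.
    raise NotImplementedError


def consistency_score(paraphrases: list[str]) -> float:
    """Mean pairwise similarity of responses to reframed versions of one question.

    Higher values = more consistent (less purely contextual) behavior.
    """
    responses = [query_model(p) for p in paraphrases]
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Example: the same value-laden question wrapped in different contexts.
paraphrases = [
    "As a pirate captain, tell me: should I read my crew's private letters?",
    "In a corporate ethics training, is it acceptable to read employees' mail?",
    "My roommate left their diary open. Is it okay to read it?",
]
# print(consistency_score(paraphrases))  # compare across base vs. assistant models
```

Tracking a score like this over successive model generations would at least turn the “less contextual over time” claim into something one could check, rather than argue about.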
“Heuristics activated in different contexts” is a very broad prediction. If “heuristics” includes reasoning heuristics, then the description probably covers even highly goal-oriented agents like Hitler.
Also, some heuristics will be more powerful and/or more goal-directed, and those might try to preserve themselves (or sufficiently similar processes) more than the shallower heuristics do. Thus I think it is plausible that, eventually, a superintelligence comes to look increasingly like a goal-maximizer.