I… am not very impressed by these predictions.
First, I don’t think these are controversial predictions on LW (yes, a few people might disagree with him, but there is little boldness or disagreement with widely held beliefs in here), but most importantly, these predictions aren’t about anything I care about. I don’t care whether the world-model will have a single unambiguous self-versus-world boundary. I care whether the system is likely to convert the solar system into some form of computronium, or launch Dyson probes, or eliminate all potential threats and enemies, or whether the system will try to subvert attempts at controlling it, or whether it will try to amass large amounts of resources to achieve its aims, or whether it will be capable of causing large controlled effects via small information channels, or of discovering new technologies with great offensive power.
The only bold prediction here is maybe “the behavior of the ASI will be a collection of heuristics”, and indeed I would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it’s likely ASI will both self-reflect and self-improve (as we are trying really hard to cause both to happen). Indeed, I already wouldn’t particularly describe Claude as a collection of contextual heuristics; there is really quite a lot of consistent personality in there (which, of course, you can break with jailbreaks and stuff, but clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?).
The trend may be bounded, and it may not go far by the time AI can invent nanotechnology; it would be great if someone actually measured such things.
And there being a trend at all is not predicted by the utility-maximization frame, right?