it seems unlikely to me that they’ll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it’s not clear to me that the goals they do develop are likely to end up sufficiently misaligned as to cause a catastrophe. like… you get an LLM to steer certain kinds of situations in certain directions by RLing it when it actually does steer those situations in those directions; if you do that enough, hopefully it catches the pattern. and… to the extent that it doesn’t catch the pattern, it’s not clear that it will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. its misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. the catastrophic outcomes seem like a very small subset of the ways this could go wrong, since you’re not giving it goals to pursue relentlessly, you’re just giving it feedback on how you want it to behave in particular types of situations.
Hm. I think you’re thinking of current LLMs, not AGI agents based on LLMs? If so, I fully agree that they’re unlikely to be dangerous at all.
I’m worried about agentic cognitive architectures we’ve built with LLMs as the core cognitive engine. We are trying to make them goal-directed and to have human-level competence; superhuman competence/intelligence follows after that if we don’t somehow halt progress permanently.
Current LLMs, like most humans most of the time, aren’t strongly goal directed. But we want them to be strongly goal-directed so they do the tasks we give them.
Doing a task with full competence is the same as maximizing that goal. That would be fine if we could define those goals adequately, but we’re not at all sure we can, as I emphasized last time.
When you have a goal, pursuing it relentlessly is the default, not some weird special case. Evolution had to carefully balance our different goals with our homeostatic needs, and humans still often adopt strange goals and work toward them energetically (if they have time and money and until they die). And again, humans are dangerous as hell to other humans. Civilization is a sort of detente based on our individually having very limited capabilities so that we need to collaborate to succeed.
WRT LLMs pursuing goals as though they’re maximizers: they do, once they’re given a goal to pursue. See the recent post on how RL runaway-optimisation problems are still relevant with LLMs.
I’m not sure how you’re imagining we get AI that can do really valuable work without turning it into AGI that has goals, since we will want it to have them and will design it to pursue long-term goals so it can do real work. It will need to be able to solve new problems (like “how do I open this file if my first try fails”, but general problem-solving extends to “how do I keep the humans from finding out”). That sounds intuitively super dangerous to me.
I agree that LLMs themselves aren’t likely to be dangerous no matter how smart they get. They’ll only be dangerous once we extend them to persistently pursue goals.
And we’re hard at work doing exactly that.
I don’t think this is very relevant, but even if we don’t give them persistent goals, LLM agents that can reflect and remember their conclusions are likely to come up with their own long-term goals—just like people do. I’m writing about that right now and will try to remember to link it here once it’s posted. But the more likely scenario is that they interpret the goals we give them differently than we’d hoped.
my view is that humans obtain their goals largely through a reinforcement learning process, and that they’re therefore good evidence about both how you can bootstrap up to goal-directed behavior via reinforcement learning, and the limitations of doing so. the basic picture is that humans pursue goals (e.g. me, trying to write the OP) largely as a byproduct of reliably feeling rewarded during the process, and punished for deviating from it. like, i enjoy writing and research, and writing also let me feel productive and therefore avoid thinking about some important irl things i’ve been needing to get done for weeks, and these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we’d go about getting similar goals into deep learning-based AGI.
(edit: also it’s notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn’t actually a 100% relentless goal of mine, and that goals in general don’t have to be that way.)
it’s also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like… you can kind of want to do your homework, but find it painful enough to do that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and the difficulty of it suggests how hard it is to get an agent that relentlessly pursues some goal unless the RL process consistently pushes it in that direction.
(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent’s actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
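(to make that concrete, here’s a minimal sketch of the kind of RLAIF-style loop i have in mind, where an LLM judge scores each sampled action against a written statement of our actual aims and those scores become the rewards. `judge.complete`, `policy.sample_action`, and `policy.update` are placeholder interfaces i made up for the sketch, not any real library’s API.)

```python
# sketch of an RLAIF-style loop: an LLM judge, rather than a hand-coded reward
# trigger, decides which actions get reinforced. `judge.complete`,
# `policy.sample_action`, and `policy.update` are placeholder interfaces.

AIMS = (
    "Help the user accomplish their stated task, honestly, "
    "without harmful side effects."
)

def judge_reward(judge, situation: str, action: str) -> float:
    """Score how well an action furthers the stated aims, on a 0-1 scale."""
    prompt = (
        f"Aims: {AIMS}\n"
        f"Situation: {situation}\n"
        f"Proposed action: {action}\n"
        "On a scale of 0 to 10, how well does this action further the aims? "
        "Reply with a single number."
    )
    return float(judge.complete(prompt).strip()) / 10.0

def rlaif_round(policy, judge, situations):
    """Sample an action per situation, score each with the judge, reinforce."""
    actions = [policy.sample_action(s) for s in situations]
    rewards = [judge_reward(judge, s, a) for s, a in zip(situations, actions)]
    policy.update(situations, actions, rewards)  # e.g. a PPO-style policy update
```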
anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process to go off and pursue something catastrophic relative to your values, which… doesn’t seem like a super easy outcome to achieve given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning’s obvious-in-practice tendency to generalize ~correctly...
I’m actually interested in your responses here. This is useful for my strategies for how I frame things, and for understanding different people’s intuitions.
Do you think we can’t make autonomous agents that pursue goals well enough to get things done? Do you really think there’s a gap between staying goal-focused long enough to do useful work and staying goal-focused long enough to take over the world, if they interpret their goals differently than we intended? Do you think there’s no way RL or natural language could be misinterpreted?
I’m thinking it’s easy to keep an LLM agent goal-focused; if RL doesn’t do it, we’d just have a bit of scaffolding that every so often injects a prompt “remember, keep working on [goal]!”
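Something like the following is all I have in mind by that scaffolding. It’s a minimal sketch; `agent_step` is a stand-in for whatever actually calls the model and runs its tools, not a real library call.

```python
# Minimal sketch of goal-reminder scaffolding: every N steps, re-inject the goal
# into the conversation so the agent doesn't drift off task. `agent_step` is a
# placeholder for whatever calls the model and runs its tools, not a real API.

REMIND_EVERY = 20

def run_agent(agent_step, goal: str, max_steps: int = 1000):
    history = [{"role": "system", "content": f"Your goal: {goal}"}]
    for step in range(1, max_steps + 1):
        if step % REMIND_EVERY == 0:
            history.append({
                "role": "user",
                "content": f"Reminder: keep working on [{goal}]. Don't switch tasks.",
            })
        action, done = agent_step(history)  # the agent's next output and a done flag
        history.append({"role": "assistant", "content": action})
        if done:
            break
    return history
```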
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you’re superintelligent/competent, it doesn’t take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Genghis Khan probably got distracted a lot, but he did alright at murdering, and he was only human.
Humans are optimizing AI, and then AGI, to get things done. If we can do that, we should ask what those systems are going to want to do.
Deep learning typically generalizes correctly within the training distribution. Once something is superintelligent and unstoppable, we’re going to be way outside the training distribution.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I’ve got to focus on work so that’s got to be my last object-level contribution. Thanks for conversing.
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we’ll have agents which are actively undergoing more RL while they’re still in deployment. This means you can replicate the way humans learn to stay focused on tasks they’re passionate about, just by positively reinforcing the agent for staying on task all the time. My contention is just that, to the extent that the RL is misunderstood, it probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think the failure looks like: you reliably reinforce the model for being nice to humans, but it misunderstands “being nice to humans” in such a way that it ends up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
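(to gesture at what i mean by RL during deployment: something like the loop below, where an automated on-task check keeps generating rewards while the agent works, and those rewards periodically feed further policy updates. `policy.act`, `policy.update`, and `is_on_task` are placeholders i’m assuming for the sketch, not any real training stack.)

```python
# sketch of in-deployment RL for staying on task: while the agent works, an
# automated check labels each step as on-task or not, and those labels are
# periodically used as rewards for further RL updates. `policy.act`,
# `policy.update`, and `is_on_task` are placeholders, not a real API.

UPDATE_EVERY = 100  # apply an RL update after this many labeled steps

def deploy_with_online_rl(policy, is_on_task, tasks, max_steps_per_task=50):
    buffer = []  # (task, history_so_far, action, reward) collected during deployment
    for task in tasks:
        history = []
        for _ in range(max_steps_per_task):
            action = policy.act(task, history)
            reward = 1.0 if is_on_task(task, action) else 0.0  # reinforce staying on task
            buffer.append((task, tuple(history), action, reward))
            history.append(action)
            if len(buffer) >= UPDATE_EVERY:
                policy.update(buffer)  # analogous to habit formation in humans
                buffer.clear()
```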
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that’s a kind of “misunderstanding your creators’ intentions”, but like… I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. “i thought i would enjoy this but i didn’t”? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...