I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.
I agree with jdp’s comment about robustness vs. fragility (e.g., I agree that the SolidGoldMagikarp thing is not a central example of the sorts of failures to watch out for), but think that this is missing Yudkowsky’s point. Running AGIs doing something pivotal are not passively safe; by the time they are anywhere in the ballpark of being competent enough to succeed, the failure mode ‘they become a dangerous adversary of humanity’ is at least as plausible as the failure mode ‘they do something stupid and fizzle out’ and the failure mode ‘they do something obviously evil and get shut down and the problem studied, understood, and fixed.’ (Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
Agree with this point, though mostly because I expect the failure mode “they do something stupid and fizzle out” to get less probability in my models as we get closer to AGI and ASI.
I actually agree that there will be far too much temptation to patch problems rather than fix them directly. While I do think we may well be able to directly fix misalignment problems in the future (though more of my hope comes from avoiding misalignment in the first place via synthetic data, because prevention is easier than cure), under race conditions the AI labs could well decide to ditch the techniques that actually fix problems, even if those techniques have a reasonable cost under non-race conditions:
(Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
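To make the ‘shallow patch’ idea concrete, here is a minimal, purely illustrative Python sketch (all function names and data shapes are hypothetical, not any lab’s actual pipeline) of the two corner-cutting patches the quote describes; the point is that neither one involves studying or understanding why the bad behavior appeared.

```python
# Illustrative sketch only: hypothetical helpers for the two shallow patches
# described in the quoted comment. Neither diagnoses *why* the bad behavior
# appeared, which is exactly the corner being cut.

def patch_by_prompt_edit(system_prompt: str, bad_behavior: str) -> str:
    """'Edit the prompt to more clearly say not to do that.'"""
    return system_prompt + f"\nDo not {bad_behavior}, under any circumstances."

def patch_by_training_against(train_set: list[dict], bad_transcripts: list[str]) -> list[dict]:
    """'Just train against the bad behavior': add the observed failures as
    negative examples and fine-tune again, with no root-cause analysis."""
    negatives = [{"text": t, "label": "rejected"} for t in bad_transcripts]
    return train_set + negatives

if __name__ == "__main__":
    # Hypothetical usage: both patches are cheap, which is why they are
    # tempting under race conditions.
    print(patch_by_prompt_edit("You are a helpful assistant.", "deceive the user"))
    patched = patch_by_training_against(
        [{"text": "benign transcript", "label": "accepted"}],
        ["transcript where the model deceived the user"],
    )
    print(len(patched))
```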
We will absolutely need to change lab cultures as we get closer to AGI and ASI.
OK cool we are on the same page here also then