Agree with this point, though mostly because the failure mode "they do something stupid and fizzle out" gets less probability in my models as we get closer to AGI and ASI.
I actually agree that there will be far too much temptation to patch problems rather than directly fix them. I do think we may well be able to directly fix misalignment problems in the future (though more of my hope comes from avoiding misalignment in the first place via synthetic data, because prevention is easier than curing a problem). But under race conditions, the AI labs could well decide to ditch the techniques that actually fix problems, even if those techniques have a reasonable cost in non-race conditions:
(Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
We will absolutely need to change lab cultures as we get closer to AGI and ASI.
OK cool, we're on the same page here too, then.