“Catch misalignment early...”—This should have been “scary misalignment”, e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don’t think we’ve seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we’re less likely to spot this until it’s too late, and more generally that truthful LM work is less likely to “scale gracefully” to AGI. It’s interesting that you don’t share these intuitions.
Does this mean I agree or disagree with “our current picture of the risks is incomplete?”
As mentioned, this phrase should probably be replaced by “a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment”. There isn’t supposed to be a binary cutoff for “significant portion”; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such as AI persuasion (as well as other potential risks that haven’t been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.
Nothing to apologize for, it was reasonably clear, I’m just trying to learn more about what you believe and why. This has been helpful, thanks!
I totally agree that in fast takeoff scenarios we are less likely to spot those things until it’s too late. I guess I also agree that truthful LM work is less likely to scale gracefully to AGI in fast takeoff scenarios… so I suppose I agree with your overall point… I just notice I feel a bit confused and muddled about it, is all. I can imagine plausible slow-takeoff scenarios in which truthful LM work doesn’t scale gracefully, and plausible fast-takeoff scenarios in which it does. At least, I think I can. The former scenario would be something like: it turns out the techniques we develop for making dumb AIs truthful stop working once the AIs get smart, for similar reasons that the techniques we use to make small children honest (or, to put it more vividly, believe in Santa) stop working once they grow up. The latter scenario would be something like: actually that’s not the case, the techniques work all the way up past human-level intelligence, and “fast takeoff” in practice means “throttled takeoff”, where the leading AI project knows it has a few-month lead over everyone else and is using those months to do some sort of iterated distillation and amplification, in which it’s crucial that the early stages be truthful and that the techniques scale to stage N overseeing stage N+1.