Rob Bensinger comments on Evaluating the historical value misspecification argument

Rob Bensinger 5 Oct 2023 21:22 UTC
14 points
10
“Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.”
“You very clearly thought that was a major part of the problem” implies that if you could go to Eliezer-2008 and convince him “we’re going to solve a lot of NLP a bunch of years before we get to ASI”, he would respond with some version of “oh great, that solves a major part of the problem!”. Which I’m pretty sure is false.
In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage “really good NLP” to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.
Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?
(Or say some other update we should be making on the basis of “really good NLP today”, like “therefore we’ll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y”.)
- gallabytes 5 Oct 2023 21:36 UTC
  1 point
  6
  To pick a toy example, you can use text as a bottleneck to force systems to “think out loud” in a way which will be very directly interpretable by a human reader, and because language understanding is so rich this will actually be competitive with other approaches and often superior.
  I’m sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.
  - Rob Bensinger 5 Oct 2023 22:03 UTC
    11 points
    0
    I appreciate the example!
    Are you claiming that this example solves “a major part of the problem” of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
    Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew’s claim seems to be ‘systems like GPT-4 are grounds for being a lot more optimistic about alignment’, and your claim is that systems like these solve “a major part of the problem”. Which is different from thinking ‘NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn’t crack open the problem in any major way’.
    It’s not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We’ve pretty consistently said that:
    1. The main problems lie in ‘we can safely and reliably aim ASI at a specific goal at all’.
    2. The problem of going from ‘we can aim the AI at a goal at all’ to ‘we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)’ is a smaller but nontrivial additional step.
    … Whereas I don’t think we’ve ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn’t equivalent to (or an obvious result of) ‘get the AI to understand corrigibility and nanotech’, or for that matter ‘get the AI to understand human preferences in general’.