Seth Herd comments on Internal independent review for language model agent alignment

Seth Herd 8 Jul 2023 22:16 UTC
4 points
0
I reread your post An LLM-based “exemplary actor”, which amounts to a similar plan for building an aligned LMA. I think you actually sound at least as optimistic as I am.
Many of your concerns are addressed by my focus on corrigibility. I’m nominating corrigibility as the most important, highest-ranked goal to give LMAs (or any other type of AGI, if we could figure out how to give it that goal in a way that generalizes as well as natural language does).
I think you’re right that even an approximately-aligned AGI might have values that diverge enough from ours to be a problem, and I think that problem is actually way worse than you’re thinking. I’m working on a new post on the alignment stability problem, elucidating how a small alignment difference might get worse under reflection. I think the solution for that is long-term corrigibility, so that we can correct divergences in AGI alignment when they (perhaps inevitably) occur.
To your first point: the multipolar scenario with LMAs (many intelligent AIs) does seem like a huge downside. Other approaches share this downside, but it’s made worse if it turns out to be easy to make autonomous LMAs.
On your other points, I agree that the solution is imperfect. I just think other network-based AGI approaches are worse. Probably most algorithmic approaches as well (since arguably Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc). Language-based AGI seems uniquely interpretable, even if it’s not perfectly interpretable.
- Roman Leventov 9 Jul 2023 18:15 UTC
  5 points
  −3
  Parent
  I’m very skeptical of corrigibility as an important property of ‘safe’ AI. Who will decide that an AI’s models, actions, or decisions are not OK? A global committee of people, who would already be biased by the existing situation? How is corrigibility distinct from micromanagement? In short, corrigibility is very much about goal alignment rather than model alignment, whereas model alignment is more important and more robust than goal alignment.
  Apart from that, I don’t see how corrigibility addresses any of the problems that I’ve listed.
  > I’m working on a new post on the alignment stability problem elucidating how a small alignment difference might get worse under reflection. I think the solution for that is long-term corrigibility, so that we can correct divergences in AGI alignment when they (perhaps inevitably) occur.
  The first problem is noticing these divergences at all. Before we know it, AIs will simplify our values and channel human development in a certain restricted direction, in which we will no longer see the problem. (And even if some lone voices notice, nobody will stop the giant economic mechanism because of this… Such pleas that go against economic forces have always led to absolutely nothing throughout the last several centuries, with few exceptions. And since AI will have unprecedented control over human thoughts, ideas, emotions, values, etc., this will definitely be hopeless in the future.)
  However, I don’t see alignment stability, as such, as a problem. Again, I think discussing “stability of goals” is very misguided. Goals should change all the time, as a reaction to changing circumstances, including very high-level goals. Models could also change, maybe over longer time intervals, but there is no problem with that. If we know how to model-align AI with humans at the current moment, I don’t see a problem with re-training AI every year to re-align it with slowly changing models.
  > On your other points, I agree that the solution is imperfect. I just think other network-based AGI approaches are worse. Probably most algorithmic approaches as well (since arguably Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc). Language-based AGI seems uniquely interpretable, even if it’s not perfectly interpretable.
  I don’t see any other approaches that look relatively coherent and doable in a few years (except maybe Conjecture’s CoEms, to which I give the benefit of the doubt because I don’t know what secret ideas they have; OK, maybe also Open Agency Architecture).
  Maybe a relatively surprising thing, which wasn’t apparent a few years ago, is that OpenAI’s superalignment plan doesn’t even look that bad. I definitely don’t say that it’s destined to lead to terrible outcomes, as Yudkowsky and others keep insisting. But I also see enough problems with it (some of which I listed above, along with others I didn’t list: sociotechnical and geopolitical problems, inadequate execution of the plan, etc.) that I think adopting this plan at this moment is reckless. From this perspective, the fact that the plan doesn’t look that bad might be a curse rather than a blessing. If there were no adequate-looking plan in sight, maybe that would motivate key decision-makers to do what I think we should actually do: ban AGI development, make humans much smarter and much more peaceful through genetic engineering (solve the Girardian curse of mimesis), solve economic scarcity, innovate global institutions and coordination mechanisms, and then revisit the AGI development task in a Manhattan-style project where the whole of humanity works towards the same end.