“Friendly AI” is a technical term from the past that has mostly been replaced by “aligned AI” today. However, I’m using it here to refer to aligned AI conforming to an aspect of the original proposal for Friendly AI, which is that it be designed to be aligned, say in a mathematically provable way, rather than as an engineered process that approaches alignment by approximation.
It’s still the case that humans are choosing what criteria make a Friendly AI aligned, so there is some risk of missing the objective of aligned AI, but this avoids Goodharting because there’s no optimization being applied. Of course, Goodharting could always slip back in depending on the process used to come up with the criteria a Friendly AI would be built to provably have, which makes the challenge of building one quite hard!
As to your second set of questions, which seem to hinge on what I mean by optimization: I just mean choosing one thing over another to try to make the world look one way rather than another. If that still seems vague, it’s because optimization is a very common process that basically just requires a feedback loop and a signal (reward functions are a very complex type of signal).
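To make both points concrete, here’s a minimal, hedged sketch in Python (the functions and numbers are mine, purely for illustration, not anything from the original discussion): optimization in this bare sense is just a feedback loop that keeps whichever option a signal scores higher, and the Goodhart worry from the previous paragraph appears as soon as that signal is a gameable proxy for what we actually care about.

```python
import random

def optimize(signal, x, steps=10_000, step_size=0.1):
    """Optimization in the bare sense: a feedback loop plus a signal.
    Propose a nearby option; keep it if the signal prefers it."""
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if signal(candidate) > signal(x):  # the feedback step
            x = candidate
    return x

def true_value(x):
    """A stand-in for what we actually care about; best at x = 1."""
    return -(x - 1.0) ** 2

def proxy(x):
    """A measurable but gameable criterion: correlated with true_value,
    yet it also rewards pushing x ever higher."""
    return true_value(x) + 2.0 * x

x_star = optimize(proxy, x=0.0)
print(x_star, true_value(x_star))  # lands near x = 2, where true_value is worse than at x = 1
```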
Friendly AI, which is that it be designed to be aligned, say in a mathematically provable way, rather than as an engineered process that approaches alignment by approximation.
I think I understand that now, thank you!
this avoids Goodharting because there’s no optimization being applied
I’m confused again here. Is this implying that a Friendly AI, per the definition above, is not an optimizer?
I am very pessimistic about being able to align an AI without any sort of feedback loop on the reward (thus, without optimization). The world’s overall transition dynamics are likely to be chaotic, so the “initial state” of an AI that is provably aligned without feedback would need to be exactly the right one to obtain the outcome we want. It could be that the chaos does not affect what we care about, but I’m unsure about that; even linear systems can be chaotic.
It is not an endeavour as clearly impossible as “build an open-loop controller for this dynamical system”, but I think it’s similar.
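A hedged illustration of the worry (a standard chaos example, not something from the original comments): in a chaotic system, an open-loop plan’s outcome is extremely sensitive to the initial state, so even a tiny error in “exactly the right one” compounds until it dominates.

```python
def step(x, r=3.9):
    """One iteration of the logistic map, a textbook chaotic system."""
    return r * x * (1.0 - x)

# Two trajectories whose initial states differ by one part in a billion.
a, b = 0.200000000, 0.200000001
for _ in range(60):
    a, b = step(a), step(b)

print(abs(a - b))  # after ~60 steps the two trajectories bear no resemblance to each other
```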
I’m confused again here. Is this implying that a Friendly AI, per the definition above, is not an optimizer?
No. It’s saying the process by which Friendly AI is designed is not an optimizer (although see my caveats in the previous reply about choosing alignment criteria; it’s still technically optimization, but constrained as much as possible to eliminate the normal Goodharting mechanism). The AI itself pretty much has to be an optimizer to do anything useful.
I am very pessimistic about being able to align an AI without any sort of feedback loop on the reward (thus, without optimization). The world’s overall transition dynamics are likely to be chaotic, so the “initial state” of an AI that is provably aligned without feedback would need to be exactly the right one to obtain the outcome we want. It could be that the chaos does not affect what we care about, but I’m unsure about that; even linear systems can be chaotic.
I’m similarly pessimistic, as it seems quite a hard problem and after 20 years we still don’t really know how to start (or so I think; maybe MIRI folks feel differently and think we have made some real progress here). Hence bootstrapping to alignment may be the best alternative, given that I think totally abandoning the Friendly AI strategy is also a bad choice.