Friendly AI, which is that it be designed to be aligned, say in a mathematically provable way, rather than as an engineered process that approaches alignment by approximation.
I think I understand that now, thank you!
this avoids Goodharting because there’s no optimization being applied
I’m confused again here. Is this implying that a Friendly AI, per the definition above, is not an optimizer?
I am very pessimistic about being able to align an AI without any sort of feedback loop on the reward (thus, without optimization). The world’s overall transition dynamics are likely to be chaotic, so the “initial state” of an AI that is provably aligned without feedback would need to be exactly the right one to obtain the outcome we want. It could be that the chaos does not affect what we care about, but I’m unsure about that; even linear systems can be chaotic.
It is not an endeavour as clearly impossible as “build an open-loop controller for this dynamical system”, but I think it’s similar.
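To make the intuition concrete, here’s a minimal toy sketch (my own illustrative assumptions: the logistic map standing in for the world’s transition dynamics, and a crude proportional correction standing in for feedback; this is not a model of an actual AI). A tiny error in the “initial state” blows up when there is no feedback, while even a sloppy feedback loop keeps the trajectory near the one we wanted:

```python
def step(x, r=4.0):
    # Logistic map; r = 4 is in the chaotic regime, so nearby states separate exponentially fast.
    return r * x * (1.0 - x)

target = 0.3                # the trajectory we actually want
open_loop = 0.3 + 1e-6      # "right at design time", but off by a hair; never corrected
closed_loop = 0.3 + 1e-6    # same initial error, but corrected by feedback each step

for t in range(31):
    target = step(target)
    open_loop = step(open_loop)                    # no feedback: the error compounds
    closed_loop = step(closed_loop)
    closed_loop += 0.9 * (target - closed_loop)    # feedback: observe the deviation and correct

    if t % 10 == 0:
        print(f"t={t:2d}  open-loop error={abs(open_loop - target):.2e}  "
              f"closed-loop error={abs(closed_loop - target):.2e}")
```

After a few dozen steps the open-loop error is of order one (i.e., the trajectory is completely unrelated to the intended one), while the closed-loop error stays tiny. That’s the sense in which “get the initial state exactly right, then never look again” seems like a losing strategy to me.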
I’m confused again here. Is this implying that a Friendly AI, per the definition above, is not an optimizer?
No. It’s saying the process by which Friendly AI is designed is not an optimizer (although see my caveats in the previous reply about choosing alignment criteria; it’s still technically optimization, but constrained as much as possible to eliminate the normal Goodharting mechanism). The AI itself pretty much has to be an optimizer to do anything useful.
I am very pessimistic about being able to align an AI without any sort of feedback loop on the reward (thus, without optimization). The world’s overall transition dynamics are likely to be chaotic, so the “initial state” of an AI that is provably aligned without feedback would need to be exactly the right one to obtain the outcome we want. It could be that the chaos does not affect what we care about, but I’m unsure about that; even linear systems can be chaotic.
I’m similarly pessimistic, as it seems quite a hard problem and after 20 years we still don’t really know how to start (or so I think; maybe MIRI folks feel differently and believe we have made some real progress here). Hence why I think bootstrapping to alignment may be the best alternative, given that totally abandoning the Friendly AI strategy is also a bad choice.