Logan, for your preferred alignment approach, how likely is it that the alignment remains durable over time? A superhuman AGI will understand the choices its creators made to align it. It will be capable of comparing its current programming with counterfactuals where it's not aligned. It will also have the ability to alter its own code. So what if it determines that its best course of action is to alter the very code that maintains its alignment? How would this be prevented?
I will try to do a longer write-up sometime, but in a Bureaucracy of AIs, no individual AI is actually superhuman (just as Google collectively knows more than any single human, but no individual at Google is superhuman).
It stays aligned because there is always a human in the loop: the whole organization simply competes to produce plans, which are then approved by human reviewers (under some sort of futarchy-style political system). Importantly, some of the AIs compete by creating plans, while other AIs compete by explaining to the human reviewers how dangerous those plans are. The sketch below shows the basic shape of one review round.
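To make the division of labor concrete, here is a minimal Python sketch of one round of that process. All the names here (`Plan`, `Critique`, `bureaucracy_round`, `human_review`) are my own illustrative inventions, not part of any real system; the point is only that the AIs propose and critique, while approval authority sits entirely with the humans.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    proposer: str       # which AI produced this plan
    description: str

@dataclass
class Critique:
    critic: str         # which AI produced this critique
    plan: Plan
    risk_summary: str   # the critic's case for why the plan is dangerous

def bureaucracy_round(proposers, critics, human_review):
    """One round: proposer AIs generate plans, critic AIs attack them,
    and only the human reviewers can approve anything."""
    plans = [ai.propose() for ai in proposers]
    critiques = [c.critique(p) for p in plans for c in critics]
    # Humans see every plan alongside its critiques; the AIs never
    # approve or execute anything themselves.
    approved = [p for p in plans
                if human_review(p, [c for c in critiques if c.plan is p])]
    return approved
```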
All of the individual AIs in the Bureaucracy operate under very strict controls on their source code, their training data, how long they are allowed to run, how much compute they have access to, and when and how they communicate with the outside world and with each other (see the sketch after this paragraph). They are very much not allowed to alter their own source code (except after extensive review by the outside humans who govern the system).
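One way to picture those controls is as an immutable per-AI policy enforced by the surrounding sandbox rather than by the AI itself. Again, this is a hypothetical sketch with invented field names, not a proposed implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the AI cannot mutate its own limits
class AIControls:
    source_code_hash: str    # pinned; changing it requires external human review
    training_data_hash: str  # pinned training corpus
    max_runtime_seconds: int # hard wall-clock budget per task
    max_compute_flops: float # compute ceiling
    allowed_peers: frozenset # which other AIs it may message
    external_io: bool        # whether it may talk to the outside world at all

def may_message(controls: AIControls, peer: str) -> bool:
    """The sandbox, not the AI, decides whether a message is allowed."""
    return peer in controls.allowed_peers
```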