Clippy is a thought experiment used to illustrate two ideas: terminal goals are orthogonal to capabilities (“the AI does not love you”), and agents pursuing almost any terminal goal tend to converge on instrumental goals like resource acquisition and self-preservation (“the AI does not hate you, but...”). Together these highlight the fact that a highly capable AI can be dangerous even if it’s reliably pursuing some known goal and the goal isn’t ambitious or malicious. For that reason, Clippy comes up a lot as an intuition pump for why we need to get started early on safety research.
But ‘a system causes harm in the course of reliably pursuing some known, stable, obviously-non-humane goal’ describes only a small minority of the disaster scenarios MIRI researchers are actually worried about. Not because it looks easy to go from a highly reliable diamond maximizer to an aligned superintelligence, but because there appear to be many more ways for things to go wrong before we ever get to that point. In particular:
1. We can fail to understand an advanced AI system well enough to know how ‘goals’ are encoded in it, forcing us to infer and alter goals indirectly.
2. We can understand the system’s ‘goals,’ but find they’re in the wrong idiom for a safe superintelligence (e.g., encoded as rewards for a reinforcement learner).
3. We can understand the system well enough to specify its goals, but not understand our own goals fully or precisely enough to specify them correctly. We come up with an intuitively ‘friendly’ goal (something more promising-sounding than ‘maximize the number of paperclips’), but it’s still the wrong goal. (The toy sketch after this list makes the gap concrete.)
4. Similarly: we can understand the system well enough to specify safe behavior in its initial context, but the system stops being safe after it or its environment undergoes a change. An example of this is instability under self-modification.
5. We can build advanced AI systems without realizing (or without caring) that they have consequentialist goals. This includes systems we don’t realize are powerful optimizers, e.g., ones whose goal-oriented behavior may depend in complicated ways on the interaction of multiple AI systems, or ones that function as unnoticed subsystems of non-consequentialists.
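To make (3) concrete, here is a minimal toy sketch in Python. Everything in it is invented for illustration (the grid, the `FRAGILE` tile, the `plan` function), not anything MIRI has proposed: the objective we actually write down is a step-cost reward for reaching a goal tile, the *intended* objective also includes not crossing a fragile tile, and an ordinary planner optimizing the written-down reward crosses the fragile tile because that intention never made it into the reward.

```python
# Hypothetical toy example: the specified reward is a proxy for the intended
# goal, and an optimizer of the proxy violates the intention.
from itertools import product

GOAL = (2, 1)       # intended destination
START = (0, 1)
FRAGILE = (1, 1)    # the tile we *intended* the agent to avoid
STEP_COST = -1.0    # the reward we actually specified: -1 per step, 0 at the goal
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def neighbors(state):
    """Adjacent tiles inside the 3x3 grid."""
    x, y = state
    for dx, dy in MOVES:
        nx, ny = x + dx, y + dy
        if 0 <= nx < 3 and 0 <= ny < 3:
            yield (nx, ny)

def plan(start):
    """Value-iterate on the *specified* reward, then follow it greedily."""
    values = {s: 0.0 for s in product(range(3), range(3))}
    for _ in range(50):                     # more than enough sweeps for a 3x3 grid
        for s in values:
            if s != GOAL:
                values[s] = max(STEP_COST + values[n] for n in neighbors(s))
    path, state = [start], start
    while state != GOAL:
        state = max(neighbors(state), key=values.__getitem__)
        path.append(state)
    return path

path = plan(START)
print("optimal path under the specified reward:", path)
print("crosses the fragile tile we meant to protect:", FRAGILE in path)
```

A planner that knew the intended goal would take the four-step detour around (1, 1); the one optimizing what we actually wrote down goes straight through it, and nothing in the code is buggy or adversarial. The gap is entirely in the specification.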
OK, so now I think I understand, and our models match up better than I’d thought. You’re basically saying that (1)-(2) and (4)-(5) are a major portion of the alignment research that actually needs doing, even while (3) has become, so to speak, the famous “Hard Problem of” FAI, when in fact it’s only (let’s lazily call it) 20% of what actually needs doing.
I can also definitely buy, based on what I’ve read, that better formalisms for (1), (2), (4), and (5) can all help make (3) easier.