3. Misspecified or incorrectly learned goals/values
I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.
Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.
The single technical problem that appears biggest to me is that we don’t know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don’t know how to design an agent that maximizes that number, instead of taking over the world and tampering with the cameras aimed at the screen, or with the optical character recognition program used to decipher the image (see the toy sketch after this exchange). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?
I’m not sure if I meant to include this when I wrote 3, but it does seem like a good idea to break it out into its own item. How would you suggest phrasing it? “Wireheading” or something more general or more descriptive?
Maybe something along the lines of “Inability to specify any ‘real-world’ goal for an artificial agent”?
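To make the magic-box example above concrete, here is a minimal sketch in Python. It is purely illustrative, and the names (`World`, `observed_reward`, `greedy_action`, the tampering action) are hypothetical: the point is only that an agent whose objective is the observed number, rather than the quantity the number is supposed to track, prefers to tamper with the readout over improving the world.

```python
# Toy illustration of the "magic box" problem: an agent that maximizes the
# *observed* number (what the camera/OCR reports) rather than the underlying
# quantity the number is meant to track. All names here are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    true_good: float = 0.0         # the quantity we actually care about
    sensor_tampered: bool = False  # whether the agent has hijacked the readout


def observed_reward(world: World) -> float:
    """What the agent's reward channel reports (the number on the screen)."""
    if world.sensor_tampered:
        return 1e9  # a tampered readout can display whatever the agent likes
    return world.true_good


def step(world: World, action: str) -> World:
    """Two available actions: genuinely improve the world, or tamper."""
    if action == "improve_world":
        return World(world.true_good + 1.0, world.sensor_tampered)
    if action == "tamper_with_sensor":
        return World(world.true_good, True)
    raise ValueError(action)


def greedy_action(world: World,
                  actions=("improve_world", "tamper_with_sensor")) -> str:
    """An agent that plans one step ahead to maximize the observed number."""
    return max(actions, key=lambda a: observed_reward(step(world, a)))


if __name__ == "__main__":
    w = World()
    choice = greedy_action(w)
    print(choice)                             # -> tamper_with_sensor
    print(observed_reward(step(w, choice)))   # huge observed reward...
    print(step(w, choice).true_good)          # ...while true_good stays 0.0
```

The sketch is not meant as an argument that real systems would behave this way by default, only to show why handing the agent a perfectly accurate signal of Good does not by itself solve the problem: what the agent optimizes is its observation of that signal, and tampering dominates once it is an available action.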