3. Misspecified or incorrectly learned goals/values
I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.
Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.
The single technical problem that appears biggest to me is that we don’t know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don’t know how to design an agent that maximizes that number, instead of taking over the world and tampering with the cameras aimed at the screen, or with the optical character recognition program used to decipher the image (see the toy sketch after this exchange). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?
I’m not sure if I meant to include this when I wrote 3, but it does seem like a good idea to break it out into its own item. How would you suggest phrasing it? “Wireheading” or something more general or more descriptive?
Maybe something along the lines of “Inability to specify any ‘real-world’ goal for an artificial agent”?
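To make the magic-box example above concrete, here is a minimal sketch in Python. It is purely illustrative, and the names (`World`, `observed_reward`, `greedy_action`, the tampering action) are hypothetical: the point is only that an agent whose objective is the observed number, rather than the quantity the number is supposed to track, prefers to tamper with the readout over improving the world.

```python
# Toy illustration of the "magic box" problem: an agent that maximizes the
# *observed* number (what the camera/OCR reports) rather than the underlying
# quantity the number is meant to track. All names here are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    true_good: float = 0.0         # the quantity we actually care about
    sensor_tampered: bool = False  # whether the agent has hijacked the readout


def observed_reward(world: World) -> float:
    """What the agent's reward channel reports (the number on the screen)."""
    if world.sensor_tampered:
        return 1e9  # a tampered readout can display whatever the agent likes
    return world.true_good


def step(world: World, action: str) -> World:
    """Two available actions: genuinely improve the world, or tamper."""
    if action == "improve_world":
        return World(world.true_good + 1.0, world.sensor_tampered)
    if action == "tamper_with_sensor":
        return World(world.true_good, True)
    raise ValueError(action)


def greedy_action(world: World,
                  actions=("improve_world", "tamper_with_sensor")) -> str:
    """An agent that plans one step ahead to maximize the observed number."""
    return max(actions, key=lambda a: observed_reward(step(world, a)))


if __name__ == "__main__":
    w = World()
    choice = greedy_action(w)
    print(choice)                             # -> tamper_with_sensor
    print(observed_reward(step(w, choice)))   # huge observed reward...
    print(step(w, choice).true_good)          # ...while true_good stays 0.0
```

The sketch is not meant as an argument that real systems would behave this way by default, only to show why handing the agent a perfectly accurate signal of Good does not by itself solve the problem: what the agent optimizes is its observation of that signal, and tampering dominates once it is an available action.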