Another problem is that the system cannot represent and communicate the whole predicted future history of the universe to us. It has to choose some compact description. And the description can get a high evaluation both for being a genuinely good plan, or for neglecting to predict or mention bad outcomes and using persuasive language (if it’s a natural language description).
Maybe we can have the human also report his happiness daily, and have the make_readable subroutine rewarded solely for how well the plan evaluation given beforehand matches the happiness levels reported afterwards? I don’t think that solves the problem of delayed negative consequences, or of bad consequences the human will never learn about, or of wireheading the human while using misleading descriptions of what’s happening, though.
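To make that matching reward concrete, here is a minimal sketch; the function name, the reward shape, and the form of the happiness reports are all hypothetical illustrations rather than anything specified above:

```python
# Hypothetical sketch: the make_readable subroutine is scored purely on how
# well the human's advance evaluation of the plan description matches the
# happiness the human actually reports afterwards. Nothing here addresses
# delayed consequences, unobserved harms, or misleading descriptions.

def make_readable_reward(prior_evaluation: float,
                         daily_happiness_reports: list[float]) -> float:
    observed = sum(daily_happiness_reports) / len(daily_happiness_reports)
    # Negative squared error: perfect agreement scores 0, any mismatch is penalized.
    return -(prior_evaluation - observed) ** 2


# Example: the plan was rated 0.9 beforehand, but reported happiness averaged 0.4.
print(make_readable_reward(0.9, [0.3, 0.5, 0.4]))  # ≈ -0.25
```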
> Another problem is that the system cannot represent and communicate the whole predicted future history of the universe to us.
This is a good point and one that I, foolishly, hadn’t considered.
However, it seems to me that there is a way to get around this. Specifically, just give the query-answerers the option to refuse to evaluate a description of a possible future. If they refuse, the AI’s utility function simply won’t return a value for that possible future.
To see how this works, note that if a description of a possible future world is too large for the human to understand, then the human can refuse to provide a utility for it.
Similarly, if the description of the future doesn’t specify the future with sufficient detail that the person can clearly tell if the described outcome would be good, then the person can also refuse to return a value.
For example, suppose you are making an AI designed to make paperclips, and suppose the AI queries the person asking for the utility of the possible future described by “The AI makes a ton of paperclips”. Then the person could refuse to answer, because the description is insufficient to specify the quality of the outcome; for example, it doesn’t say whether or not Earth got destroyed.
Instead, a possible future would only be rated as high utility if its description says something like, “The AI makes a ton of paperclips, and the world isn’t destroyed, and the AI doesn’t take over the world, and no creatures get tortured anywhere in our Hubble sphere, and creatures in the universe are generally satisfied”.
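To make the refusal option concrete, here is a minimal sketch of how the utility function could simply be left undefined on descriptions the human declines to rate; the length and caveat checks below are hypothetical stand-ins for human judgment, not part of the actual proposal:

```python
from typing import Optional

REFUSED = None  # sentinel meaning the human declined to assign a utility


def human_evaluate(description: str) -> Optional[float]:
    """Return a utility for a described future, or REFUSED.

    The human refuses when the description is too long to understand, or too
    vague to tell whether the outcome would actually be good (the checks below
    are crude placeholders for that judgment).
    """
    if len(description) > 10_000:  # too large to understand
        return REFUSED
    required_caveats = ["world isn't destroyed", "doesn't take over the world"]
    if not all(caveat in description for caveat in required_caveats):  # underspecified
        return REFUSED
    return 1.0  # placeholder rating; a real evaluator would judge the content


def utility(description: str) -> Optional[float]:
    # The AI's utility function has no value for refused descriptions, so a
    # plan cannot score highly on the strength of a vague, rosy summary.
    return human_evaluate(description)


# The vague description gets no utility; the detailed one gets rated.
print(utility("The AI makes a ton of paperclips"))  # None (refused)
print(utility("The AI makes a ton of paperclips, and the world isn't destroyed, "
              "and the AI doesn't take over the world, and creatures in the "
              "universe are generally satisfied."))  # 1.0
```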
Does this make sense?
I, of course, could always be missing something.
(Sorry for the late response)