One problem I see is that information about human psychology should be taken into account more explicitly, as an independent input to the model. For example, if we take a model M1 of the human mind with two parts, consciousness and unconsciousness, both centered around mental models with partial preferences, we get something like your theory. However, there could be another theory M2, also well supported by the psychological literature, with three internal parts (e.g. Id, Ego, Superego). I am not arguing that M2 is better than M1. I am arguing that M should be taken as an independent variable (and supported by extensive links to actual psychological and neuroscience research for each M).
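A minimal sketch of what I mean (my own illustration; all names here are hypothetical): the psychological model M enters the value-synthesis procedure as an explicit, swappable argument, together with the evidence supporting it, rather than being baked into the theory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PsychologicalModel:
    """One theory of the mind's internal structure (M1, M2, ...)."""
    name: str
    parts: List[str]        # e.g. ["conscious", "unconscious"] or ["id", "ego", "superego"]
    references: List[str]   # links to the psychology/neuroscience research supporting this M


def synthesise_utility(partial_preferences: Dict[str, float],
                       model: PsychologicalModel,
                       aggregate: Callable[[Dict[str, float], PsychologicalModel], float]) -> float:
    """The synthesised utility depends explicitly on M, so swapping M1 for M2 changes the output."""
    return aggregate(partial_preferences, model)


M1 = PsychologicalModel("two-part", ["conscious", "unconscious"], ["<citations>"])
M2 = PsychologicalModel("three-part", ["id", "ego", "superego"], ["<citations>"])
```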
In other words, as soon as we define human values via some theory V (there are around 20 theories of V among AI safety researchers alone, which I have collected in a list), we can create an AI which will learn V. However, the internal consistency of theory V is not evidence that it is actually good, as other theories of V are also internally consistent. Some way of testing is needed, maybe in the form of a game a human could play, so we could check what could go wrong. But to play such a game, the preference learning method needs to be specified in more detail.
While reading, I expected to get more on the procedure of learning partial preferences. However, it was not explained in detail; it was only mentioned (as I remember) that a future AI will be able to learn partial preferences by some deep-scanning method. But that is too advanced a method of value learning to be safe: we would have to give the AI very dangerous capabilities, like nanotech for brain reading, before it has learned human values, so the AI could start acting dangerously before it learns all these partial preferences. Other methods of value learning are safer, like ML analysis of previously written human literature, which would extract human norms from it. Probably some word2vec could do it even now.
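To make the idea concrete, here is a minimal sketch of that safer direction, assuming a pretrained word-embedding model (gensim’s downloadable GloVe vectors) as a stand-in for “some word2vec”. It only probes crude normative associations in human-written text, so it is an illustration rather than a full value-learning method.

```python
import gensim.downloader as api

# Pretrained embeddings learned from human-written text (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-100")

# Crudely probe which actions the corpus associates with "good" rather than "bad".
for action in ["helping", "sharing", "stealing", "killing"]:
    score = vectors.similarity(action, "good") - vectors.similarity(action, "bad")
    print(f"{action}: {score:+.3f}")
```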
Now, it may turn out that I don’t need the AI to know my whole utility function; I just want it to obey human norms plus do what I said. “Just bring me tea, without killing my cat or tiling the universe with teapots.” :)
Another thing which worries me about a personal utility function is that it could be simultaneously fragile (in time), grotesque, and underdefined, at least based on my self-observation. Thus, again, I would prefer collectively codified human norms (laws) over an extrapolated model of my utility function.
Thanks! On M1 vs M2: I agree these could reach different outcomes, but would either one be dramatically wrong? There are many “free variables” in the process, and the aim is that any reasonable setting of them ends up ok.
I’ll work on learning partial preferences.
“Just bring me tea, without killing my cat or tiling the universe with teapots.” [...] and underdefined, at least based on my self-observation. Thus, again, I would prefer collectively codified human norms (laws) over an extrapolated model of my utility function.
It might be underdefined in some general sense; I understand the feeling, I sometimes get it too. But in practice, it seems like it should ground out to “obey human orders about tea, or do something that the human strongly prefers to that”. Humans like their orders being obeyed, and presumably like getting what they ordered; so to disobey, you’d need to be very sure that there is a clearly better option for the human.
Of course, it might end up having a sexy server serve pleasantly drugged tea ^_^
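A rough sketch of that decision rule (my own toy formalisation, not the agenda’s): the order is followed unless the AI both estimates a clearly better option for the human and is highly confident in that estimate.

```python
from typing import Dict, Tuple


def choose_action(ordered_action: str,
                  order_utility: float,
                  alternatives: Dict[str, Tuple[float, float]],
                  margin: float = 1.0,
                  min_confidence: float = 0.95) -> str:
    """alternatives maps action -> (estimated utility for the human, confidence in that estimate)."""
    best, best_utility = ordered_action, order_utility
    for action, (utility, confidence) in alternatives.items():
        # Disobeying requires being very sure the alternative is clearly better for the human.
        if confidence >= min_confidence and utility >= best_utility + margin:
            best, best_utility = action, utility
    return best


# Low confidence that drugged tea really is better, so the order stands.
print(choose_action("bring tea", 1.0, {"serve pleasantly drugged tea": (5.0, 0.3)}))
```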
One more thing: your model assumes that mental models of situations actually pre-exist. However, imagine a preference between tea and coffee. Before I am asked, I don’t have any model and don’t have any preference. So I will generate some rather arbitrary model, like a large coffee and a small tea, and then make a choice. However, the mental model I generate depends on the framing of the question.
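A toy illustration of this point (my construction, not anything from the agenda): the “mental model” is generated on demand from the question’s framing, so two framings of the same tea-vs-coffee choice yield different expressed preferences.

```python
def build_mental_model(framing: str) -> dict:
    """Generate an ad-hoc mental model of the two options from how the question is framed."""
    if "energising" in framing:
        return {"coffee": {"size": "large", "salience": 0.9},
                "tea":    {"size": "small", "salience": 0.4}}
    # e.g. a "relaxing evening" framing
    return {"coffee": {"size": "small", "salience": 0.3},
            "tea":    {"size": "large", "salience": 0.8}}


def expressed_preference(framing: str) -> str:
    model = build_mental_model(framing)
    return max(model, key=lambda option: model[option]["salience"])


print(expressed_preference("Would you like an energising drink?"))  # -> coffee
print(expressed_preference("Would you like a relaxing drink?"))     # -> tea
```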
In some sense, here we are passing the buck of complexity from “values” to “mental models”, which are assumed to be stable and actually existing entities. However, we still don’t know what a separate “mental model” is, where it is located in the brain, or how it is actually encoded in neurons.
The human might have some taste preferences that would decide between tea and coffee, general hedonism preferences that might also settle it, and meta-preferences about how they should deal with future choices.
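A minimal sketch of how that could resolve the choice (illustrative numbers and weights, my own assumptions): the meta-preference weights the kinds of partial preferences that do already exist, even though no tea-vs-coffee model existed beforehand.

```python
# Pre-existing partial preferences, none of them specifically about "tea vs coffee today".
partial_preferences = {
    "taste":    {"tea": 0.6, "coffee": 0.4},   # mild standing preference for the taste of tea
    "hedonism": {"tea": 0.5, "coffee": 0.7},   # coffee is more immediately pleasant
}

# Meta-preference: how much weight each kind of partial preference gets for a new choice.
meta_weights = {"taste": 1.0, "hedonism": 0.5}

scores = {option: sum(meta_weights[kind] * prefs[option]
                      for kind, prefs in partial_preferences.items())
          for option in ("tea", "coffee")}
print(max(scores, key=scores.get), scores)  # -> tea
```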
Part of the research agenda, “grounding symbols”, is about trying to determine where these models are located.
In short, I am impressed, but not convinced :)