Figuring out what Alice wants, part I
This is a very preliminary two-part post sketching out the direction I’m taking my research now (second post here). I’m expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I’d be very grateful for any links to papers or people related to the ideas of this post.
The theory: model fragments
I’ve presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I’ll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.
I’ve mentioned a few ideas for “normative assumptions”: the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I’ve mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.
Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you’re less rational when you’re drunk).
Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model—“he’s being silly”, “she’s angry”, “you won’t get a promotion with that attitude”. These are called to mind, thought upon, and then swiftly dismissed.
So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.
Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what the model expected and what actually happened (similarly to temporal difference learning).
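To make that last parallel concrete, here is a minimal sketch (my own illustration, not anything from the post) of a temporal-difference-style error: the gap between what the human’s internal model predicted and what actually happened. Every name and number in it is an assumption chosen for illustration.

```python
# Hypothetical sketch: a TD-style error as a candidate "regret/surprise" signal.
# The model fragment supplies the expectation; the environment supplies the outcome.

def td_style_error(predicted_value, reward, next_predicted_value, discount=0.9):
    """Divergence between the model's expectation and the realised outcome.
    Large magnitude ~ surprise; large negative value ~ regret."""
    return reward + discount * next_predicted_value - predicted_value

# Example: the internal model expected a value of 5.0, but the observed reward
# plus the discounted value of the next state only comes to 1.0 + 0.9 * 2.0 = 2.8.
delta = td_style_error(predicted_value=5.0, reward=1.0, next_predicted_value=2.0)
print(delta)  # ~ -2.2: the outcome fell short of the expectation, a regret-like signal
```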
So I’ll broadly categorise methods of learning human model fragments into three categories:
Direct access to the internal model.
Regret and surprise as showing mismatches between model expectations and outcomes.
Privileged output (e.g. certain human statements in certain circumstances are taken to be true-ish statements about the internal model).
The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The latter two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
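As a toy illustration of the extensionality point (my own sketch, not from the post), here are two agents whose outward behaviour is identical in every state, while only one of them contains anything resembling a preference model. A method relying on direct access to the internals distinguishes them; behavioural data alone cannot.

```python
# Hypothetical sketch: identical input-output behaviour, different internal models.

class PlannerAgent:
    """Chooses whichever option its explicit reward model scores highest."""
    def __init__(self):
        self.reward_model = {"tea": 1.0, "coffee": 2.0}  # an explicit preference model

    def act(self, state):
        return max(self.reward_model, key=self.reward_model.get)

class LookupAgent:
    """Replays a hard-coded policy; it has no internal notion of reward at all."""
    def __init__(self):
        self.policy = {"morning": "coffee", "evening": "coffee"}

    def act(self, state):
        return self.policy[state]

# Behaviourally equivalent on every input, yet built from very different "models".
for state in ["morning", "evening"]:
    assert PlannerAgent().act(state) == LookupAgent().act(state) == "coffee"
```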
What model fragments look like
The second post will provide examples of the approach, but here I’ll just list the kinds of things that we can expect as model fragments (a rough data-structure sketch follows the list):
Direct statements about rewards (“I want chocolate now”).
Direct statements about rationality (“I’m irrational around them”).
An action is deemed better than another (“you should start a paper trail, rather than just rely on oral instructions”).
An action is seen as good (or bad) compared with some implicit set of standard actions (“compliment your lover often”).
Similarly to actions, observations/outcomes can be treated as above (“the second prize is actually better”, “it was unlucky you broke your foot”).
An outcome is seen as surprising (“that was the greatest stock market crash in history”), or the action of another agent is seen as that (“I didn’t expect them to move to France”).
A human can think these things about themselves or about other agents; the most complicated variants are assessing the actions of one agent from the perspective of another agent (“if she signed the check, he’d be in a good position”).
Finally, there are meta, meta-meta, etc. versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality or reward, about ourselves and about other humans.
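For concreteness, here is a rough sketch of how such fragments might be recorded as data; the fields and examples are my own guesses at a convenient encoding, not a structure proposed in the post.

```python
# Hypothetical encoding of model fragments: who is judging, who is judged,
# what kind of judgement it is, and how many meta-levels are involved.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelFragment:
    judge: str                      # the human whose internal model produced the fragment
    subject: str                    # the agent being judged (may be the judge themselves)
    kind: str                       # e.g. "reward", "rationality", "action-comparison", "surprise"
    content: str                    # the fragmentary judgement itself
    meta_level: int = 0             # 0 = direct judgement; 1 = judging another agent's model of us; ...
    baseline: Optional[str] = None  # the implicit set of standard actions/outcomes, if any

fragments = [
    ModelFragment("Alice", "Alice", "reward", "I want chocolate now"),
    ModelFragment("Alice", "Alice", "rationality", "I'm irrational around them"),
    ModelFragment("Alice", "Bob", "surprise", "I didn't expect them to move to France"),
]
```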
Moved back to drafts, given that I am 70% confident that this is still a draft (or maybe it’s some kind of game where I am supposed to figure out what Alice wants based on the sentence fragments in this post, feel free to move it back in that case).
Ooops! Sorry, this is indeed a draft.
I’m thinking about this and the sequel post, and trying to understand why you might be interested in this, since it doesn’t feel to me like you spell it out.
It seems we might care about model fragments if we think we can’t build complete models of other agents/things but can instead build partial models. The “we” building these models might be literally us, but also an AI or a composite agent like humanity. Having a theory of what to do with these model fragments is useful, then, if we want to address at least two questions that we might be worried about around these parts: how do we decide an AI is safe based on our fragmentary models of it, and how does an AI model humanity based on its fragmentary models of humans?
I’m looking at how humans model each other based on their fragmentary models, and using this to get to their values.
Thinking a bit more, it seems a big problem we may face in using model fragments is that they are fragments and we will have to find a way to stitch them together so that they fill the gaps between the models, perhaps requiring something like model interpolation. Of course, maybe this isn’t necessary if we think of fragments as mostly overlapping (although probably inconsistent in the overlaps) or of new fragments to fill gaps as available on demand if we discover we need them and don’t have them.
For contradictions: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately
I suspect dealing adequately with contradictions will be significantly more complicated than you propose, but I haven’t written about that in depth yet. When I get around to addressing what I view as necessary in this area (practicing moral particularism that will be robust to false positives), I definitely look forward to talking with you more about it.
I agree with you to some extent. That post is mainly a placeholder that tells me that the contradictions problem is not intrinsically unsolvable, so I can put it aside while I concentrate on this problem for the moment.