Figuring out what Alice wants, part I
This is a very preliminary two-part post sketching out the direction I’m taking my research now (second post here). I’m expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I’d be very grateful for any links to papers or people related to the ideas of this post.
The theory: model fragments
I’ve presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I’ll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.
I’ve mentioned a few ideas for “normative assumptions”: the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I’ve mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.
Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you’re less rational when you’re drunk).
Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model—“he’s being silly”, “she’s angry”, “you won’t get a promotion with that attitude”. These are called to mind, thought upon, and then swiftly dismissed.
So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.
Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what the model expected and what actually happened (similarly to temporal difference learning).
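To make that last parallel concrete, here is a minimal sketch (my own illustration, not anything from the post) of a temporal-difference-style error: the gap between what the human’s internal model predicted and what actually happened. Every name and number in it is an assumption chosen for illustration.

```python
# Hypothetical sketch: a TD-style error as a candidate "regret/surprise" signal.
# The model fragment supplies the expectation; the environment supplies the outcome.

def td_style_error(predicted_value, reward, next_predicted_value, discount=0.9):
    """Divergence between the model's expectation and the realised outcome.
    Large magnitude ~ surprise; large negative value ~ regret."""
    return reward + discount * next_predicted_value - predicted_value

# Example: the internal model expected a value of 5.0, but the observed reward
# plus the discounted value of the next state only comes to 1.0 + 0.9 * 2.0 = 2.8.
delta = td_style_error(predicted_value=5.0, reward=1.0, next_predicted_value=2.0)
print(delta)  # ~ -2.2: the outcome fell short of the expectation, a regret-like signal
```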
So I’ll broadly categorise methods of learning human model fragments into three categories:
Direct access to the internal model.
Regret and surprise as showing mismatches between model expectations and outcomes.
Privileged output (e.g. certain human statements in certain circumstances are taken to be true-ish statements about the internal model).
The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The latter two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
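As a toy illustration of the extensionality point (my own sketch, not from the post), here are two agents whose outward behaviour is identical in every state, while only one of them contains anything resembling a preference model. A method relying on direct access to the internals distinguishes them; behavioural data alone cannot.

```python
# Hypothetical sketch: identical input-output behaviour, different internal models.

class PlannerAgent:
    """Chooses whichever option its explicit reward model scores highest."""
    def __init__(self):
        self.reward_model = {"tea": 1.0, "coffee": 2.0}  # an explicit preference model

    def act(self, state):
        return max(self.reward_model, key=self.reward_model.get)

class LookupAgent:
    """Replays a hard-coded policy; it has no internal notion of reward at all."""
    def __init__(self):
        self.policy = {"morning": "coffee", "evening": "coffee"}

    def act(self, state):
        return self.policy[state]

# Behaviourally equivalent on every input, yet built from very different "models".
for state in ["morning", "evening"]:
    assert PlannerAgent().act(state) == LookupAgent().act(state) == "coffee"
```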
What model fragments look like
The second post will provide examples of the approach, but here I’ll just list the kinds of things that we can expect as model fragments (a rough data-structure sketch follows the list):
Direct statements about rewards (“I want chocolate now”).
Direct statements about rationality (“I’m irrational around them”).
An action is deemed better than another (“you should start a paper trail, rather than just rely on oral instructions”).
An action is seen as good (or bad) compared with some implicit set of standard actions (“compliment your lover often”).
Similarly to actions, observations/outcomes can be treated as above (“the second prize is actually better”, “it was unlucky you broke your foot”).
An outcome is seen as surprising (“that was the greatest stock market crash in history”), or the action of another agent is seen as that (“I didn’t expect them to move to France”).
A human can think these things about themselves or about other agents; the most complicated variants are assessing the actions of one agent from the perspective of another agent (“if she signed the check, he’d be in a good position”).
Finally, there are meta, meta-meta, etc. versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality or reward, about ourselves and about other humans.
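For concreteness, here is a rough sketch of how such fragments might be recorded as data; the fields and examples are my own guesses at a convenient encoding, not a structure proposed in the post.

```python
# Hypothetical encoding of model fragments: who is judging, who is judged,
# what kind of judgement it is, and how many meta-levels are involved.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelFragment:
    judge: str                      # the human whose internal model produced the fragment
    subject: str                    # the agent being judged (may be the judge themselves)
    kind: str                       # e.g. "reward", "rationality", "action-comparison", "surprise"
    content: str                    # the fragmentary judgement itself
    meta_level: int = 0             # 0 = direct judgement; 1 = judging another agent's model of us; ...
    baseline: Optional[str] = None  # the implicit set of standard actions/outcomes, if any

fragments = [
    ModelFragment("Alice", "Alice", "reward", "I want chocolate now"),
    ModelFragment("Alice", "Alice", "rationality", "I'm irrational around them"),
    ModelFragment("Alice", "Bob", "surprise", "I didn't expect them to move to France"),
]
```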
Moved back to drafts, given that I am 70% confident that this is still a draft (or maybe it’s some kind of game where I am supposed to figure out what Alice wants based on the sentence fragments in this post, feel free to move it back in that case).
Ooops! Sorry, this is indeed a draft.
I’m thinking about this and the sequel post, and trying to understand why you might be interested in this, since it doesn’t feel to me like you spell it out.
It seems we might care about model fragments if we think we can’t build complete models of other agents/things but can instead build partial models. The “we” building these models might be literally us, but also an AI or a composite agent like humanity. Having a theory of what to do with these model fragments is useful, then, if we want to address at least two questions that we might be worried about around these parts: how do we decide an AI is safe based on our fragmentary models of it, and how does an AI model humanity based on its fragmentary models of humans?
I’m looking at how humans model each other based on their fragmentary models, and using this to get to their values.
Thinking a bit more, it seems a big problem we may face in using model fragments is that they are fragments and we will have to find a way to stitch them together so that they fill the gaps between the models, perhaps requiring something like model interpolation. Of course, maybe this isn’t necessary if we think of fragments as mostly overlapping (although probably inconsistent in the overlaps) or of new fragments to fill gaps as available on demand if we discover we need them and don’t have them.
For contradictions: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately
I suspect dealing adequately with contradictions will be significantly more complicated than you propose, but I haven’t written about that in depth yet. When I get around to addressing what I view as necessary in this area (practicing moral particularism that will be robust to false positives), I definitely look forward to talking with you more about it.
I agree with you to some extent. That post is mainly a placeholder that tells me that the contradictions problem is not intrinsically unsolvable, so I can put it aside while I concentrate on this problem for the moment.