Model of human (ir)rationality
A putative new idea for AI control; index here.
This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (as it’s our definition that determines what counts as a bias and what counts as a preference in humans).
Humans, actions, and joint distributions
The human themselves is simply modelled as their brain (thus various human sense organs can be observed by the AI rather than being part of the description).
Let R be the set of possible reward functions the human may be maximising. Let Hπ be the set of policies the human may be following. We’ll assume that Hπ is closed under the taking of mixed strategies.
The AI has a joint probability distribution P over R, Hπ and events in the world. By conditioning on any element r∈R, P defines a map μ from R to probability distributions over Hπ. Since Hπ is closed under the taking of mixed strategies, this means that μ can be seen as a map from R to Hπ.
The map μ and the marginal distribution PR (P restricted to R) define P entirely. Note that μ is what relates human actions to their explanation in terms of the reward R.
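As a toy illustration (the finite discretisation and all names below are my own assumptions, not notation from the post), P, PR and μ can be represented like this:

```python
# Toy sketch: a joint distribution P over a finite set of reward functions R
# and policies H_pi, from which the marginal P_R and the map mu are derived.
# All names here are illustrative assumptions, not notation from the post.
import numpy as np

rewards = ["r0", "r1"]            # elements of R
policies = ["pi0", "pi1", "pi2"]  # elements of H_pi

# P[i, j] = joint probability of (reward i, policy j)
P = np.array([[0.30, 0.10, 0.10],
              [0.05, 0.25, 0.20]])

P_R = P.sum(axis=1)               # marginal distribution over R

def mu(i):
    """Conditional distribution over policies given reward r_i.
    Since H_pi is closed under mixed strategies, this mixture is itself
    an element of H_pi, so mu can be read as a map R -> H_pi."""
    return P[i] / P_R[i]

print(P_R)    # [0.5 0.5]
print(mu(0))  # [0.6 0.2 0.2]
```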
Basic properties of P
Here are a few properties P could have:
#. The distribution P is historical if PR is independent of any action the AI takes.
#. An AI’s action a overwrites the reward if μ is constant, conditional on a, while PR is still “broad” (“broad” is not fully defined, but PR is certainly broad enough if it assigns non-zero probability to both an r and −r).
#. The distribution P is Q-rational if there exists a prior distribution Q over the universe such that μ maps r∈R to the optimal policy for an r-maximising agent with prior Q.
It’s clear that if P is historical, the AI will treat the human’s reward function as something it has to discover, and can’t influence. An action a that overwrites the reward means that the human’s policy is fixed by action a, independently of whatever reward it might have. This is bad because a) the human’s actions are no longer informative to the AI about their reward, and b) the human’s actions are likely suboptimal with respect to that reward.
Note that stratification can be seen as taking a non-historical distribution, and making it historical via counterfactual.
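For illustration (the helper names and the idea of indexing joint tables by AI action are my assumptions, not the post’s), the first two properties could be checked on a toy representation like the one sketched earlier:

```python
import numpy as np

def is_historical(P_given_action):
    """P is historical if the marginal P_R over reward functions is the same
    whatever action the AI takes. `P_given_action` maps each AI action to a
    joint (reward x policy) table."""
    marginals = [table.sum(axis=1) for table in P_given_action.values()]
    return all(np.allclose(marginals[0], m) for m in marginals[1:])

def overwrites_reward(P_a):
    """The action a overwrites the reward if mu is constant conditional on a:
    every reward with non-zero probability induces the same policy mixture."""
    P_R = P_a.sum(axis=1)
    mixtures = [P_a[i] / P_R[i] for i in range(len(P_R)) if P_R[i] > 0]
    return all(np.allclose(mixtures[0], m) for m in mixtures[1:])
```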
Advanced properties of P
These basic properties can define a basic model of a human. But humans have far more biases and irrationalities. Though these are many and complicated, we’ll focus here on a few general properties that can capture a lot of these irrationalities in relatively “natural” ways.
By “natural”, we mean human-understandable properties that encode biases in ways that are not too complicated and are close to how we understand them.
Bounded rationality: selective updates
Humans are not perfect logical reasoners who fully and immediately know all the infinite implications of any statement. Now, modelling bounded rationality or logical uncertainty is going to be tricky, but we can for the moment simply assume that humans only partially update their probabilities when new data comes in.
Specifically, there is a function f which maps an observation o and previous history h<o to the set of statements that get updated.
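A minimal sketch of this, where the particular choice of f (keyword matching against the observation) and the likelihood functions are purely illustrative assumptions:

```python
def f(observation, history, beliefs):
    """Hypothetical choice of f: only statements that share a word with the
    observation get updated; all other implications are never drawn."""
    return {s for s in beliefs if any(word in s for word in observation.split())}

def selective_update(beliefs, observation, history, lik_true, lik_false):
    """Bayesian update applied only to the statements picked out by f;
    every other belief keeps its old probability."""
    touched = f(observation, history, beliefs)
    updated = dict(beliefs)
    for s in touched:
        p = beliefs[s]
        numerator = lik_true(observation, s) * p
        denominator = numerator + lik_false(observation, s) * (1 - p)
        updated[s] = numerator / denominator
    return updated
```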
Partial updates
Humans don’t tend to update their beliefs in a fully Bayesian fashion. Thus define an update function u which is used instead of Bayesian updates. If we update the odds ratio of event A according to evidence E, the correct Bayesian update is

$$\frac{P(A|E)}{P(\neg A|E)} = \frac{P(E|A)}{P(E|\neg A)}\cdot\frac{P(A)}{P(\neg A)}.$$

Then u could be a function of one variable:

$$\frac{P(A|E)}{P(\neg A|E)} = u\!\left(\frac{P(E|A)}{P(E|\neg A)}\right)\cdot\frac{P(A)}{P(\neg A)}.$$

Or of two:

$$\frac{P(A|E)}{P(\neg A|E)} = u\!\left(\frac{P(E|A)}{P(E|\neg A)}, \frac{P(A)}{P(\neg A)}\right)\cdot\frac{P(A)}{P(\neg A)}.$$
Or u could also be a function of the observation o and prior history h<o.
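As a sketch, here is one possible u of one variable; the damping exponent is my own illustrative choice, not something proposed in the post:

```python
def bayes_update(prior_odds, likelihood_ratio):
    """Fully Bayesian update of the odds of A given evidence E."""
    return likelihood_ratio * prior_odds

def u(likelihood_ratio, alpha=0.5):
    """Illustrative partial-update function: damps the likelihood ratio by
    taking it to the power alpha. alpha = 1 recovers the Bayesian update,
    alpha = 0 ignores the evidence entirely."""
    return likelihood_ratio ** alpha

def partial_update(prior_odds, likelihood_ratio):
    return u(likelihood_ratio) * prior_odds

# Prior odds 1:1, evidence nine times likelier under A than under not-A:
print(bayes_update(1.0, 9.0))    # 9.0 -- full Bayesian update
print(partial_update(1.0, 9.0))  # 3.0 -- under-reacts to the evidence
```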
Now, humans update some probabilities better than others, but we’ll defer that to when we talk of multi-agent models.
Bounded rationality: inconceivable actions
Humans don’t fully explore the space of possible actions and policies, preferring to stick to those that are the most easily accessible. So most actions are literally inconceivable to us.
This can be modelled by a function g which maps o and/or h<o to a set of possible actions. This map will determine a subset G⊂Hπ of possible human policies (these are the policies that only ever take actions compatible with g).
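A minimal sketch, where this particular g (only previously seen actions are conceivable) is purely an illustrative assumption:

```python
def g(observation, history):
    """Hypothetical accessibility function: the only conceivable actions are
    those the human has already seen taken somewhere in their history,
    plus a default 'wait' action."""
    return {action for (_, action) in history} | {"wait"}

def in_G(policy, test_histories):
    """Membership check for G ⊂ H_pi: on every history we test, the policy
    only takes actions that g makes available."""
    return all(policy(obs, hist) in g(obs, hist)
               for (obs, hist) in test_histories)
```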
Multi-agent models
Multi-agent models are useful for modelling the various contradictions in the human psyche (system 1 vs system 2, conscious vs subconscious, short-term vs long-term preferences, etc.).
This can be modelled by seeing the human as consisting of different agents A0, A1, … An, each of them with their own possible biases as described above (though they must share a common function g). There is a function L which, taking o and possibly h<o into account, weights the verdicts of the different subagents and outputs the ultimate action.
These multiple agents may be optimising for different reward functions, so μ can break down into multiple μi, one for each agent.
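A sketch of one possible L; the particular aggregation rule (stochastic choice weighted by sub-agent) is my assumption:

```python
import random

def L(observation, history, proposals, weights):
    """Illustrative aggregator: picks one sub-agent's proposed action with
    probability proportional to that sub-agent's weight in this context
    (e.g. system 1 dominating under time pressure, system 2 when deliberating).
    `proposals` and `weights` are dicts keyed by sub-agent name."""
    names = list(proposals)
    return random.choices([proposals[n] for n in names],
                          weights=[weights[n] for n in names])[0]
```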
Recursion and introspection
This is the most complex situation here. Humans have explicit meta-preferences (“I don’t want to be racist”, “I want to be rational”, “I want to be right”, etc.) that influence how they update their beliefs. Generally these meta-preferences view the human as an integrated whole.
We can try and model this by positing meta-preferences M which are desirable properties of the agent as a whole, and an introspection function I which will trigger occasionally, and map some feature of the agent closer to M. This is deliberately very vague, and I’ll try and flesh it out and formalise it as needed.
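A very rough sketch of one way this could look; the encoding of M as target feature values and the nudging rule are entirely my own assumptions:

```python
import random

def I(agent_state, M, trigger_prob=0.05):
    """Illustrative introspection step: fires occasionally and nudges one
    feature of the whole agent part of the way towards the meta-preference M
    (here encoded as target values for named features)."""
    if random.random() < trigger_prob:
        feature = random.choice(list(M))
        current, target = agent_state[feature], M[feature]
        agent_state[feature] = current + 0.1 * (target - current)
    return agent_state
```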
The modelled human
Thus, via μ, we can define what it means for P to represent an irrational human:
The prior P is ({Ai,fi,ui,ri},g,L)-consistent if the human’s actions are chosen by applying the agent weighting function L to μi(ri), where μi maps any r to the action chosen by an agent Ai with reward r, bounded rationality fi, partial update ui, and available actions g.
The prior P is ({Ai,fi,ui,ri},g,L,M,I)-consistent if it is as above, plus the action of the introspection function I on the meta-preferences M.
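Putting the pieces together, here is a sketch of how such a consistent P would generate the human’s actions, using a g and an L like those sketched above; every interface here is my own assumption:

```python
def human_action(observation, history, agents, g, L):
    """Sketch of ({A_i, f_i, u_i, r_i}, g, L)-consistency: each sub-agent A_i
    (with its own reward r_i, selective-update rule f_i and partial-update
    rule u_i folded into its `choose` method) proposes an action from the
    g-accessible set, and the weighting function L turns those verdicts into
    the human's actual action."""
    available = g(observation, history)
    proposals = {a.name: a.choose(observation, history, available)   # ~ mu_i(r_i)
                 for a in agents}
    weights = {a.name: a.weight(observation, history) for a in agents}
    return L(observation, history, proposals, weights)
```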