Learning human preferences: black-box, white-box, and structured white-box access
This post is inspired by system identification; however, I’m not an expert in that domain, so any corrections or inspirations on that front are welcome.
I want to thank Rebecca Gorman for her idea of using system identification, and for the conversations developing the concept.
Knowing an agent
This is an agent:
Fig. 1

We want to know about its internal mechanisms, its software. But there are several things we could mean by that.
Black-box
First of all, we might be interested in knowing its input-output behaviour. I’ve called this its policy in previous posts: a full map that will allow us to predict its output in any circumstances:
Fig. 2

I’ll call this black-box knowledge of the agent’s internals.
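As a minimal sketch of what black-box access buys us (the agent below is an invented stand-in, not anything from this post): we can only query inputs and record outputs, building up a policy table.

```python
# Minimal sketch of black-box access: we can run the agent on inputs and
# record its outputs, but never look inside. `opaque_agent` is an invented
# stand-in for the real agent binary.

def opaque_agent(observation: str) -> str:
    # Hidden internals; from the outside we only ever see input -> output.
    return "flee" if observation == "predator" else "graze"

def black_box_knowledge(agent, observations):
    """Build a (partial) policy table by running the agent on each input."""
    return {obs: agent(obs) for obs in observations}

print(black_box_knowledge(opaque_agent, ["predator", "grass", "rock"]))
# {'predator': 'flee', 'grass': 'graze', 'rock': 'graze'}
```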
White-box
We might be interested in knowing more about what’s actually going on in the agent’s algorithm, not just the outputs. I’ll call this white-box knowledge; we would be interested in something like this (along with a detailed understanding of the internals of the various modules):
Fig. 3

Structured white-box
And, finally, we might be interested in knowing what the internal modules actually do, or actually mean. This is the semantics of the algorithm, resulting in something like this:
Fig. 4

The “beliefs”, “preferences”, and “action selector” are tags that explain what these modules are doing. The tags are part of the structure of the algorithm, which includes the arrows and setup.
If we know those, I’d call it structured white-box knowledge.
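To make the distinction concrete, here is a toy representation (the graph layout is my own guess at figure 4, not anything canonical): the white-box is a module graph with opaque names, and the structured white-box is the same graph plus tags.

```python
# Toy representation of white-box vs structured white-box knowledge.
# Assumed layout: observations feed beliefs and preferences, which both
# feed the action selector.

# White-box: modules and arrows, but the module names are opaque.
white_box = {
    "nodes": ["m1", "m2", "m3", "m4"],
    "edges": [("m1", "m2"), ("m1", "m3"), ("m2", "m4"), ("m3", "m4")],
}

# Structured white-box: the same graph, plus tags saying what each module is.
tags = {
    "m1": "observations",
    "m2": "beliefs",
    "m3": "preferences",
    "m4": "action selector",
}
```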
Levels of access
We can have different levels of access to the agent. For example, we might be able to run it inside any environment, but not pry it open; hence we know its full input-output behaviour. This would give us (full) black-box access to the agent (partial black-box access would be knowing some of its behaviour, but not in all situations).
Or we might be able to follow its internal structure. This gives us white-box access to the agent. Hence we know its algorithm.
Or, finally, we might have a full tagged and structured diagram of the whole agent. This gives us structured white-box access to the agent (the term is my own).
Things can be more complicated, of course. We could have access to only parts of the agent/structure/tags. Or we could have a mix of different types of access—grey-box seems to be the term for something between black-box and white-box.
Humans seem to have a mixture of black-box and structured white-box access to each other—we can observe each other’s behaviour, and we have our internal theory of mind that provides information like “if someone freezes up on a public speaking stage, they’re probably filled with fear”.
Access and knowledge
Complete access at one level gives complete knowledge at that level. So, if you have complete black-box access to the agent, you have complete black-box knowledge: you could, at least in principle, compute the entire input-output map just by running the agent.
So the interesting theoretical challenges are those that involve having access at one level and trying to infer a higher level, or having partial access at one or multiple levels and trying to infer full knowledge.
Multiple white boxes for a single black box
Black-box and white-box identification have been studied extensively in system identification. One fact remains true: there are multiple white-box interpretations of the same black-box access.
We can have the “angels pushing particles to resemble general relativity” situation. We can add useless epicycles, which do nothing, to the white-box model; this gives us a more complicated white-box with identical black-box behaviour. Or we could have the matrix mechanics vs wave mechanics situation in quantum mechanics, where two very different formulations were shown to be equivalent.
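A toy illustration of the epicycle point (the agents below are invented for the example): two implementations with very different internals but identical input-output behaviour, so no amount of black-box testing can tell them apart.

```python
# Two white-box models with the same black-box behaviour.

def agent_simple(x: int) -> int:
    """Direct implementation."""
    return 2 * x + 1

def agent_with_epicycles(x: int) -> int:
    """Same input-output map, padded with useless internal machinery."""
    epicycle = sum(i * 0 for i in range(1000))  # computed, then ignored
    doubled = x + x                             # a roundabout way of doubling
    return doubled + 1 + epicycle

# Black-box testing cannot distinguish them:
assert all(agent_simple(x) == agent_with_epicycles(x) for x in range(-100, 100))
```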
There are multiple ways of choosing among equivalent white-box models. In system identification, the criterion seems to be “go with what works”: the model is to be identified for a specific purpose (for example, to enable control of a system), and that purpose gives criteria that will select the right kind of model. For example, linear regression will work in many rough-and-ready circumstances, while it would be stupid to use it for calibrating sensitive particle detectors when much better models are available. Different problems have different trade-offs.
Another approach is the so-called “grey-box” approach, where a class of models is selected in advance, and this class is updated with the black-box data. Here the investigator is making “modelling assumptions” that cut down on the space of possible white-box models to consider.
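A rough sketch of the grey-box idea, with a linear model class chosen purely for illustration: the modelling assumption fixes the form of the model, and the black-box data only pins down its parameters.

```python
# Grey-box sketch: commit to a model class in advance (here y = a*x + b),
# then fit its parameters to black-box input-output data.

def fit_linear(data):
    """Ordinary least squares for y = a*x + b."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Invented black-box observations of some unknown system:
observations = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
a, b = fit_linear(observations)
print(f"grey-box model: y ≈ {a:.2f}*x + {b:.2f}")
```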
Finally, in this community and among some philosophers, algorithmic simplicity is seen as a good and principled way of deciding between equivalent white-box models.
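A crude sketch of that idea, reusing the two candidate agents defined in the epicycle example above: among white-box models with the same black-box behaviour, prefer the one with the shorter description (compressed source length is only a stand-in for algorithmic simplicity).

```python
import inspect
import zlib

def description_length(fn) -> int:
    """Very crude proxy for algorithmic simplicity: compressed source length."""
    return len(zlib.compress(inspect.getsource(fn).encode()))

# Among white-box models consistent with the black-box data, pick the simplest.
candidates = [agent_simple, agent_with_epicycles]  # defined in the sketch above
best = min(candidates, key=description_length)
print(best.__name__)  # agent_simple
```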
Multiple structures and tags for one white-box
A similar issue happens again at a higher level: there are multiple ways of assigning tags to the same white-box system. Take the model in figure 4, and erase all the tags (hence giving us figure 3). Now reassign those tags; there are multiple ways we could tag the modules, and still have the same structure as figure 4:
Fig. 5

We might object, at this point, that tags like “beliefs” and “preferences” should be assigned to modules for a reason, not just because the structure is correct. But having a good reason to assign those tags is precisely the challenge.
We’ll look more into that issue in future sections, but here I should point out that if we consider the tags as purely syntactic, then we can assign any tag to anything:
Fig. 6

What’s “Tuna”? Whatever we want it to be.
And since we haven’t defined the modules or said anything about their size and roles, we can decompose the interior of the modules and assign tags in completely different ways:
Fig. 7

Normative assumptions, tags, and structural assumptions
We need to do better than that. The paper “Occam’s razor is insufficient to infer the preferences of irrational agents” talked about “normative assumptions”: assumptions about the values (or the biases) of the agent.
In this more general setting, I’ll refer to them as “structural assumptions”, as they can refer to beliefs, or other features of the internal structure and tags of the agent.
Almost trivial structural assumptions
These structural assumptions can be almost trivial; for example, saying “beliefs and preferences update from knowledge, and update the action selector” is enough to rule out figures 6 and 7. This is equivalent to starting with figure 4, erasing the tags, and wanting to reassign tags to the algorithm while ensuring the graph is isomorphic to figure 4. Hence we have a “desired graph” that we want to fit our algorithm into.
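As a toy version of this “desired graph” constraint (using the same guessed module layout as before), we can count the tag assignments that reproduce the desired graph exactly:

```python
from itertools import permutations

# Untagged white-box structure (roughly figure 3): four modules and arrows.
modules = ["m1", "m2", "m3", "m4"]
edges = {("m1", "m2"), ("m1", "m3"), ("m2", "m4"), ("m3", "m4")}

# The "desired graph" of figure 4, written over the tags themselves.
tag_names = ["observations", "beliefs", "preferences", "action selector"]
desired_edges = {
    ("observations", "beliefs"),
    ("observations", "preferences"),
    ("beliefs", "action selector"),
    ("preferences", "action selector"),
}

# Count tag assignments whose relabelled graph matches the desired graph.
valid = []
for perm in permutations(tag_names):
    assignment = dict(zip(modules, perm))
    relabelled = {(assignment[a], assignment[b]) for a, b in edges}
    if relabelled == desired_edges:
        valid.append(assignment)

print(len(valid))  # 2: "beliefs" and "preferences" can be swapped (as in figure 5)
```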
What the Occam’s razor paper shows is that we can’t get good results from “desired graph + simplicity assumptions”. This is unlike the black-box to white-box transition, where simplicity assumptions are very effective on their own.
Figure 5 demonstrated that above: the beliefs and preferences modules can be tagged as each other, and we still get the same desired graph. Even worse, since we still haven’t specified anything about the size of these modules, the following tag assignment is also possible. Here, the belief and preference “modules” have been reduced to mere conduits that pass on the information to the action selector, which has expanded to gobble up the rest of the agent.
Fig. 8

Note that this decomposition is simpler than a “reasonable” version of figure 4, since the boundaries between the three modules don’t need to be specified. Hence algorithmic simplicity will tend to select these degenerate structures more often. Note that this is almost exactly the “indifferent planner” of the Occam’s razor paper, one of the three simple degenerate structures. The other two—the greedy and anti-greedy planners—are situations where the “Preferences” module has expanded to full size, with the action selector reduced to a small appendage.
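A toy version of the degenerate decomposition in figure 8 (the modules and behaviour are invented for illustration): “beliefs” and “preferences” shrink to pass-throughs, the action selector does all the work, and the black-box behaviour is unchanged.

```python
# Degenerate structured white-box that still fits the desired graph.

def beliefs(observation):
    return observation  # mere conduit: no real belief formation

def preferences(observation):
    return observation  # mere conduit: no real preferences

def action_selector(belief_signal, preference_signal):
    # The action selector has swallowed essentially the whole agent.
    return "flee" if belief_signal == "predator" else "graze"

def agent(observation):
    return action_selector(beliefs(observation), preferences(observation))

print(agent("predator"))  # 'flee': same black-box behaviour, degenerate tags
```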
Adding semantics or “thick” concepts
To avoid those problems, we need to flesh out the concepts of “beliefs”, “preferences[1]”, and so on. The more structural assumptions we put on these concepts, the more we can avoid degenerate structured white-box solutions[2].
So we want something closer to our understanding of preferences and beliefs. For example, preferences are supposed to change much more slowly than beliefs. So the impact of observations on the preference module—in an information-theoretic sense, maybe—would be much lower than on the beliefs module, or at least much slower. Adding that as a structural assumption cuts down on the number of possible structured white-box solutions.
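A sketch of how that structural assumption might be used (the update magnitudes below are invented numbers, and the averaging is a placeholder for a proper information-theoretic measure): tag the module whose state barely changes per observation as “preferences”, and the fast-changing one as “beliefs”.

```python
# Structural assumption: "preferences update much more slowly than beliefs".
# Given how much each unlabelled module's state changes per observation,
# tag the slow module as preferences and the fast one as beliefs.

update_magnitudes = {
    "m2": [0.80, 0.65, 0.91, 0.70],  # changes a lot on every observation
    "m3": [0.01, 0.00, 0.02, 0.01],  # almost static
}

def tag_by_update_speed(magnitudes):
    avg = {m: sum(v) / len(v) for m, v in magnitudes.items()}
    return {max(avg, key=avg.get): "beliefs",
            min(avg, key=avg.get): "preferences"}

print(tag_by_update_speed(update_magnitudes))
# {'m2': 'beliefs', 'm3': 'preferences'}
```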
And if we are dealing with humans, trying to figure out their preferences—which is my grand project at this time—then we can add a lot of other structural assumptions: “Situation X is one that updates preferences”; “this behaviour shows a bias”; “sudden updates in preferences are accompanied by large personal crises”; “red faces and shouting denote anger”; and so on.
Basically any judgement we can make about human preferences can be used to restrict the space of possible structured white-box solutions. But these judgements need to be added explicitly at some level, not just deduced from observations (i.e. supervised, not unsupervised learning), since observations can only get us as far as white-box knowledge.
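One way to picture this (the candidates and predicates are placeholders, not a real inference procedure): each explicitly added judgement acts as a filter on the space of candidate structured white-box interpretations.

```python
# Explicit structural assumptions acting as filters on candidate
# structured white-box interpretations (placeholder data).

candidates = [
    {"tags": {"m2": "beliefs", "m3": "preferences"}, "pref_update_rate": 0.01},
    {"tags": {"m2": "preferences", "m3": "beliefs"}, "pref_update_rate": 0.75},
]

structural_assumptions = [
    lambda c: c["pref_update_rate"] < 0.1,  # "preferences change slowly"
    # ...each further judgement ("red faces and shouting denote anger", etc.)
    # would be another explicit predicate here.
]

surviving = [c for c in candidates
             if all(check(c) for check in structural_assumptions)]
print(len(surviving))  # 1: only one candidate satisfies the assumptions
```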
Note the similarity with semantically thick concepts and with my own post on getting semantics empirically. Basically, we want an understanding of “preferences” that is so rich that only something that is clearly a “preference” can fit the model.
In the optimistic scenario, a few such structural assumptions are enough to enable an algorithm to quickly grasp human theory of mind, sort our brains into plausible modules, and hence isolate our preferences. In the pessimistic scenario, theory of mind, preferences, beliefs, and biases are all so twisted together that even extensive examples are not enough to decompose them. See more in this post.
[1] We might object to the arrow from observations to “preferences”: preferences are not supposed to change, at least for ideal agents. But many agents are far from ideal (including humans); we don’t want the whole method to fail because there was a stray bit of code or a neuron going in one direction, or because two modules reused the same code or the same memory space.
[2] Note that I don’t give a rigid distinction between syntax and semantics/meaning/“ground truth”. As we accumulate more and more syntactical restrictions, the number of plausible semantic structures plunges.
Comments

I’m not so sure about the “labeled white box” framing. It presupposes that the thing we care about (e.g. preferences) is part of the model. An alternative possibility is that the model has parameters a,b,c,d,… and there’s a function f with
preferences = f(a,b,c,d,...),
but the function f is not part of the algorithm, it’s only implemented by us onlookers. Right?
Then isn’t that just a model at another level, a (labelled) model in the heads of the onlookers?
Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it’s always an inference about what’s “really” going on. Of course, this is true even of the boundaries of black boxes, so it’s a fully general problem. And I think that suggests it’s not a problem except insofar as we have normal problems setting up correspondence between map and territory.
My understanding of the OP was that there is a robot, and the robot has source code, and “black box” means we don’t see the source code but get an impenetrable binary and can do tests of what its input-output behavior is, and “white box” means we get the source code and run it step-by-step in debugging mode but the names of variables, functions, modules, etc. are replaced by random strings. We can still see the structure of the code, like “module A calls module B”. And “labeled white box” means we get the source code along with well-chosen names of variables, functions, etc.
Then my question was: what if none of the variables, functions, etc. corresponds to “preferences”? What if “preferences” is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot’s programmer?
But now this conversation is suggesting that I’m not quite understanding it right. “Black box” is what I thought, but “white box” is any source code that produces the same input-output behaviour—not necessarily the robot’s actual source code—and that includes source code that does extra pointless calculations internally. And then my question doesn’t really make sense, because whatever “preferences” is, I can come up with a white-box model wherein “preferences” is calculated and then immediately deleted, such that it’s not part of the input-output behaviour.
Something like that?
That understanding is correct.
I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it’s also possible that another labelling would be clearer or more useful for our purposes. It might be a “natural” abstraction, once we’ve put some effort into defining what preferences “naturally” are.
What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).
Consider two versions of the same program. One makes use of a bunch of copy/pasted code. The other makes use of a nice set of re-usable abstractions. The second program will be shorter/simpler.
Boundaries between modules don’t cost very much, and modularization is super helpful for simplifying things.
The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent’s algorithm (that’s the “Occam’s razor” result).
Let’s say I’m trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.
The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.
Humans have a theory of mind that makes certain types of modularization easier. That doesn’t mean that the same modularization is simple for an agent that doesn’t share that theory of mind.
Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements are easy to deduce (there’s an informal equivalence result: if one of those is easy to deduce, all the others are).
So we need to figure out if we’re in the optimistic or the pessimistic scenario.