The starting point is that future states can be better or worse from the perspective of people (or any other evolved creature). Maybe it’s not totally ordered, but a future where I’m getting tortured and everyone hates me is definitely worse than a future where I’m feeling great and everyone loves me.
This is important because I think coherence (the way people use the term) only makes sense when there are preferences about future states—not preferences about trajectories, or preferences about actions, or preferences about decisions. Like, maybe I think it’s a fun game to make the waiter keep switching my pizza for hours straight, well worth the few dollars that I lose. Or maybe I think it’s always deontologically proper to give the answer “Yes I’ll switch pizzas” when a waiter asks me if I want to switch pizzas, regardless of the exact pizzas that they’re asking me about.
OK, so far we have people with preferences about future states (possibly among other preferences). Now those people make AIs. Presumably they’ll correspondingly build and train those AIs to actualize those preferences about future states.
So from a certain perspective, we already have our answer:
We shouldn’t expect an AI to wind up concluding that all future states are equally good, because the designers don’t want that to happen, and they’ll presumably design the AI accordingly.
But let’s drill down a bit and ask how. The answer of course depends on the AI’s algorithms. Let’s go with what happens in the human brain (at least, how I think the human brain works), as a possible architecture of a general intelligence.
It’s not computationally feasible to conceptualize the whole world in one chunk. Instead our understanding of the world is built up from lots of little compositional pieces—little predictive models, but (like Logical Induction) the models aren’t predicting every aspect of the world simultaneously; they pattern-match some aspect of what’s going on (in either sensory inputs or thoughts), then activate, then make one or more narrow predictions about some aspect of what’s going to happen next. Like there’s a pattern “the ball is falling and it’s going to hit the floor and then bounce back up”, and this pattern is agnostic about the color of the ball, how far away it is, how hungry I am, etc.
Then our plans are also built up from these little compositional pieces, and (to a first approximation) we come to have preferences about those pieces: some pieces are good (like “I’m impressing my friends”) and some pieces are bad (like “this is painful”). But (again, kinda like Logical Induction), there’s no iron law enforcing global consistency of these preferences a priori. Instead there are processes that generally tend to drive preferences towards consistency. What are these processes?
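To make that concrete, here’s a minimal toy sketch in Python (my own illustration; the particular pieces, valences, and situation keys are made up for the example) of “lots of little pieces, each with its own valence, and no global consistency check”:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "piece" pattern-matches one narrow aspect of a situation (ignoring
# everything else) and carries its own learned valence. Nothing here
# forces the valences of different pieces to add up to one globally
# consistent ranking over whole future states.

@dataclass
class Piece:
    name: str
    matches: Callable[[Dict], bool]   # fires if this aspect is present
    valence: float                    # learned "goodness" of this aspect

pieces = [
    Piece("impressing my friends", lambda s: bool(s.get("friends_impressed")), +1.0),
    Piece("this is painful",       lambda s: bool(s.get("pain")),              -2.0),
    Piece("extra free stuff",      lambda s: s.get("extra_items", 0) > 0,      +0.5),
]

def appraise(situation: Dict) -> float:
    """Sum the valences of whichever pieces happen to fire.

    Which pieces fire depends on what's salient in `situation`, so two
    descriptions of the same future can come out with different scores.
    """
    return sum(p.valence for p in pieces if p.matches(situation))

print(appraise({"friends_impressed": True, "pain": True}))  # 1.0 - 2.0 = -1.0
```

The point is just that the appraisal depends on which pieces happen to fire, not on any single global ranking of futures.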
Let’s go through an example: a particular plausible human circular preference (based loosely on an example in Thinking, Fast and Slow, chapter 15). You won a prize! Your three options are:
(A) 5 lovely plates
(B) 5 lovely plates and 10 ugly plates
(C) 5 OK plates
No one has done this exact experiment to my knowledge, but plausibly (based on the book discussion) this is a circular preference in at least many people: When people see just A & B, they’ll pick B because “it’s more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever”. When they see just B & C, they’ll pick C because “the average quality is higher”. When they see just C & A, they’ll likewise pick A because “the average quality is higher”.
So what we have is two different preferences (“I want to have a prettier collection of stuff, not an uglier collection”, and “I want extra free plates”), and different comparisons / situations make different aspects salient.
Again, what’s happening is that it’s computationally intractable to hold “an entire future situation” in our mind; we need to attend to certain aspects of it and not others. So we’re naturally going to be prone to circular preferences by default.
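Here’s a toy sketch of how that plays out (again my own illustration in Python; the quality weights and the salience rule are made-up assumptions, not anything from the book): different pairwise comparisons make different aspects salient, and a circular preference falls right out.

```python
# Each option is described by a couple of attributes.
# Hypothetical numbers chosen only to make the example work.
options = {
    "A": {"lovely": 5, "ugly": 0,  "ok": 0},   # 5 lovely plates
    "B": {"lovely": 5, "ugly": 10, "ok": 0},   # 5 lovely + 10 ugly plates
    "C": {"lovely": 0, "ugly": 0,  "ok": 5},   # 5 OK plates
}

def count(o):            # "more stuff" aspect
    return sum(options[o].values())

def avg_quality(o):      # "average quality" aspect (lovely=3, ok=2, ugly=1)
    d = options[o]
    return (3 * d["lovely"] + 2 * d["ok"] + 1 * d["ugly"]) / count(o)

def choose(x, y):
    """Pick whichever aspect the comparison makes salient, then compare.

    Toy salience rule: if one option's plates are a superset of the
    other's, "more stuff" jumps out; otherwise "average quality" does.
    """
    if all(options[x][k] >= options[y][k] for k in options[x]) or \
       all(options[y][k] >= options[x][k] for k in options[x]):
        return x if count(x) > count(y) else y
    return x if avg_quality(x) > avg_quality(y) else y

print(choose("A", "B"))  # B  (more stuff)
print(choose("B", "C"))  # C  (higher average quality)
print(choose("C", "A"))  # A  (higher average quality)
```

Each pairwise choice looks locally sensible, but together they form a cycle.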
OK, then what happens if you actually try to set up the money pump? You give the person A, then offer to swap it for B for a $0.25 fee, then to swap B for C for another $0.25, and so on. I think the person would quickly catch on, because they’ll also do the comparison “three steps ago versus now”, and they’ll notice that (unless switching is inherently fun, as discussed above) they’re holding the same plates as before but with less money: no better off in any respect, and strictly worse off in one. And thus, they should stop going around in circles.
Basically, “other things equal, I prefer to have more money” (i.e. don’t get money-pumped) is also a preference about future states, and in fact it’s a “default” preference because of instrumental convergence.
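Continuing the toy sketch above, here’s one way the pump fizzles: the agent also tracks its money, compares where it was a few steps ago to where it is now, and stops trading once it notices it’s going in circles and getting poorer. This reuses choose from the plates example; the specific stopping rule is my own guess at how “three steps ago versus now” might cash out, not anything the post specifies.

```python
def run_money_pump(start="A", fee=0.25, max_trades=10):
    holding, money = start, 0.0
    history = [(holding, money)]
    cycle = {"A": "B", "B": "C", "C": "A"}   # each offer is locally attractive
    for _ in range(max_trades):
        offer = cycle[holding]
        if choose(holding, offer) != offer:
            break                            # pairwise preference says keep
        # Before paying, check: am I strictly worse off than 3 trades ago?
        if len(history) >= 3:
            old_holding, old_money = history[-3]
            if old_holding == offer and money - fee < old_money:
                print("Same plates as three trades ago, but poorer. Stop.")
                break
        holding, money = offer, money - fee
        history.append((holding, money))
    return holding, money

print(run_money_pump())   # stops after a couple of trades instead of cycling forever
```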
Or another way of looking at it is: you can (and naturally do) have a preference “Insofar as my other preferences are self-contradictory, I should try to reduce that aspect of them”, because this is roughly a Pareto-improving thing to do. All of my preferences about future states can be better-actualized simultaneously when I adopt the habit of “noticing when two of my preferences are working at cross-purposes, and when I recognize that happening, preventing them from doing so”. So you gradually build up a bunch of new habits that look for various types of situations that pattern-match to “I’m working at cross-purposes to myself”, and then execute a Pareto improvement—since these habits are by default positively reinforced. It’s loosely analogous to how markets become more self-consistent when a bunch of people are scouting out for arbitrage opportunities.
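And here’s a crude sketch of that last idea in the same toy setting: a meta-habit that scans for preference cycles among known options and, when it finds one, patches those comparisons to use a single consistent criterion. This is entirely my own illustration; the post doesn’t specify any particular repair rule, and picking “average quality” as the fallback criterion is an arbitrary choice for the example, not a claim about what people actually do or a rigorous Pareto improvement.

```python
from itertools import permutations

def find_cycle(opts):
    """Look for a triple where pairwise choices go around in a circle."""
    for x, y, z in permutations(opts, 3):
        if choose(x, y) == y and choose(y, z) == z and choose(z, x) == x:
            return (x, y, z)
    return None

cycle = find_cycle(list(options))
if cycle:
    print("Working at cross-purposes among", cycle)
    # Patch: for these options, judge by one consistent criterion
    # (here, average quality) so the cycle disappears.
    ranked = sorted(cycle, key=avg_quality, reverse=True)
    print("Patched ranking:", ranked)
```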