Right. I’m not saying that there aren’t things about the AI that make it behave the way it does; what the AI optimizes for is a deterministic result of its properties plus environment. I’m just saying that something about the environment might be necessary for it to have the sorts of preferences we can most usefully model it as having; and/or there may be multiple equally good candidates for the parts of the AI that are its values, or their encoding. If we reify preferences in an uncautious way, we’ll start thinking of the AI’s ‘desires’ too much as its first-person-experienced urges, as opposed to just thinking of them as the effect the local system we’re talking about tends to have on the global system.
So, all right. Consider two systems, S1 and S2, both of which happen to be constructed in such a way that right now, they are maximizing the number of things in their environment that appear blue to human observers, by going around painting everything blue.
Suppose we add to the global system a button that alters all human brains so that everything appears blue to us, and we find that S1 presses the button and stops painting, and S2 ignores the button and goes on painting.
Suppose that similarly, across a wide range of global system changes, we find that S1 consistently chooses the action that maximizes the number of things in its environment that appear blue to human observers, while S2 consistently goes on painting.
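(To make the contrast concrete, here’s a minimal sketch of the two decision rules as I’m imagining them; the toy world, the action set, and all the names below are made up for this comment, not a claim about how such systems would actually be implemented.)

```python
# Illustrative toy model only. S1 picks whichever available action leads to the
# most blue-appearing objects; S2 just paints, whatever that does to the outcome.

def count_blue_appearing(world):
    """How many objects currently appear blue to the (simulated) human observers."""
    if world["humans_see_everything_as_blue"]:
        return world["num_objects"]
    return world["num_painted_blue"]

def paint_one_object(world):
    world["num_painted_blue"] = min(world["num_objects"], world["num_painted_blue"] + 1)
    return world

def press_button(world):
    world["humans_see_everything_as_blue"] = True
    return world

def s1_choose_action(world, actions):
    # Outcome-directed: evaluate each action on a copy of the world, keep the best.
    return max(actions, key=lambda act: count_blue_appearing(act(dict(world))))

def s2_choose_action(world, actions):
    # Fixed policy: paint, regardless of consequences.
    return paint_one_object

world = {"num_objects": 100, "num_painted_blue": 3, "humans_see_everything_as_blue": False}
actions = [paint_one_object, press_button]
print(s1_choose_action(world, actions).__name__)  # press_button
print(s2_choose_action(world, actions).__name__)  # paint_one_object
```

Before the button exists, both rules produce the same painting behavior; the button is exactly the sort of environmental change that pulls them apart.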
I agree with you that if I reify S2′s preferences in an uncautious way, I might start thinking of S2 as “wanting to paint things blue” or “wanting everything to be blue” or “enjoying painting things blue” or as having various other similar internal states that might simply not exist, and that I do better to say it has a particular effect on the global system. S2 simply paints things blue; whether it has the goal of painting things blue or not, I have no idea.
I am far less comfortable saying that S1 has no goals, precisely because of how flexibly and consistently it revises its actions so as to bring about the same state-change across a wide range of environments. To use Dennett’s terminology, I am more willing to adopt an intentional stance with respect to S1 than S2.
If I’ve understood your position correctly, you’re saying that I’m unjustified in making that distinction… that to the extent that we can say that S1 and S2 have “goals,” the word “goals” simply refers to the state changes they create in the world. Initially they both have the goal of painting things blue, but S1′s goals keep changing: first it paints things blue, then it presses a button, then it does other things. And, sure, I can make up some story like “S1 maximizes the number of things in its environment that appear blue to human observers, while S2 just paints stuff blue” and that story might even have predictive power, but I ought not fall into the trap of reifying some actual thing that corresponds to those notional “goals”.
Am I in the right ballpark?
I think you’re switching back and forth between a Rational Choice Theory ‘preference’ and an Ideal Self Theory ‘preference’. To disambiguate, I’ll call the former R-preferences and the latter I-preferences. My R-preferences—the preferences you’d infer I had from my behaviors if you treated me as a rational agent—are extremely convoluted, indeed they need to be strongly time-indexed to maintain consistency. My I-preferences are the things I experience a desire for, whether or not that desire impacts my behavior. (Or they’re the things I would, with sufficient reflective insight and understanding into my situation, experience a desire for.)
We have no direct evidence from your story addressing whether S1 or S2 have I-preferences at all. Are they sentient? Do they create models of their own cognitive states? Perhaps we have a little more evidence that S1 has I-preferences than that S2 does, but only by assuming that a system whose goals require more intelligence or theory-of-mind will have a phenomenology more similar to a human’s. I wouldn’t be surprised if that assumption turns out to break down in some important ways, as we explore more of mind-space.
But my main point was that it doesn’t much matter what S1 or S2′s I-preferences are, if all we’re concerned about is what effect they’ll have on their environment. Then we should think about their R-preferences, and bracket exactly what psychological mechanism is resulting in their behavior, and how that psychological mechanism relates to itself.
I’ve said that R-preferences are theoretical constructs that happen to be useful a lot of the time for modeling complex behavior; I’m not sure whether I-preferences are closer to nature’s joints.
Initially they both have the goal of painting things blue, but S1′s goals keep changing: first it paints things blue, then it presses a button, then it does other things.
S1′s instrumental goals may keep changing, because its circumstances are changing. But I don’t think its terminal goals are changing. The only reason to model it as having two completely incommensurate goal sets at different times would be if there were no simple terminal goal that could explain the change in instrumental behavior.
I don’t think I’m switching back and forth between I-preferences and R-preferences.
I don’t think I’m talking about I-preferences at all, nor that I ever have been.
I completely agree with you that they don’t matter for our purposes here, so if I am talking about them, I am very very confused. (Which is certainly possible.)
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
If S is a competent optimizer (or “rational agent,” if you prefer) with R-preferences (preferences, goals, etc.) P, the existence of P will cause S to behave in ways that produce isomorphic effects (E) on the global system, so we can use observations of E as evidence of P (positing that S is a competent optimizer), or as evidence that S is a competent optimizer (positing the existence of P), or a little of both.
But however we slice it, P is not the same thing as E; E is merely evidence of P’s existence. We can infer P’s existence in other ways as well, even if we never observe E… indeed, even if E never gets produced. And the presence or absence of a given P in S is something we can be mistaken about; there’s a fact of the matter.
I think you disagree with the above paragraph, because you describe R-preferences (preferences, goals, etc.) as theoretical constructs rather than parts of the system, which suggests that there is no fact of the matter… a different theoretical approach might never include P, and it would not be mistaken, it would just be a different theoretical approach.
I also think you disagree because, way back at the beginning of this exchange, when I suggested “paint everything red AND paint everything blue” as an example of an incoherent goal (R-preference, preference, P), your reply was that it wasn’t a goal at all, since that state can’t actually exist in the world. Which suggests that you don’t see goals as internal states of optimizers, and that you do equate P with E.
This is what I’ve been disputing from the beginning.
But to be honest, I’m not sure whether you disagree or not, as I’m not sure we have yet succeeded in actually engaging with one another’s ideas in this exchange.
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
You can treat earthquakes and thunderstorms and even individual particles as having ‘preferences’. It’s just not very useful to do so, because we can give an equally simple explanation for what effects things like earthquakes tend to have that is more transparent about the physical mechanism at work. The intentional strategy is a heuristic for black-boxing physical processes that are too complicated to usefully describe in their physical dynamics, but that can be discussed in terms of the complicated outcomes they tend to promote.
(I’d frame it: We’re exploiting the fact that humans are intuitively dualistic by taking the non-physical modeling device of humans (theory of mind, etc.) and appropriating this mental language and concept-web for all sorts of systems whose nuts and bolts we want to bracket. Slightly regimented mental concepts and terms are useful, not because they apply to all the systems we’re talking about in the same way they were originally applied to humans, but because they’re vague in ways that map onto the things we’re uncertain about or indifferent to.)
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects, yet when we can predict that, whatever the mechanism happens to be, it is the sort of mechanism that has those particular complex effects.
Thus we speak of evolution as an optimization process, as though it had a ‘preference ordering’ in the intuitively human (i.e., I-preference) sense, even though in the phenomenological sense it’s just as mindless as an earthquake. We do this because black-boxing the physical mechanisms and just focusing on the likely outcomes is often predictively useful here, and because the outcomes are complicated and specific. This is useful for AIs because we care about the AI’s consequences and not its subjectivity (hence we focused on R-preferences), and because AIs are optimization processes whose mechanisms and outcomes are even more complex and specific than evolution’s (hence we adopted the intentional stance of ‘preference’-talk in the first place).
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
I agree this is often the case, because when we define ‘what is this system capable of?’ we often hold the system fixed while examining possible worlds where the environment varies in all kinds of ways. But if the possible worlds we care about all have a certain environmental feature in common—say, because we know in reality that the environmental condition obtains, and we’re trying to figure out all the ways the AI might in fact behave given different values for the variables we don’t know about with confidence—then we may, in effect, include something about the environment ‘in the AI’ for the purposes of assessing its optimization power and/or preference ordering.
For instance, we might model the AI as having the preference ‘surround the Sun with a Dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a Dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun. Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting, so I don’t want us to be too committed to reifying AI preferences. They’re just a useful shorthand for the expected outcomes of the AI’s distinguishing features having a larger and more direct causal impact on things.
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects
Yes, agreed, for some fuzzy notion of “easily grasp” and “too complicated.” That is, there’s a sense in which thunderstorms are too complicated for me to describe in mechanistic terms why they’re having the effects they have… I certainly can’t predict those effects. But there’s also a sense in which I can describe (and even predict) the effects of a thunderstorm that feels simple, whereas I can’t do the same thing for a human being without invoking “want-speak”/intentional stance.
I’m not sure any of this is *justified*, but I agree that it is what we do… this is how we speak, and we draw these distinctions. So far, so good.
if the possible worlds we care about all have a certain environmental feature in common [..] we may, in effect, include something about the environment ‘in the AI’
I’m not really sure what you mean by “in the AI” here, but I guess I agree that the boundary between an agent and its environment is always a fuzzy one. So, OK, I suppose we can include things about the environment “in the AI” if we choose. (I can similarly choose to include things about the environment “in myself.”) So far, so good.
we might model the AI as having the preference ‘surround the Sun with a Dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a Dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun.
Here is where you lose me again… once again you talk as though there’s simply no fact of the matter as to which preference the AI has, merely our choice as to how we model it.
But it seems to me that there are observations I can make which would provide evidence one way or the other. For example, if it has the preference ‘surround the Sun with a Dyson sphere,’ then in an environment lacking the Sun I would expect it to first seek to create the Sun… how else can it implement its preferences? Whereas if it has the preference ‘conditioned on there being a Sun, surround it with a Dyson sphere,’ then in an environment lacking the Sun I would not expect it to create the Sun.
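(Schematically, and purely as an illustration of the kind of test I have in mind; the two toy policies below are made up for this comment.)

```python
# Two hypothetical preference encodings that prescribe the same behavior whenever
# a sun exists, but come apart in a sunless environment.

def unconditional_policy(world):
    # 'Surround the Sun with a Dyson sphere' as an unconditional end:
    # with no sun, the next step toward the goal is to make one.
    if not world["sun_exists"]:
        return "create a sun"
    return "build Dyson sphere"

def conditional_policy(world):
    # 'Conditioned on there being a Sun, surround it with a Dyson sphere':
    # with no sun, the condition fails and nothing is prescribed.
    if not world["sun_exists"]:
        return "do nothing"
    return "build Dyson sphere"

sunless = {"sun_exists": False}
print(unconditional_policy(sunless))  # create a sun
print(conditional_policy(sunless))    # do nothing
```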
So does the AI seek to create the Sun in such an environment, or not? Surely that doesn’t depend on how I choose to model it. The AI’s preference is whatever it is, and controls its behavior. Of course, as you say, if the real world always includes a sun, then I might not be able to tell which preference the AI has. (Then again, I might… the test I describe above isn’t the only test I can perform, just the first one I thought of, and other tests might not depend on the Sun’s absence.)
But whether I can tell or not doesn’t affect whether the AI has the preference or not.
if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun
Again, no. Regardless of how we model it, the system’s preference is what it is, and we can study the system (e.g., see whether it creates the Sun) to develop more accurate models of its preferences.
Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting
I agree. But I do think the question of what the AI (or, more generally, an optimizing agent) will do in various situations is interesting, and it seems to me that you’re consistently eliding that question in ways I find puzzling.