From Eliezer’s Belief in Intelligence:

“Since I am so uncertain of Kasparov’s moves, what is the empirical content of my belief that ‘Kasparov is a highly intelligent chess player’? What real-world experience does my belief tell me to anticipate? [...]
“The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a ‘better player’ is reflected in the amount of probability mass I concentrate into the ‘Kasparov wins’ class of outcomes, versus the ‘drawn game’ and ‘Mr. G wins’ class of outcomes.”
From Measuring Optimization Power:

“When I think you’re a powerful intelligence, and I think I know something about your preferences, then I’ll predict that you’ll steer reality into regions that are higher in your preference ordering. [...]
“Ah, but how do you know a mind’s preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string—how do you know this wasn’t because a mind wanted it to produce that string?
“This, in turn, is reminiscent of the Minimum Message Length formulation of Occam’s Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance.
“(3) Actually, you need to fit another criterion to take the intentional stance—there can’t be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one—a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like moving parts in a machine.
“(4) If you have a coin that always comes up heads, there’s no need to say “The coin always wants to come up heads” because you can just say “the coin always comes up heads”. Optimization will beat alternative mechanical explanations when our ability to perturb a system defeats our ability to predict its interim steps in detail, but not our ability to predict a narrow final outcome. (Again, note that this is an epistemic criterion.)
“(5) Suppose you believe a mind exists, but you don’t know its preferences? Then you use some of your evidence to infer the mind’s preference ordering, and then use the inferred preferences to infer the mind’s power, then use those two beliefs to testably predict future outcomes. The total gain in predictive accuracy should exceed the complexity-cost of supposing that ‘there’s a mind of unknown preferences around’, the initial hypothesis.”
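(A toy way to make the minimum-message-length criterion above concrete; the outcome counts, the 0.9/0.09/0.01 distribution, and the 50-bit model cost are numbers I’m inventing purely for illustration:)

```python
import math

# Toy MML comparison: does positing "a powerful agent who prefers winning"
# shorten the total message describing 100 observed game outcomes?
outcomes = ["win"] * 90 + ["draw"] * 9 + ["loss"] * 1

def code_length(observations, model):
    """Bits needed to encode the observations under a probability model."""
    return sum(-math.log2(model[o]) for o in observations)

# Null model: no intentional stance; the three outcome classes are equally likely.
uniform = {"win": 1/3, "draw": 1/3, "loss": 1/3}

# Intentional model: "this agent is powerful and prefers winning", i.e.
# probability mass concentrated on the 'win' class of outcomes.
intentional = {"win": 0.9, "draw": 0.09, "loss": 0.01}
model_cost_bits = 50  # rough charge for stating the agent's power and preferences

print(f"without the stance: {code_length(outcomes, uniform):.1f} bits")
print(f"with the stance:    {model_cost_bits + code_length(outcomes, intentional):.1f} bits")
# ~158.5 bits vs ~101.6 bits: the intentional description pays for itself here.
```

If the observed outcomes had looked uniform, or had an even simpler mechanical description (like the always-heads coin in point 4), the 50 bits of mind-talk would just be overhead, which is exactly when this criterion says not to take the intentional stance.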
Notice that throughout this discussion, what matters is the mind’s effect on its environment, not any internal experience of the mind. Unconscious preferences are just as relevant to this method as are conscious preferences, and both are examples of the intentional stance. Note also that you can’t really measure the rationality of a system you’re modeling in this way; any evidence you raise for ‘irrationality’ could just as easily be used as evidence that the system has more complicated preferences than you initially thought, or that they’re encoded in a more distributed way than you had previously hypothesized.
My take-away from this is that there are two ways we generally think about minds on LessWrong: Rational Choice Theory, on which all minds are equally rational and strange or irregular behaviors are seen as evidence of strange preferences; and what we might call the Ideal Self Theory, on which minds’ revealed preferences can differ from their ‘true self’ preferences, resulting in irrationality. One way of unpacking my idealized values is that they’re the rational-choice-theory preferences I would exhibit if my conscious desires exhibited perfect control over my consciously controllable behavior, and those desires were the desires my ideal self would reflectively prefer, where my ideal self is the best trade-off between preserving my current psychology and enhancing that psychology’s understanding of itself and its environment.
We care about ideal selves when we think about humans, because we value our conscious, ‘felt’ desires (especially when they are stable under reflection) more than our unconscious dispositions. So we want to bring our actual behavior (and thus our rational-choice-theory preferences, the ‘preferences’ we talk about when we speak of an AI) more in line with our phenomenological longings and their idealized enhancements. But since we don’t care about making non-person AIs more self-actualized, but just care about how they tend to guide their environment, we generally just assume that they’re rational. Thus if an AI behaves in a crazy way (e.g., alternating between destroying and creating paperclips depending on what day of the week it is), it’s not because it’s a sane rational ghost trapped by crazy constraints. It’s because the AI has crazy core preferences.
Where did “models its environment” come from?
If we’re talking about the things S optimizes its environment for, not the things S “has in mind”, then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.
Yes, in principle. But in practice, a system that doesn’t have internal states that track the world around it in a reliable and useable way won’t be able to optimize very well for anything particularly unlikely across a diverse set of environments. In other words, it won’t be very intelligent. To clarify, this is an empirical claim I’m making about what it takes to be particularly intelligent in our universe; it’s not part of the definition for ‘intelligent’.
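(A crude toy example of that empirical claim, with every particular invented by me; ‘tracking the world’ here is nothing fancier than actually using the observation:)

```python
import random

# Each environment hides a different 3-digit combination; the 'unlikely' target
# outcome is outputting that combination. An agent whose internal state tracks
# the environment hits it every time; an agent with no such state almost never does.

def hit_rate(agent, trials=10_000):
    hits = 0
    for _ in range(trials):
        combination = random.randrange(1000)   # the environment varies each trial
        hits += (agent(combination) == combination)
    return hits / trials

def tracking_agent(observation):
    internal_state = observation               # internal state reliably mirrors the world
    return internal_state

def blind_agent(observation):
    return 42                                  # fixed output, ignores the world

print("tracking agent:", hit_rate(tracking_agent))  # ~1.0
print("blind agent:   ", hit_rate(blind_agent))     # ~0.001
```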
a system that doesn’t have internal states that track the world around it in a reliable and useable way won’t be able to optimize very well for anything particularly unlikely across a diverse set of environments
Yes, that seems plausible.
I would say rather that modeling one’s environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing.
But… so what?
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal… but that doesn’t stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
So why isn’t modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display.
It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer to that question, but after reading it several times I can’t see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I’m not equipped for.
Maybe it will help to keep in mind that this is one small branch of my conversation with Alexander Kruel. Alexander’s two main objections to funding Friendly Artificial Intelligence research are that (1) advanced intelligence is very complicated and difficult to make, and (2) getting a thing to pursue a determinate goal at all is extraordinarily difficult. So a superintelligence will never be invented, or at least not for the foreseeable future; so we shouldn’t think about SI-related existential risks. (This is my steel-manning of his view. The way he actually argues seems to instead be predicated on inventing SI being tied to perfecting Friendliness Theory, but I haven’t heard a consistent argument for why that should be so.)
Both of these views, I believe, are predicated on a misunderstanding of how simple and disjunctive ‘intelligence’ and ‘goal’ are, for present purposes. So I’ve mainly been working on tabooing and demystifying those concepts. Intelligence is simply a disposition to efficiently convert a wide variety of circumstances into some set of specific complex events. Goals are simply the circumstances that occur more often when a given intelligence is around. These are both very general and disjunctive ideas, in stark contrast to Friendliness; so it will be difficult to argue that a superintelligence simply can’t be made, and difficult too to argue that optimizing for intelligence requires one to have a good grasp on Friendliness Theory.
Because I’m trying to taboo the idea of superintelligence, and explain what it is about seed AI that will allow it to start recursively improving its own intelligence, I’ve been talking a lot about the important role modeling plays in high-level intelligent processes. Recognizing what a simple idea modeling is, and how far it gets one toward superintelligence once one has domain-general modeling proficiency, helps a great deal with greasing the intuition pump ‘Explosive AGI is a simple, disjunctive event, a low-hanging fruit, relative to Friendliness.’ Demystification by unpacking makes things seem less improbable and convoluted.
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal… but that doesn’t stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
I think this is a map/territory confusion. I’m not denying that superintelligences will have a map of their own preferences; at a bare minimum, they need to know what they want in order to prevent themselves from accidentally changing their own preferences. But this map won’t be the AI’s preferences—those may be a very complicated causal process bound up with, say, certain environmental factors surrounding the AI, or oscillating with time, or who-knows-what.
There may not be a sharp line between the ‘preference’ part of the AI and the ‘non-preference’ part. Since any superintelligence will be exemplary at reasoning with uncertainty and fuzzy categories, I don’t think that will be a serious obstacle.
Does that help explain where I’m coming from? If not, maybe I’m missing the thread unifying your comments.
I suppose it helps, if only in that it establishes that much of what you’re saying to me is actually being addressed indirectly to somebody else, so it ought not surprise me that I can’t quite connect much of it to anything I’ve said. Thanks for clarifying your intent.
For my own part, I’m certainly not functioning here as Alex’s proxy; while I don’t consider explosive intelligence growth as much of a foregone conclusion as many folks here do, I also don’t consider Alex’s passionate rejection of the possibility justified, and have had extended discussions on related subjects with him myself in past years. So most of what you write in response to Alex’s positions is largely talking right past me.
(Which is not to say that you ought not be doing it. If this is in effect a private argument between you and Alex that I’ve stuck my nose into, let me know and I’ll apologize and leave y’all to it in peace.)
Anyway, I certainly agree that a system might have a representation of its goals that is distinct from the mechanisms that cause it to pursue those goals. I have one of those, myself. (Indeed, several.) But if a system is capable of affecting its pursuit of its goals (for example, if it is capable of correcting the effects of a state-change that would, uncorrected, have led to value drift), it is not merely interacting with maps. It is also interacting with the territory… that is, it is modifying the mechanisms that cause it to pursue those goals… in order to bring that territory into line with its pre-existing map.
And in order to do that, it must have such a mechanism, and that mechanism must be consistently isomorphic to its representations of its goals.

Yes?
Right. I’m not saying that there aren’t things about the AI that make it behave the way it does; what the AI optimizes for is a deterministic result of its properties plus environment. I’m just saying that something about the environment might be necessary for it to have the sorts of preferences we can most usefully model it as having; and/or there may be multiple equally good candidates for the parts of the AI that are its values, or their encoding. If we reify preferences in an uncautious way, we’ll start thinking of the AI’s ‘desires’ too much as its first-person-experienced urges, as opposed to just thinking of them as the effect the local system we’re talking about tends to have on the global system.
Hm.

So, all right. Consider two systems, S1 and S2, both of which happen to be constructed in such a way that right now, they are maximizing the number of things in their environment that appear blue to human observers, by going around painting everything blue.
Suppose we add to the global system a button that alters all human brains so that everything appears blue to us, and we find that S1 presses the button and stops painting, and S2 ignores the button and goes on painting.
Suppose that similarly, across a wide range of global system changes, we find that S1 consistently chooses the action that maximizes the number of things in its environment that appear blue to human observers, while S2 consistently goes on painting.
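(Here is roughly the toy picture I have in mind, in Python; the function names and numbers are mine, purely to pin down the difference between the two systems:)

```python
# S1 picks whichever available action its world-model predicts will maximize the
# number of things appearing blue to human observers; S2 just executes 'paint'.

def predicted_blue_count(world, action):
    if action == "press_button":
        return world["num_things"]     # humans rewired: everything appears blue to them
    if action == "paint":
        return world["painted"] + 1    # one more blue-painted thing
    return world["painted"]            # 'idle' changes nothing

def S1(world):                         # optimizes a criterion over predicted outcomes
    return max(world["actions"], key=lambda a: predicted_blue_count(world, a))

def S2(world):                         # fixed policy
    return "paint"

plain_world  = {"actions": ["paint", "idle"], "painted": 10, "num_things": 10**6}
button_world = {"actions": ["paint", "idle", "press_button"], "painted": 10, "num_things": 10**6}

print(S1(plain_world), S2(plain_world))    # paint paint
print(S1(button_world), S2(button_world))  # press_button paint
```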
I agree with you that if I reify S2’s preferences in an uncautious way, I might start thinking of S2 as “wanting to paint things blue” or “wanting everything to be blue” or “enjoying painting things blue” or as having various other similar internal states that might simply not exist, and that I do better to say it has a particular effect on the global system. S2 simply paints things blue; whether it has the goal of painting things blue or not, I have no idea.
I am far less comfortable saying that S1 has no goals, precisely because of how flexibly and consistently it is revising its actions so as to consistently create a state-change across wide ranges of environments. To use Dennett’s terminology, I am more willing to adopt an intentional stance with respect to S1 than S2.
If I’ve understood your position correctly, you’re saying that I’m unjustified in making that distinction… that to the extent that we can say that S1 and S2 have “goals,” the word “goals” simply refers to the state changes they create in the world. Initially they both have the goal of painting things blue, but S1’s goals keep changing: first it paints things blue, then it presses a button, then it does other things. And, sure, I can make up some story like “S1 maximizes the number of things in its environment that appear blue to human observers, while S2 just paints stuff blue” and that story might even have predictive power, but I ought not fall into the trap of reifying some actual thing that corresponds to those notional “goals”.

Am I in the right ballpark?
I think you’re switching back and forth between a Rational Choice Theory ‘preference’ and an Ideal Self Theory ‘preference’. To disambiguate, I’ll call the former R-preferences and the latter I-preferences. My R-preferences—the preferences you’d infer I had from my behaviors if you treated me as a rational agent—are extremely convoluted, indeed they need to be strongly time-indexed to maintain consistency. My I-preferences are the things I experience a desire for, whether or not that desire impacts my behavior. (Or they’re the things I would, with sufficient reflective insight and understanding into my situation, experience a desire for.)
We have no direct evidence from your story addressing whether S1 or S2 have I-preferences at all. Are they sentient? Do they create models of their own cognitive states? Perhaps we have a little more evidence that S1 has I-preferences than that S2 does, but only by assuming that a system whose goals require more intelligence or theory-of-mind will have a phenomenology more similar to a human’s. I wouldn’t be surprised if that assumption turns out to break down in some important ways, as we explore more of mind-space.
But my main point was that it doesn’t much matter what S1 or S2’s I-preferences are, if all we’re concerned about is what effect they’ll have on their environment. Then we should think about their R-preferences, and bracket exactly what psychological mechanism is resulting in their behavior, and how that psychological mechanism relates to itself.
I’ve said that R-preferences are theoretical constructs that happen to be useful a lot of the time for modeling complex behavior; I’m not sure whether I-preferences are closer to nature’s joints.
Initially they both have the goal of painting things blue, but S1’s goals keep changing: first it paints things blue, then it presses a button, then it does other things.
S1’s instrumental goals may keep changing, because its circumstances are changing. But I don’t think its terminal goals are changing. The only reason to model it as having two completely incommensurate goal sets at different times would be if there were no simple terminal goal that could explain the change in instrumental behavior.
I don’t think I’m switching back and forth between I-preferences and R-preferences.
I don’t think I’m talking about I-preferences at all, nor that I ever have been.
I completely agree with you that they don’t matter for our purposes here, so if I am talking about them, I am very very confused. (Which is certainly possible.)
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
If S is a competent optimizer (or “rational agent,” if you prefer) with R-preferences (preferences, goals, etc.) P, the existence of P will cause S to behave in ways that cause isomorphic effects (E) on a global system, so we can use observations of E as evidence of P (positing that S is a competent optimizer) or as evidence that S is a competent optimizer (positing the existence of P) or a little of both.
But however we slice it, P is not the same thing as E; E is merely evidence of P’s existence. We can infer P’s existence in other ways as well, even if we never observe E… indeed, even if E never gets produced. And the presence or absence of a given P in S is something we can be mistaken about; there’s a fact of the matter.
I think you disagree with the above paragraph, because you describe R-preferences (preferences, goals, etc.) as theoretical constructs rather than parts of the system, which suggests that there is no fact of the matter… a different theoretical approach might never include P, and it would not be mistaken, it would just be a different theoretical approach.
I also think that because, way back at the beginning of this exchange, when I suggested that “paint everything red AND paint everything blue” was an example of an incoherent goal (R-preference, preference, P), your reply was that it wasn’t a goal at all, since that state can’t actually exist in the world. Which suggests that you don’t see goals as internal states of optimizers and that you do equate P with E.
This is what I’ve been disputing from the beginning.
But to be honest, I’m not sure whether you disagree or not, as I’m not sure we have yet succeeded in actually engaging with one another’s ideas in this exchange.
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
You can treat earthquakes and thunderstorms and even individual particles as having ‘preferences’. It’s just not very useful to do so, because we can give an equally simple explanation for what effects things like earthquakes tend to have that is more transparent about the physical mechanism at work. The intentional strategy is a heuristic for black-boxing physical processes that are too complicated to usefully describe in their physical dynamics, but that can be discussed in terms of the complicated outcomes they tend to promote.
(I’d frame it: We’re exploiting the fact that humans are intuitively dualistic by taking the non-physical modeling device of humans (theory of mind, etc.) and appropriating this mental language and concept-web for all sorts of systems whose nuts and bolts we want to bracket. Slightly regimented mental concepts and terms are useful, not because they apply to all the systems we’re talking about in the same way they were originally applied to humans, but because they’re vague in ways that map onto the things we’re uncertain about or indifferent to.)
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects, yet when we can predict that, whatever the mechanism happens to be, it is the sort of mechanism that has those particular complex effects.
Thus we speak of evolution as an optimization process, as though it had a ‘preference ordering’ in the intuitively human (i.e., I-preference) sense, even though in the phenomenological sense it’s just as mindless as an earthquake. We do this because black-boxing the physical mechanisms and just focusing on the likely outcomes is often predictively useful here, and because the outcomes are complicated and specific. This is useful for AIs because we care about the AI’s consequences and not its subjectivity (hence we focused on R-preference), and because AIs are optimization processes of even greater complex specificity in mechanism and outcome than evolution (hence we adopted the intentional stance of ‘preference’-talk in the first place).
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
I agree this is often the case, because when we ask ‘what is this system capable of?’ we often hold the system fixed while examining possible worlds where the environment varies in all kinds of ways. But if the possible worlds we care about all have a certain environmental feature in common—say, because we know in reality that the environmental condition obtains, and we’re trying to figure out all the ways the AI might in fact behave given different values for the variables we don’t know about with confidence—then we may, in effect, include something about the environment ‘in the AI’ for the purposes of assessing its optimization power and/or preference ordering.
For instance, we might model the AI as having the preference ‘surround the Sun with a Dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a Dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun. Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting, so I don’t want us to be too committed to reifying AI preferences. They’re just a useful shorthand for the expected outcomes of the AI’s distinguishing features having a larger and more direct causal impact on things.
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects
Yes, agreed, for some fuzzy notion of “easily grasp” and “too complicated.” That is, there’s a sense in which thunderstorms are too complicated for me to describe in mechanistic terms why they’re having the effects they have… I certainly can’t predict those effects. But there’s also a sense in which I can describe (and even predict) the effects of a thunderstorm that feels simple, whereas I can’t do the same thing for a human being without invoking “want-speak”/intentional stance.
I’m not sure any of this is *justified*, but I agree that it is what we do… this is how we speak, and we draw these distinctions. So far, so good.
if the possible worlds we care about all have a certain environmental feature in common [...] we may, in effect, include something about the environment ‘in the AI’
I’m not really sure what you mean by “in the AI” here, but I guess I agree that the boundary between an agent and its environment is always a fuzzy one. So, OK, I suppose we can include things about the environment “in the AI” if we choose. (I can similarly choose to include things about the environment “in myself.”) So far, so good.
we might model the AI as having the preference ‘surround the Sun with a Dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a Dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun.
Here is where you lose me again… once again you talk as though there’s simply no fact of the matter as to which preference the AI has, merely our choice as to how we model it.
But it seems to me that there are observations I can make which would provide evidence one way or the other. For example, if it has the preference ‘surround the Sun with a Dyson sphere,’ then in an environment lacking the Sun I would expect it to first seek to create the Sun… how else can it implement its preferences? Whereas if it has the preference ‘conditioned on there being a Sun, surround it with a Dyson sphere’, then in an environment lacking the Sun I would not expect it to create the Sun.
So does the AI seek to create the Sun in such an environment, or not? Surely that doesn’t depend on how I choose to model it. The AI’s preference is whatever it is, and controls its behavior. Of course, as you say, if the real world always includes a Sun, then I might not be able to tell which preference the AI has. (Then again I might… the test I describe above isn’t the only test I can perform, just the first one I thought of, and other tests might not depend on the Sun’s absence.)
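(For concreteness, here is the sort of observational test I have in mind, as a purely hypothetical sketch; both candidate preference functions and the available options are made up by me:)

```python
# Two candidate preferences the AI might have; its behavior in a sun-less
# environment distinguishes them.

def unconditional_pref(outcome):
    # 'surround the Sun with a Dyson sphere', full stop
    return 1 if outcome["sun_exists"] and outcome["dyson_sphere"] else 0

def conditional_pref(outcome):
    # 'conditioned on there being a Sun, surround it with a Dyson sphere'
    if not outcome["sun_exists"]:
        return 1                       # condition is vacuously satisfied
    return 1 if outcome["dyson_sphere"] else 0

# Options available in a sun-less environment:
do_nothing = {"sun_exists": False, "dyson_sphere": False}
build_sun_then_sphere = {"sun_exists": True, "dyson_sphere": True}

for name, pref in [("unconditional", unconditional_pref), ("conditional", conditional_pref)]:
    best = max([do_nothing, build_sun_then_sphere], key=pref)
    print(name, "->", "builds a Sun" if best["sun_exists"] else "does nothing")
# unconditional -> builds a Sun
# conditional   -> does nothing (max() keeps the first of the tied options)
```

Same system, same sun-less environment; which preference it actually has shows up in which option it takes.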
But whether I can tell or not doesn’t affect whether the AI has the preference or not.
if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun
Again, no. Regardless of how we model it, the system’s preference is what it is, and we can study the system (e.g., see whether it creates the Sun) to develop more accurate models of its preferences.
Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting
I agree. But I do think the question of what the AI (or, more generally, an optimizing agent) will do in various situations is interesting, and it seems to me that you’re consistently eliding that question in ways I find puzzling.