To borrow Eliezer’s examples, do you think you would discard the value of boredom given enough time to reflect?
I don’t know, but at least it seems plausible that I might. It’s clear that I have boredom as an emotion and as a behavior, but I don’t see a clear reason why I would want to make it into a preference. Certainly there are times when I wish I wouldn’t get bored as easily as I actually do, so I don’t want to translate boredom into a preference “as is”.
If I think about why I might not want a future where no one gets bored and everyone could enjoy the same thing over and over again, what seems to be driving that is an intuitive aversion to triviality (i.e., to things that look too easy or lack challenge). And if I think further about why I might not want a future where things look easy rather than difficult, I can’t really think of anything except that it probably has to do with signaling that I’m someone who likes a challenge, which does not seem like something I really want to base my “actual preferences” on.
Would you discard the desire to not be optimized too hard by an outside agent? Would you discard sympathy for other conscious humans?
These are similar for me. If I think about them long enough, it just becomes unclear why I might want to keep them.
Also, perhaps “keep” and “discard” aren’t the right words here. What you actually need to do (in this view of what values are) is to affirmatively create preferences (e.g., a utility function) from your intuitions. So for any potential value under consideration, you need reasons not for why you would want to “discard” it, but for why you would want to “create” it.
This should’ve been obvious from the start, but your comment has forced me to realize it only now: if we understand reflective equilibrium as the end result of unrestricted iterated self-modification, then it’s very sensitive to starting conditions. You and I could end up having very different value systems because I’d begin my self-modification by strengthening my safeguards against simplification of values, while you’d begin by weakening yours. And a stupid person doing unrestricted iterated self-modification will just end up someplace stupid. So this interpretation of “reflective equilibrium” is almost useless, right?
The interpretation of “reflective equilibrium” that I currently have in mind is something like this (written by Eliezer), which I think is pretty close to Yvain’s version as well:
I see the project of morality as a project of renormalizing intuition. We have intuitions about things that seem desirable or undesirable, intuitions about actions that are right or wrong, intuitions about how to resolve conflicting intuitions, intuitions about how to systematize specific intuitions into general principles.
And this may not be too different from what you have in mind when you say “unrestricted iterated self-modification”, but I wanted to point out that we could easily diverge in reflective equilibrium even without “hardware” self-modification, just by thinking and applying our intuitions, if those intuitions, and especially our meta-level intuitions, differ at the start. (And I do think this is pretty obvious, so it confuses me that Eliezer does not acknowledge it when he talks about CEV.)
So this interpretation of “reflective equilibrium” is almost useless, right?
I’m not sure what you mean, but in this case it seems at least useful for showing that we don’t have an argument showing that our “actual values” are complex. (Do you mean it’s not useful as a way to build FAI?)
(Do you mean it’s not useful as a way to build FAI?)
Yes.
we don’t have an argument showing that our “actual values” are complex
Do you agree that FAI probably needs to have a complex utility function, because most simple ones lead to futures we wouldn’t want to happen? The answer to that question doesn’t seem to depend on notions like reflective equilibrium or Yvain’s “actual values”, unless I’m missing something again.
Do you agree that FAI probably needs to have a complex utility function, because most simple ones lead to futures we wouldn’t want to happen?
How would I know that unless I knew what I “want”? What notion of “want” are you thinking of, if not something like “values endorsed in reflective equilibrium”? I assume you’re not thinking of “want” as opposed to “like” …
Perhaps you mean “there aren’t any simple utility functions that I would choose to implement in an AI right now and let it run, knowing it would then take over the world” but I don’t think that shows FAI probably needs to have a complex utility function. It could just be that I need more time to think things over but will eventually decide to implement a simple utility function.
Retreating further along the line of Eliezer’s reasoning to find the point where you start to disagree: how about AIs that don’t take over the world? For example, I want an AI that I can ask for a cheeseburger, and it will produce a cheeseburger for me while respecting my implied wishes to not burn the world with molecular nanotech or kill the neighbor’s dog for meat. Do you agree that such a device needs to have lots of specific knowledge about humans, and not just about cheeseburgers? If yes, then how is the goal of solving the world’s problems (saving kids in Africa, stopping unfriendly AIs, etc) relevantly different from the goal of making a cheeseburger?
Cousin_it and I had an offline chat. To recap my arguments:
It’s not clear that a cheeseburger-making AI needs to have lots of specific knowledge about humans. As we discussed, one possibility is to give it a utility function that assigns negative value to anything consequential crossing some geographical boundary, except a cheeseburger. (A toy sketch of what such a function’s structure might look like appears after this recap.)
More generally, the fact that we can’t easily think of a way to solve a particular real-world problem (with minimal side effects) using an AI with a simple utility function is only weak evidence that such a simple utility function doesn’t exist, since the space of utility functions simple enough to be hand coded is still enormous.
Even if there are some real-world problems that can’t be solved using simple utility functions, I don’t just want to solve particular problems. I want to get “what I really want” (in some sense that I can’t define clearly, but which is more like “my reflective equilibrium values” than “my current behavioral tendencies”), and it seems plausible that “what I really want” is less complex than the information needed to tell an AI how to solve particular problems while keeping the world otherwise unchanged.
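Here is a minimal toy sketch of what the boundary utility function from the first point above could look like. Everything in it is a placeholder I’m making up for illustration: the predicates, the penalty constant, and the event representation are hypothetical, and actually specifying “cheeseburger” and “consequential” is where the real difficulty would lie. The point is only that the function’s overall structure can be short.

# Toy sketch (illustrative only): penalize anything consequential crossing a
# geographical boundary, except a cheeseburger delivered to the requester.
# The predicates below are stand-in placeholders, not real proposals.

PENALTY = 1e9  # large cost for each unwanted boundary crossing

def is_cheeseburger(obj):
    # Placeholder test; a real version would need to recognize cheeseburgers.
    return obj.get("kind") == "cheeseburger"

def is_consequential(obj):
    # Placeholder test; a real version would need a notion of "consequential".
    return obj.get("consequential", True)

def utility(boundary_crossings, requester):
    """Score a world history by what crosses the boundary and where it goes."""
    score = 0.0
    for obj, destination in boundary_crossings:
        if is_cheeseburger(obj) and destination == requester:
            score += 1.0          # the single desired outcome
        elif is_consequential(obj):
            score -= PENALTY      # any other consequential crossing is penalized
    return score

# Example: one cheeseburger delivered to the requester, nothing else crosses.
print(utility([({"kind": "cheeseburger"}, "requester")], "requester"))  # prints 1.0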
I think Eliezer’s “thou art godshatter” argument was meant to be generally applicable: if it were sound, then we could conclude that anyone who thinks their values are simple is wrong in an objective sense. We no longer seem to have such an argument (that is still viable), and proponents of “complexity of value” perhaps have to retreat to something like “people who think they have simple values are wrong in the sense that their plans would be bad according to my values”.
“thou art godshatter” argument … We no longer seem to have such an argument (that is still viable)
How about this restatement (focus)? I think you agreed previously, and your response was more subtle than denying it. IIRC, you said that complex aspects of value are present, but possibly of low importance, so if we have no choice (no time), we should just go the simple way, while still capturing a nontrivial amount of value. This is different from saying that value could actually be simple.
There are many possible values, and we could construct agents that optimize any one of them. Among these many values, only a few are simple in abstract (i.e., not referring to artifacts in the world) and explicit form. What kind of agents are humans, and which of the many possible values are associated with them, in a sense of value-association general enough to apply to humans? For humans to have specifically those few-of-many simple values, they would need to be constructed in a clean way, with an explicit goal architecture that places those particular abstract goals in charge. But humans are not like that, they have a lot of accidental complexity in every joint, so there is no reason to expect them to implement that particular simple goal system.
This is an antiprediction: it argues from a priori improbability, rather than trying to shift an existing position that favors simple values. At this point, it intuitively feels to me that simple values are unprivileged, and there seems to be no known reason to expect them to be more likely than anything else (anything containing more data). This kind of simplicity is not the kind from Occam’s razor: it is like expecting the air molecules in the room to stay in one corner, arranged on a regular grid, as opposed to being distributed in a macroscopically more uniform but microscopically enormously detailed configuration. We have this brain-artifact that is known to hold lots of data, and expecting all this data to amount to a simple summary doesn’t look plausible to me.
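To make the counting intuition behind this antiprediction explicit (a rough sketch, on the simplifying assumption that a candidate value specification can be encoded as an n-bit string): descriptions shorter than k bits can single out at most 2^k − 1 of the 2^n possible specifications, so under a uniform prior

\[
\Pr[\text{description length} < k] \;\le\; \frac{2^k - 1}{2^n} \;<\; 2^{\,k-n},
\]

which is negligible whenever the amount of data embodied in the brain (n) vastly exceeds the length of anything we could hand-code (k).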
I think you agreed previously, and your response was more subtle than denying it. IIRC, you said that complex aspects of value are present, but possibly of low importance, so if we have no choice (no time), we should just go the simple way, while still capturing a nontrivial amount of value. This is different from saying that value could actually be simple.
Yes, this post is making a different point from that one.
But humans are not like that, they have a lot of accidental complexity in every joint, so there is no reason to expect them to implement that particular simple goal system.
But why think all that complexity is relevant? Surely at least some of the complexity is not relevant (for example a Tourette sufferer’s tendency to curse at random, or the precise ease with which some people get addicted to gambling). Don’t you need an additional argument to conclude that at least a substantial fraction of the complexity is relevant, and that this applies generally, even to those who think otherwise and would tend to discard such complexity when they reflect on what they really want?
Don’t you need an additional argument to conclude that at least a substantial fraction of the complexity is relevant, and that this applies generally, even to those who think otherwise and would tend to discard such complexity when they reflect on what they really want?
Again, it’s an antiprediction, an argument about what your prior should be, not an argument that takes the presumption that simple values are plausible as a starting point and then tries to convince you that it should be tuned down. Something is relevant; brains are probably relevant, since that is where decision-making happens. The claim that the relevant decision-theoretic summary of all that has any given rare property, like simplicity, is something that needs strong evidence, if you start from that prior. I don’t see why privileging simplicity is a viable starting point, or why this hypothesis deserves any more consideration than the claim that the future should contain a perfect replica of myself from the year 1995.
Why is simplicity assumed to be a rare property? There are large classes of things that tend to be “simple”, like much of mathematics; don’t you have to argue that “brains” or minds belong to the class of things whose members do not tend to be “simple”? (I can see obvious reasons that one would think this is obvious, and non-obvious reasons that one would think this is non-obvious, which makes me think that it shouldn’t be assumed, but if I’m the only person around who could make the non-obvious arguments then I guess we’re out of luck.)
When an answer is robustly unattainable, it’s pointless to speculate about what it might be; you can only bet or build conditional plans. If “values” are “simple”, but you don’t know that, your state of knowledge about the “values” remains non-simple, and that is what you impart to the AI. What does it matter which of these things, the confused state of knowledge or the correct answer to our confusion, we call “values”?
If I think the correct answer to our confusion will ultimately turn out to be something complex (in the sense of Godshatter-like), then I can rule out any plans that eventually call for hard coding such an answer into an AI. This seems to be Eliezer’s argument (or one of his main arguments) for implementing CEV.
On the other hand, if I think the correct answer may turn out to be simple, even if I don’t know what it is now, then there’s a chance I can find out the answer directly in the next few decades and then hard code that answer into an AI. Something like CEV is no longer the obvious best approach.
(Personally I still prefer a “meta-ethical” or “meta-philosophical” approach, but we’d need a different argument for it besides “thou art godshatter”/”complexity of value”.)