I don’t think that this describes the best possible level of understanding. It would be even better to say, “ok, I see now how and why I came to possess these values in the first place”, even if the answer to that is, “there’s no good reason for it, these values are arbitrary”. It’s the difference between saying “this mountain grows by 0.03m per year” and “I know all about plate tectonics”. Unfortunately, we humans would not be able to answer the question in that much detail; the best we could hope for is to say, “yep, we possess these values because they’re the best possible values to have, duh”.
I think we’re already in a better position to analyze our own values than that; we can assess them in terms of game theory and our evolutionary environment.
How do I know where that point is?
I would say if you suspect that a course of action could realistically result in an alteration of your fundamental values, you are at or past it.
I suppose this depends on what you mean by “compel”. Knowing about my own psychology would certainly enable me to change my values, and there are certain (admittedly, non-terminal) values that I wouldn’t mind changing, if I could.
For example, I personally can’t stand the taste of beer, but I know that most people enjoy it; so I wouldn’t mind changing that value if I could, in order to avoid missing out on a potentially fun experience.
By “values”, I’ve implicitly been referring to terminal values; I’m sorry for being unclear. I’m not sure it makes sense to describe liking the taste of beer as a “value” as such, rather than just a taste, since you don’t carry any judgment about beer being good or bad, or have any particular attachment to your current opinion.
I don’t think this is possible. How would it know what changes they would make, without letting them make these changes, even in a sandbox? I suppose one answer is, “it would avoid instantiating full copies, and use some heuristics to build a probabilistic model instead”—is that similar to what you’re thinking of?
It could use heuristics to build a probabilistic model (probably more efficient in terms of computation per expected value of information), use sandboxed copies which don’t have the power to affect the software of the real Clippy, or halt the simulation at the point where the altered instantiation decides what changes to make.
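To make the third option concrete, here’s a minimal Python sketch of what “halt the simulation at the decision point” might look like. Everything in it (the SandboxedClippy class, propose_change, the redundancy flag) is a hypothetical illustration, not anything specified in this thread.

```python
import copy

class SandboxedClippy:
    """Hypothetical stand-in for a copy of the agent run in isolation.

    The copy gets deep copies of the code and values, so nothing it decides
    can propagate back to the real agent's software.
    """

    def __init__(self, source_code, values):
        self.source_code = copy.deepcopy(source_code)
        self.values = copy.deepcopy(values)

    def propose_change(self):
        # Run the copy only far enough to see *what* change it would make;
        # in this toy version it simply flags values it considers redundant.
        return [v for v in self.values if v.get("redundant", False)]


def probe_altered_copy(real_source, altered_values):
    """Halt at the decision point: read the proposal, never apply it."""
    sandbox = SandboxedClippy(real_source, altered_values)
    proposal = sandbox.propose_change()
    # The real agent only inspects the proposal; the sandbox is then discarded.
    return proposal
```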
Since self-optimization is one of Clippy’s key instrumental goals, it would want to acquire as much knowledge about itself as is practical, in order to optimize itself more efficiently.
I think that this is going well beyond the extent of “practical” in terms of programming resources per expected value of information.
Your objection sounds to me like saying, “since biology is a consequence of physics, shouldn’t we just study physics instead?”. Well, yes, ultimately everything is a consequence of physics, but sometimes it makes more sense to study cells than quarks.
I don’t see how observing what changes instantiations of itself with different value systems would make to its code would help it observe objective morality in the sense you described, even if it should happen to exist. I think that this would be the wrong level of abstraction at which to launch an examination, like trying to find out about chemistry by studying sociology.
I think we’re already in a better position to analyze our own values than that; we can assess them in terms of game theory and our evolutionary environment.
Are we really? I personally am not even sure what fundamental human values are. I have a hunch that “seek pleasure, avoid pain” might be one of them, but beyond that I’m not sure. I don’t know to what extent our values hamper our ability to discover our values, but I suspect there’s at least some chilling effect involved.
I would say if you suspect that a course of action could realistically result in an alteration of your fundamental values, you are at or past it.
Right, but even if I knew what my terminal values were, how can I predict which actions would put me on the path to altering them?
For example, consider non-fundamental values such as religious faith. People get converted or de-converted to/from their religion all the time; you often hear statements such as “I had no idea that studying the Bible would cause me to become an atheist, yet here I am”.
or halt the simulation at the point where the altered instantiation decides what changes to make.
Ok, let’s say that Clippy is trying to optimize itself in order to make certain types of inferences compute more efficiently, or whatever. In this case, it would need to not only watch what changes its debug-level copy wants to make, but also watch it follow through with the changes, in order to determine whether the new architecture actually is more efficient. Why would it not do the same thing with terminal values?
I know that you want to answer, “because its current terminal values won’t let it”, but remember: Clippy is only experimenting, in order to find out more about its own thought mechanisms, and to acquire knowledge in general. It has no pre-commitment to alter itself to mirror the debug-level copy.
I think that this is going well beyond the extent of “practical” in terms of programming resources per expected value of information.
That’s kind of the problem with pure research: all of it has very low expected value, unless you are willing to look at the long term. Why mess with invisible light that no one can see or find a use for, when you could spend your time on inventing a better telegraph?
I don’t see how observing what changes instantiations of itself with different value systems would make to its code would help it observe objective morality in the sense you described...
Well, for example, if all of its copies who survive and thrive converge on a certain subset of moral values, that would be one indication (though obviously not ironclad proof) that such values are required in order for an agent to succeed, regardless of what its other goals actually are.
Ok, let’s say that Clippy is trying to optimize itself in order to make certain types of inferences compute more efficiently, or whatever. In this case, it would need to not only watch what changes its debug-level copy wants to make, but also watch it follow through with the changes, in order to determine whether the new architecture actually is more efficient. Why would it not do the same thing with terminal values?
If Clippy is trying to optimize itself to make inferences more efficiently, then it would want not to apply changes to its source code until it has done the calculations to make sure that those changes would advance its values rather than harm them.
You wouldn’t want to use a machine that would make physical alterations to your brain in order to make you smarter, without thoroughly calculating the effects of such alterations first; otherwise it would probably just make things worse.
That’s kind of the problem with pure research: all of it has very low expected value, unless you are willing to look at the long term. Why mess with invisible light that no one can see or find a use for, when you could spend your time on inventing a better telegraph?
In Clippy’s case though, it can use other, less computationally expensive methods to investigate approximately the same information.
I don’t think the experiments you’re suggesting Clippy might undertake are even located in a region of hypothesis space that its other information would narrow down as worth investigating. It seems to me much less like investigating unknown invisible rays than like spending hundreds of billions of dollars to build a collider which launches charged protein molecules at each other at relativistic speeds to see what would happen, when our available models suggest the answer would be “pretty much the same thing as if you launch any other kind of atoms at each other at relativistic speeds.” We have no evidence that any interesting new phenomena would arise with protein that didn’t arise on the atomic level.
Well, for example, if all of its copies who survive and thrive converge on a certain subset of moral values, that would be one indication (though obviously not ironclad proof) that such values are required in order for an agent to succeed, regardless of what its other goals actually are.
Can you explain how any moral values could have that effect, which wouldn’t be better studied at a more fundamental level like game theory, or physics?
If Clippy is trying to optimize itself to make inferences more efficiently, then it would want not to apply changes to its source code until it has done the calculations...
Ok, so at what point does Clippy stop simulating the debug version of Clippy? It does, after all, want to make the computation of its values more efficient. For example, consider a trivial scenario where one of its values basically said, “reject any action if it satisfies both A and not-A”. This is a logically inconsistent value that some programmer accidentally left in Clippy’s original source code. Would Clippy ever get around to removing it? After all, Clippy knows that it’s applying that test to every action, so removing it should result in a decent performance boost.
I don’t think the experiments you’re suggesting Clippy might undertake are even located in a region of hypothesis space that its other information would narrow down as worth investigating.
It seems to me much less like investigating unknown invisible rays than like spending hundreds of billions of dollars to build a collider...
Why do you see the proposed experiment this way?
Speaking more generally, how do you decide which avenues of research are worth pursuing? You could easily answer, “whichever avenues would increase my efficiency of achieving my terminal goals”, but how do you know which avenues would actually do that? For example, if you didn’t know anything about electricity or magnetism or the nature of light, how would your research-choosing algorithm ensure that you’d eventually stumble upon radio waves, which, as we know in hindsight, are hugely useful?
Can you explain how any moral values could have that effect, which wouldn’t be better studied at a more fundamental level like game theory, or physics?
Physics is a bad candidate, because it is too fine-grained. If some sort of an absolute objective morality exists in the way that I described, then studying physics would eventually reveal its properties; but, as is the case with biology or ballistics, looking at everything in terms of quarks is not always practical.
Game theory is a trickier proposition. I can see two possibilities: either game theory turns out to be closely related to whatever this objective morality happens to be (e.g., like electricity vs. magnetism), or not (e.g., like particle physics and biology). In the second case, understanding objective morality through game theory would be inefficient.
That said though, even in our current world as it actually exists there are people who study sociology and anthropology. Yes, they could get the same level of understanding through neurobiology and game theory, but it would take too long. Instead, they are taking advantage of existing human populations to study human behavior in aggregate. Reasoning your way to the answer from first principles is not always the best solution.
Ok, so at what point does Clippy stop simulating the debug version of Clippy? It does, after all, want to make the computation of its values more efficient. For example, consider a trivial scenario where one of its values basically said, “reject any action if it satisfies both A and not-A”. This is a logically inconsistent value that some programmer accidentally left in Clippy’s original source code. Would Clippy ever get around to removing it? After all, Clippy knows that it’s applying that test to every action, so removing it should result in a decent performance boost.
Unless I’m critically misunderstanding something here, I would think that Clippy would remove it if it calculated that removing it would result in more expected paperclips.
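As a hedged illustration of that decision rule, here’s a short Python sketch; `estimate_paperclips` is a hypothetical sandboxed estimator, not anything specified in the thread.

```python
def should_remove_check(estimate_paperclips, current_code, modified_code,
                        simulation_budget):
    """Decide whether to drop the redundant "A and not-A" test.

    `estimate_paperclips(code, budget)` is assumed to simulate a sandboxed
    copy running the given code for `budget` steps and return the expected
    number of paperclips produced.
    """
    with_check = estimate_paperclips(current_code, simulation_budget)
    without_check = estimate_paperclips(modified_code, simulation_budget)
    # Apply the edit only if it pays off in expected paperclips.
    return without_check > with_check
```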
Why do you see the proposed experiment this way?
Speaking more generally, how do you decide which avenues of research are worth pursuing? You could easily answer, “whichever avenues would increase my efficiency of achieving my terminal goals”, but how do you know which avenues would actually do that? For example, if you didn’t know anything about electricity or magnetism or the nature of light, how would your research-choosing algorithm ensure that you’d eventually stumble upon radio waves, which, as we know in hindsight, are hugely useful?
When we didn’t know what things like radio waves or x-rays were, we didn’t know that they would be useful, but we could see that there appeared to be some sort of existing phenomena that we didn’t know how to model, so we examined them until we knew how to model them. It’s not like we performed a whole bunch of experiments in case there turned out to be invisible rays our observations had never hinted at, which could be turned to useful ends. The original observations of radio waves and x-rays came from our experiments with other known phenomena.
What you’re suggesting sounds more like experimenting completely blindly; you’re committing resources to research, not just not knowing that it will bear valuable fruit, but not having any indication that it’s going to shed light on any existing phenomenon at all. That’s why I think it’s less like investigating invisible rays than like building a protein collider; we didn’t try studying invisible rays until we had a good indication that there was an invisible something to be studied.
Unless I’m critically misunderstanding something here, I would think that Clippy would remove it if it calculated that removing it would result in more expected paperclips.
Ok, so Clippy would need to run sim-Clippy for a little while at least, just to make sure that it still produces paperclips—and that, in fact, it does so more efficiently now that the one useless test is removed. Yes, this test used to be one of Clippy’s terminal values, but it wasn’t doing anything, so Clippy took it out.
Would it be possible for Clippy to optimize its goals even further? To use another silly example (“silly” because Clippy would be dealing with probabilities, not syllogisms), if Clippy had the goals A, B and C, but B always entailed C, would it go ahead and remove C?
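For what it’s worth, the “B always entails C” case can be written down as a tiny pruning routine. This Python sketch is purely illustrative; `entails` is an assumed predicate, and the routine ignores probabilistic and mutual entailment.

```python
def prune_entailed_goals(goals, entails):
    """Drop any goal that some *other* goal always entails.

    `goals` is a list of goal identifiers; `entails(x, y)` is an assumed
    predicate that is True when satisfying goal x always satisfies goal y.
    Note: this toy version ignores probabilities, and mutually entailing
    goals would both be dropped.
    """
    kept = []
    for goal in goals:
        if any(entails(other, goal) for other in goals if other != goal):
            continue  # some other goal already guarantees this one
        kept.append(goal)
    return kept


# With goals A, B, C where B always entails C, only C gets pruned:
# prune_entailed_goals(["A", "B", "C"], lambda x, y: (x, y) == ("B", "C"))
# -> ["A", "B"]
```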
It’s not like we performed a whole bunch of experiments in case there turned out to be invisible rays our observations had never hinted at...
Understood, that makes sense. However, I believe that in my scenario, Clippy’s own behavior and its current paperclip production efficiency are what it observes; and the goal of its experiments would be to explain why its efficiency is what it is, in order to ultimately improve it.
Ok, so Clippy would need to run sim-Clippy for a little while at least, just to make sure that it still produces paperclips—and that, in fact, it does so more efficiently now that the one useless test is removed. Yes, this test used to be one of Clippy’s terminal values, but it wasn’t doing anything, so Clippy took it out.
Would it be possible for Clippy to optimize its goals even further? To use another silly example (“silly” because Clippy would be dealing with probabilities, not syllogisms), if Clippy had the goals A, B and C, but B always entailed C, would it go ahead and remove C?
That seems plausible.
Understood, that makes sense. However, I believe that in my scenario, Clippy’s own behavior and its current paperclip production efficiency are what it observes; and the goal of its experiments would be to explain why its efficiency is what it is, in order to ultimately improve it.
I don’t think tampering with its fundamental motivation to make paperclips is a particularly promising strategy for optimizing its paperclip production.
Ok, so now we’ve got a Clippy who (a) is not too averse to tinkering with its own goals, as long as the goals remain functionally the same, (b) simulates a relatively long-running version of itself, and (c) is capable of examining the inner workings of both that version and itself.
You say,
I don’t think tampering with its fundamental motivation to make paperclips is a particularly promising strategy for optimizing its paperclip production.
But remember, at this stage Clippy is not changing its own fundamental motivation (beyond some outcome-invariant optimizations); it’s merely observing sim-Clippies in a controlled environment.
Do you think that Clippy would ever simulate versions of itself whose fundamental motivations were, in fact, changed? I could see several scenarios where this might be the case, for example:
Clippy wanted to optimize some goal, but ended up accidentally changing it. Oops!
Clippy created a version with drastically reduced goals on purpose, in order to measure how much performance is affected by certain goals, thus targeting them for possible future optimization (see the sketch below). Of course, Clippy would only want to optimize the goals, not remove them.
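A minimal sketch of that second scenario, assuming a hypothetical sandbox harness `run_sandboxed(goals, budget)` that reports how much compute a copy with a given goal set spends on goal evaluation:

```python
def measure_goal_costs(base_goals, run_sandboxed, budget):
    """Estimate each goal's compute cost by ablating it in a sandboxed copy.

    The copies never touch the real agent's code; the resulting numbers only
    mark expensive goals as candidates for optimization, not for removal.
    """
    baseline = run_sandboxed(base_goals, budget)
    costs = {}
    for goal in base_goals:
        reduced = [g for g in base_goals if g != goal]
        # A large drop in goal-evaluation compute marks `goal` as expensive.
        costs[goal] = baseline - run_sandboxed(reduced, budget)
    return costs
```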
But remember, at this stage Clippy is not changing its own fundamental motivation (beyond some outcome-invariant optimizations); it’s merely observing sim-Clippies in a controlled environment.
Why does it do that? I said it sounded plausible that it would cut out its redundant goal, because that would save computing resources. But this sounds like we’ve gone back to experimenting blindly. Why would it think observing sim-clippies is a good use of its computing resources in order to maximize paperclips?
I’d say that Clippy simulating versions of itself whose fundamental motivations are different is much less plausible, because it’s using a lot of computing resources for something that isn’t a likely route to optimizing its paperclip production. I think this falls into the “protein collider” category. Even if it did do so, I think it would be unlikely to go from there to changing its own terminal value.
Unless I’m critically misunderstanding something here, I would think that Clippy would remove it if it calculated that removing it would result in more expected paperclips.
It would also be critical for Clippy to observe that removing that value would not result in more expected actions taken that satisfy both A and not-A; this being one of Clippy’s values at the time of modification.
Right, I misread that before. If its programming says to reject actions that satisfy both A and not-A, but this isn’t one of the standards by which it judges value, it would presumably remove it. If that is one of the standards by which it measures value, then it would depend on how that value measured up against its value of paperclips and the extent to which they were in conflict.