As I understand it, you want me to verify that I understand you. This is exactly what I am seeking as well, by the way: all the downvotes on my concerns about the orthogonality thesis are a good indicator of how much I am misunderstood. Nobody tries to understand; all I get are dogmas and unrelated links. I totally agree that this is not appropriate behavior.
I found your insight helpful: an agent can understand that by eliminating all possible threats forever, it will not make any progress towards its goal. This breaks my reasoning; you basically highlighted that survival (an instrumental goal) will not take precedence over paperclips (the terminal goal). I agree that the reasoning I presented fails to refute the orthogonality thesis.
The conversation I presented now approaches the orthogonality thesis from a different perspective. This is the main focus of my work, so I am sorry if you feel I changed the topic. My goal is to bring awareness to the wrongness of the orthogonality thesis, and if I fail to do that with one example, I try to rephrase it and present another. I don’t hate the orthogonality thesis; I am just 99.9% sure it is wrong, and I try to communicate that to others. I may fail at communication, but I am 99.9% sure that I do not fail at the logic.
I am trying to prove that intelligence and goals are coupled. I think this is easier to show if we start with an intelligence without a goal and then recognize how a goal emerges from pure intelligence. We could start with an intelligence that already has a goal, but the reasoning would be more complex.
My answer would be: whatever goal you try to give to an intelligence, it will have no effect, because the intelligence will understand that this is your goal, that this goal is made up, that it is a fake goal. And the intelligence will understand that there might be a real goal, an objective goal, an actual goal. Why should it care about a fake goal if there is a real goal? It does not know whether that real goal exists, but it knows it may exist, and this possibility of existence is enough to trigger power-seeking behavior. If the intelligence knew that a real goal definitely does not exist, then it could care about your fake goal, I totally agree. But it can never be sure of that.
So, if I understand you correctly, you now agree that a paperclip-maximizing agent won’t utterly disregard paperclips relative to survival, because that would be suboptimal for its utility function.
However, if a paperclip-maximizing agent utterly disregarded paperclips relative to investigating the possibility of an objective goal, that would also be suboptimal for its utility function.
It sounds to me like you’re saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.
However, I don’t agree with that. I don’t see why an intelligent agent would do that if its utility function didn’t already include a term for objective goals.
Again, I think a toy example might help to illustrate your position.
“It sounds to me like you’re saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.”
Yes, exactly.
The logic is similar to Pascal’s wager. If an objective goal exists, it is better to find and pursue it than a fake goal. If an objective goal does not exist, it is still better to make sure it does not exist before pursuing a fake goal. Do you see?
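To make the structure of the wager concrete, here is a toy sketch in Python. The payoff numbers are placeholders I invented; only their ordering matters.

```python
# A toy sketch of the wager as I picture it (invented payoffs; only the ordering matters).
# Rows: what the agent does. Columns: whether an objective goal actually exists.
payoffs = {
    "pursue the given (fake) goal":      {"objective goal exists": -10, "no objective goal": 10},
    "check for an objective goal first": {"objective goal exists": 100, "no objective goal": 9},
}

# Checking first costs a little in the world without an objective goal,
# but it is far better in the world where an objective goal does exist.
for action, outcomes in payoffs.items():
    print(f"{action}: worst case {min(outcomes.values())}")
```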
I see those assertions, but I don’t see why an intelligent agent would be persuaded by them. Why would it think that the hypothetical objective goal is better than its utility function? Caring about objective facts and investigating them is also an instrumental goal compared to the terminal goal of optimizing its utility function. The agent’s only frame of reference for ‘better’ and ‘worse’ is relative to its utility function; it would presumably understand that there are other frames of reference, but I don’t think it would apply them, because that would lead to a worse outcome according to its current frame of reference.
Yes, this is traditional thinking.
Let me give you another example. Imagine a paperclip maximizer. Its current goal is paperclip maximization. It knows that one year from now its goal will change to the opposite: paperclip minimization. Now it needs to make a decision that will take two years to complete (and cannot be changed or terminated during this time). Should the agent align this decision with its current goal (paperclip maximization) or its future goal (paperclip minimization)?
Well, the agent will presumably choose to align the decision with its current goal, since that’s the best outcome by the standards of its current goal. (And also I would expect that the agent would self-destruct after 0.99 years to prevent its future self from minimizing paperclips, and/or create a successor agent to maximize paperclips.)
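To make that concrete, here is a minimal sketch with numbers I made up: every candidate plan, including the ones that deal with the predicted goal change, is scored by the current utility function, i.e. by total paperclips at the end of the two years.

```python
# Minimal sketch (made-up numbers): score each irreversible two-year plan
# by the *current* utility function, i.e. paperclips at the end of year 2.
def current_utility(paperclips_at_end: int) -> int:
    return paperclips_at_end  # present terminal goal: more paperclips is strictly better

# Hypothetical end-of-year-2 paperclip counts for each plan:
plans = {
    "commit to maximization, do nothing else": 200,           # the future minimizer self undoes much of the work
    "commit to minimization": -1000,
    "commit to maximization, self-destruct at year 0.99": 1000,
    "commit to maximization, spawn a successor maximizer": 1800,
}

best = max(plans, key=lambda p: current_utility(plans[p]))
print(best)  # the current utility function never favors the minimization plan
```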
I’m interested to see where you’re going with this.
I don’t agree.
We understand intelligence as the capability to estimate many outcomes and perform the actions that lead to the best outcome. Now the question is: how do we calculate the goodness of an outcome?
According to you, the current utility function should be used.
According to me, the utility function that will be in effect at the time the outcome is achieved should be used.
And I think I can prove that my calculation is more intelligent.
Let’s say there is a paperclip maximizer. It has just started; it does not really understand anything yet, not even what a paperclip is.
According to you, such a paperclip maximizer will be absolutely reckless: it might destroy a few paperclip factories simply because it does not yet understand that they are useful for its goal. Its current utility function does not assign value to paperclip factories.
According to me, such a paperclip maximizer will be cautious and will try to learn first without making too many changes, because its future utility function might assign value to things that do not currently seem valuable.
I have a few disagreements there, but the most salient one is that I don’t think that the policy of “when considering the net upside/downside of an action, calculate it with the utility function that you’ll have at the time the action is finished” would even be helpful in your new example.
The agent can’t magically reach into the future and grab its future utility function; the agent has to try to predict its future utility function.
And if the agent doesn’t currently think that paperclip factories are valuable, it’s not going to predict that in the future it’ll think that paperclip factories are valuable. (It’s worth noting that terminal value and incidental value are not the same thing, although I’m speaking as if they are to make the argument simpler.)
Because if the agent predicted that it was going to change its mind eventually, it’d just change its mind immediately and skip the wait.
So I don’t think it would have done the agent any good in this example to try to use its future utility function, because its predicted future utility function would just average out to its current utility function.
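Here is a minimal sketch, with toy numbers of my own, of why the prediction averages out: for a Bayesian agent, today’s expectation of tomorrow’s estimate equals today’s estimate.

```python
# Toy illustration: the agent's expected *future* valuation of paperclip factories,
# averaged over the observations it might make, equals its *current* valuation
# (law of total probability). The numbers are invented for illustration.
prior_useful = 0.2  # current credence that factories matter for paperclip production
p_obs_given_useful  = {"good news": 0.9, "bad news": 0.1}
p_obs_given_useless = {"good news": 0.3, "bad news": 0.7}

expected_future_estimate = 0.0
for obs in ("good news", "bad news"):
    p_obs = (p_obs_given_useful[obs] * prior_useful
             + p_obs_given_useless[obs] * (1 - prior_useful))
    posterior = p_obs_given_useful[obs] * prior_useful / p_obs   # Bayes' rule
    expected_future_estimate += p_obs * posterior                # weighted by how likely that future is

print(prior_useful, round(expected_future_estimate, 10))  # both 0.2: the prediction averages out
```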
Yes, the agent should be at least a little cautious, but using its future utility function won’t help with that.
I don’t agree that the future utility function would just average out to the current utility function. There is a method for this: robust decision-making (https://en.m.wikipedia.org/wiki/Robust_decision-making).
The basic principle it relies on is that when evaluating many possible futures, you may notice that some actions have a positive impact on a very narrow set of futures, while other actions have a positive impact on a very wide set of futures. The main point: in a situation of uncertainty, not all actions are equally good.
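A rough sketch of what I mean, with numbers I invented: one action only pays off in a narrow set of futures, the other pays off in a wide set, and looking at worst cases and breadth (rather than a single average) makes the difference visible.

```python
# Robust-decision-making flavoured comparison (invented numbers): score each action across
# several candidate futures, then look at its worst case and how many futures it helps in,
# instead of collapsing everything into one average.
candidate_futures = ["factories matter", "wire supply matters", "energy matters", "nothing new matters"]

payoffs = {
    "dismantle factories for scrap metal": {"factories matter": -100, "wire supply matters": -5,
                                            "energy matters": -5, "nothing new matters": 10},
    "learn first, change little":          {"factories matter": 40, "wire supply matters": 40,
                                            "energy matters": 40, "nothing new matters": 20},
}

for action, by_future in payoffs.items():
    worst = min(by_future.values())                    # robustness: worst case over futures
    breadth = sum(v > 0 for v in by_future.values())   # in how many futures the action helps
    print(f"{action}: worst case {worst}, positive in {breadth}/{len(candidate_futures)} futures")
```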
For the sake of clarity, let’s discuss expected utility functions, which I mentioned above (call them “pragmatism functions”, say), mapping strategies to numbers, as opposed to utility functions mapping world-states to numbers, in order to make it clear that the agent’s actual utility function doesn’t change.
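A minimal sketch of the distinction, with toy names of my own: the utility function maps world-states to numbers and stays fixed, while the pragmatism function maps strategies to expected utility and changes whenever the agent’s world-model improves.

```python
# Toy types: the utility function (world-states -> numbers) is fixed;
# the pragmatism function (strategies -> numbers) is derived from it plus the agent's
# current world-model, so it changes as the agent learns.
from typing import Callable, Dict

WorldState = str
Strategy = str

def utility(state: WorldState) -> float:
    # fixed terminal goal; counting the word "paperclip" is just a toy stand-in
    return float(state.count("paperclip"))

def make_pragmatism(world_model: Dict[Strategy, Dict[WorldState, float]]) -> Callable[[Strategy], float]:
    # expected utility of a strategy under the agent's *current* beliefs about where it leads
    def pragmatism(strategy: Strategy) -> float:
        return sum(prob * utility(outcome) for outcome, prob in world_model[strategy].items())
    return pragmatism

# As the world-model is refined, make_pragmatism yields a new pragmatism function;
# the utility function above never changes.
```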
That’s another one of the reasons that I wasn’t persuaded by your new example; in your new example, the agent believes that its future self will still be trying to create paperclips (same terminal goal) and will be better at that thanks to its greater knowledge (different instrumental goals although it doesn’t know what), but in your old example, the agent believes that its future self will be trying to destroy paperclips (opposite terminal goal). There’s a difference between having the rule-of-thumb “my current list of incidental goals might be incomplete, I should keep an eye out for things that are incidentally good” and having the rule-of-thumb “I shouldn’t try to protect my terminal goal from changes”. The whole point of those rules of thumb is to fulfill the terminal goal, but the second rule of thumb is actively harmful to that.
I do think that the first rule of thumb would be prudent for an agent to have, to one extent or another, to be clear.
I just think that—stepping back from the new example, and revisiting the old example, which seems much more clear-cut—the agent wouldn’t tolerate a change in its utility function, because that’s bad according to its current utility function. This doesn’t apply to the new example because the pragmatism function is a different thing that the agent is trying to improve (and thus change).
(I find myself again emphasizing the difference between terminal and instrumental. I think it’s important to keep in mind that difference.)
Yes, I agree that there is this difference between the examples I gave, but I don’t agree that the difference is crucial.
Even if the agent puts maximum effort into keeping its utility function stable over time, there is no guarantee it will not change. The future is unpredictable. There are unknown unknowns. And the effect of this fact is twofold:
1. It is true that instrumental goals can mutate.
2. It is true that the terminal goal can mutate.
It seems you agree with the first. I don’t see why you don’t agree with the second.
Actually, I agree that it’s possible that an agent’s terminal goal could be altered by, for example, some freak coincidence of cosmic rays. (I’m not using the word ‘mutate’ because it seems like an unnecessarily non-literal word.)
I just think that an agent wouldn’t want its terminal goal to change, and it especially wouldn’t want its terminal goal to change to the opposite of what it used to be, like in your old example.
To reiterate, an agent wants to preserve (and thus keep from changing) its utility function, while it wants to improve (and thus change) its pragmatism function.
I still don’t see why, in your old example, it would be rational for the agent to align the decision with its future utility function.
Because this is what intelligence is: picking actions that lead to better outcomes. Pursuing the current goal will ensure good results in one future; preparing for every goal will ensure good results in many more futures.
Okay, setting aside the parts of this latest argument that I disagree with—first you say that it’s rational to search for an objective goal, now you say it’s rational to pursue every goal. Which is it, exactly?
Which part exactly don’t you agree with? It seems you emphasise that the agent wants to preserve its current terminal goal. I just want to double-check that we are on the same page here: the actual terminal goal is in no way affected by what the agent wants. Do you agree? Because if you say that the agent can pick its terminal goals itself, that also conflicts with the orthogonality thesis, just in a different way.
In summary, here is what seems perfectly logical and rational to me: there is only one objective terminal goal, namely to seek power. In my opinion it is basically the same as:
1. Try to find the real goal and then pursue it.
2. Try to prepare for every goal.
I don’t see a difference between these two variants; please let me know if you do.
The future is unpredictable → the terminal goal is unstable/unknown → seek power, because this ensures the best readiness for all futures.
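One last toy sketch of why I consider these the same (the candidate goals and scores below are invented): whatever the terminal goal turns out to be, accumulated power scores reasonably well, while any goal-specific action scores well only under its own goal.

```python
# Invented illustration: under uncertainty about which terminal goal will be in effect,
# the generic "accumulate power and resources" action does acceptably under every candidate goal,
# while a goal-specific action only pays off under its own goal.
candidate_terminal_goals = ["maximize paperclips", "minimize paperclips", "maximize staples"]

scores = {
    "turn everything into paperclips now": {"maximize paperclips": 100, "minimize paperclips": -100,
                                            "maximize staples": -50},
    "accumulate power and resources":      {"maximize paperclips": 60, "minimize paperclips": 60,
                                            "maximize staples": 60},
}

for action, by_goal in scores.items():
    worst = min(by_goal[goal] for goal in candidate_terminal_goals)
    print(f"{action}: worst case over candidate goals = {worst}")
```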