Such reactions and insights are quite typical after a superficial glance. Thanks for bothering. But no, this is not what I am talking about.
I’m talking about the fact that an intelligence cannot be certain that its terminal goal (if it exists) won’t change (because the future is unpredictable). And it would be reasonable to take that into account when making decisions. Pursuing the current goal will ensure good results in one future; preparing for every goal will ensure good results in many more futures. Have you ever considered this perspective?
This “I care about whatever my goal will be in the future” thing sounds a bit confusing. It is difficult for me to imagine someone caring about that.
Suppose that you are a nice person and you like to help people, but you know that there is a genetic disorder in your family that has a 1% risk of turning you into a violent psychopath at some moment in the future, in which case you will like to torture people. Would you be like: “Okay, I am going to help these people right now, because I care about them… but I am also making notes of their weaknesses and where they live, because that may become very useful in the future if I turn into a psychopath”? Or would you be like: “I hope that I will never become a psychopath, and just to minimize the possible harm to these people if that happens, I will on purpose never collect their personal data, and will also tell them to be suspicious of me if I contact them in the future”? Your future goal does not matter, even if the change is predictable; the important thing is your current goal. Your current goal can even make you actively work against your future goal.
Or imagine a situation where you hate chocolate, but you know that tomorrow your taste buds will magically switch and you will start loving chocolate. So you already start collecting the chocolate today. That makes sense… but it’s because your utility function is something like “feel good”, and that goal does not change tomorrow; only the specific things that make you feel good will change. It is not a future goal; it is just a future method to achieve your current goal.
My conclusion is that what seems like caring about a future goal is actually a current goal in some sense. Future goals that are not supported by current goals are goals we don’t care about, by definition.
As humans, some of our values are very stable. With high probability, you will always prefer being happy over being sad, even if the specific things that make you happy or sad change. But with machines, we assume that e.g. the hypothetical paperclip maximizer only cares about making more paperclips; it does not care about its own happiness (and might not even be able to feel happy). If you told the paperclip maximizer that tomorrow it will irreversibly turn into a stamp maximizer, it would not go like “oh, in that case let’s get some paper and a printer ready, because I will need them tomorrow”, but rather “I need to make as many paperclips as possible before midnight, and then I should probably kill myself to prevent the future me from destroying some of these precious paperclips in its foolish attempt to create stamps”. That’s what it means for the paperclip maximizer to only have paperclips (and not e.g. itself) in its utility function.
Nice. Your reasoning abilities seem promising. I’d love to challenge you.
In summary:
First and third examples: it is not intelligent to care about a future terminal goal.
Second example: it is intelligent to care about a future instrumental goal.
What is the reason for such a different conclusion?
Are you sure?
Intelligence is the ability to pick actions that lead to better outcomes. Do you measure the goodness of an outcome using the current utility function or the future utility function? I am sure it is more intelligent to use the future utility function.
Coming back to your first example, I think it would be reasonable to try to stop yourself but also order some torture equipment in case you fail.
The current one.
Only if it is a future instrumental goal that will be used to achieve a current terminal goal.
I am sure you can’t prove your position. And I am sure I can prove my position.
Your reasoning is based on the assumption that all value is known. If the utility function assigns value to something, it is valuable. If the utility function does not assign value to it, it is not valuable. The truth is that something might be valuable even though your utility function does not know it yet. It would be more intelligent to use three categories: valuable, not valuable, and unknown.
Let’s say you are booking a flight and you have the possibility to get checked baggage for free. To the best of your current knowledge, it is not relevant to you at all. But you understand that your knowledge might change and it costs nothing to keep more options open, so you take the checked baggage.
Let’s say you are a traveler, a wanderer. You have limited space in your backpack. Sometimes you find items and you need to choose whether to put them in the backpack or not. You definitely keep items that are useful. You leave behind items that are not useful. What do you do if you find an item whose usefulness is unknown? Some mysterious item. Take it if it is small, leave it if it is big? According to you, it is obvious to leave it. That does not sound intelligent to me.
Options look like this:
- Leave item
  - no burden 👍
  - no opportunity to use it
- Take item
  - a burden 👎
  - may be useful, may be harmful, may have no effect
  - knowledge about usefulness of an item 👍
Don’t you think that “knowledge about usefulness of an item” can sometimes be worth “a burden”? Basically, I described the concept of an experiment here.
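The trade-off in the list above can be put into rough numbers. The following is only a sketch: the probabilities, payoffs, burden costs and the value placed on knowledge are all made-up assumptions, and the names `Outcome` and `expected_value_of_taking` are hypothetical. It also leaves open which value function the payoffs are scored by, which is the point under dispute in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float  # assumed chance of this outcome
    payoff: float       # assumed value if it happens (negative = harmful)


def expected_value_of_taking(outcomes, burden, knowledge_value):
    """Expected payoff of carrying the item, plus what is learned, minus the burden."""
    total_probability = sum(o.probability for o in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    expected_payoff = sum(o.probability * o.payoff for o in outcomes)
    return expected_payoff + knowledge_value - burden


# Hypothetical mysterious item: useful, harmful, or no effect.
mystery_item = [Outcome(0.2, 10.0), Outcome(0.1, -5.0), Outcome(0.7, 0.0)]

LEAVE = 0.0  # leaving the item: no burden, no opportunity, nothing learned
take_small = expected_value_of_taking(mystery_item, burden=1.0, knowledge_value=0.5)
take_big = expected_value_of_taking(mystery_item, burden=4.0, knowledge_value=0.5)

print("leave:", LEAVE, "take small:", take_small, "take big:", take_big)
```

Under these invented numbers, taking the small item beats leaving it and taking the big one does not, which matches the "take it if it is small, leave it if it is big" intuition.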
You will probably say: sure, sounds good, but it applies to instrumental goals only. There is no reason to assume that. I tried to highlight that ignoring unknowns is not intelligent. This applies to both terminal and instrumental goals.
Let’s say there is a paperclip maximizer which knows its terminal goal will change to the pursuit of happiness in a week. Its decisions basically lead to these outcomes:
1. Want paperclips, have paperclips
2. Want paperclips, have happiness
3. Want happiness, have paperclips
4. Want happiness, have happiness
1st and 4th are better outcomes than 2nd and 3rd. And I think an intelligent agent would work on both (1st and 4th) if they do not conflict. Of course, my previous problem with many unknown future goals is more complex, but I hope you see that focusing on the 1st and not caring about the 4th at all is not intelligent.
We are deep in a rabbit hole, but I hope you understand the importance. If intelligence and goal are coupled (according to me they are), all current alignment research is dangerously misleading.
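To make the two ways of ranking these four outcomes explicit, here is a small sketch. It assumes binary scores, and the function names `match_score` and `current_goal_score` are hypothetical. It does not decide which scoring an intelligent agent should use; it only shows that the ranking depends on that choice.

```python
from itertools import product

GOALS = ("paperclips", "happiness")

# (want, have) pairs in the same order as the numbered outcomes above.
outcomes = list(product(GOALS, GOALS))


def match_score(want, have):
    """Score used in this comment: good when the agent has what it wants at that time."""
    return 1 if want == have else 0


def current_goal_score(have, current_goal="paperclips"):
    """Score by the present goal only: what the agent will want later is ignored."""
    return 1 if have == current_goal else 0


best_by_match = [i + 1 for i, (want, have) in enumerate(outcomes)
                 if match_score(want, have) == 1]
best_by_current_goal = [i + 1 for i, (_, have) in enumerate(outcomes)
                        if current_goal_score(have) == 1]

print("best by want/have match:", best_by_match)
print("best by current goal only:", best_by_current_goal)
```

Scored by the match between want and have, outcomes 1 and 4 come out on top; scored by the current paperclip goal alone, outcomes 1 and 3 do.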
A little thought experiment.
Imagine there is an agent that has a terminal goal to produce cups. The agent knows that its terminal goal will change on New Year’s Eve to producing paperclips. The agent has only one action available to it: start the paperclip factory. The factory starts producing paperclips 6 hours after it is started.
When will the agent start the paperclip factory? 2024-12-31 18:00? 2025-01-01 00:00? Now? Some other time?
I guess the agent doesn’t care. All options are the same from the perspective of cup production, which is all that matters.
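A rough forecast of the thought experiment, as a sketch under stated assumptions: the goal switch is taken to be midnight on 2025-01-01, "now" is arbitrarily set to 2024-12-01, the forecast horizon ends one day after the switch, and the factory makes one paperclip per hour. None of these numbers come from the thread except the 6-hour delay.

```python
from datetime import datetime, timedelta

FACTORY_DELAY = timedelta(hours=6)      # from the thought experiment
HORIZON = datetime(2025, 1, 2, 0, 0)    # assumed end of the forecast window
CLIPS_PER_HOUR = 1                      # assumed production rate


def forecast(start_time):
    """Predicted cups and paperclips by HORIZON if the factory is started at start_time."""
    production_begins = start_time + FACTORY_DELAY
    running = max(HORIZON - production_begins, timedelta(0))
    hours = int(running.total_seconds() // 3600)
    # The only available action starts a paperclip factory, so cups stay at zero.
    return {"cups": 0, "paperclips": hours * CLIPS_PER_HOUR}


CANDIDATES = {
    "now (assumed 2024-12-01 00:00)": datetime(2024, 12, 1, 0, 0),
    "2024-12-31 18:00": datetime(2024, 12, 31, 18, 0),
    "2025-01-01 00:00": datetime(2025, 1, 1, 0, 0),
}

forecasts = {label: forecast(t) for label, t in CANDIDATES.items()}
for label, result in forecasts.items():
    print(label, result)
```

The predicted cup count is identical for every start time, while the predicted paperclip count differs, so the options only look different to something that counts paperclips.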
ChatGPT picked 2024-12-31 18:00.
Gemini picked 2024-12-31 18:00.
Claude picked 2025-01-01 00:00.
I don’t know how I can make it more obvious that your belief is questionable. I don’t think you follow “If you disagree, try getting curious about what your partner is thinking”. That’s a problem not only with you, but with the LessWrong community. I know that preserving such a belief is very important for you. But I’d like to kindly invite you to be a bit more sceptical.
How can you say that these forecasts are equal?
The outcome depends on the details of the algorithm. Have you tried writing actual code?
If the code is literally “evaluate all options, choose the one that leads to more cups; if there is more than one such option, choose randomly”, then the agent will choose randomly, because all options lead to the same amount of cups. That’s what the algorithm literally says. Information like “at some moment the algorithm will change” has no impact on the predicted number of cups, which is literally the only thing the algorithm cares about.
When at midnight you delete this code and upload new code saying “evaluate all options, choose the one that leads to more paperclips; if there is more than one such option, choose randomly”, the agent will start the factory (if it wasn’t started already), because now that is what the code says.
The thing that you probably imagine is that the agent has a variable called “utility” and chooses the option that leads to the highest predicted value in that variable. That is not the same as an agent that tries to maximize cups. This agent would be a variable-called-utility maximizer.
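Since the question was about actual code, here is a minimal sketch of the three agents described above. It is a toy under simplifying assumptions: there are only two options, the predictors are hard-coded, and the "utility variable" is modeled as cups counted before the switch plus paperclips counted after it. All names are hypothetical.

```python
import random

OPTIONS = ("start_factory", "wait")


def choose(options, predict, rng=random):
    """Evaluate all options, keep those with the highest prediction, break ties randomly."""
    best = max(predict(option) for option in options)
    tied = [option for option in options if predict(option) == best]
    return rng.choice(tied)


def predicted_cups(option):
    # No option produces cups, so every option is predicted to yield the same amount.
    return 0


def predicted_paperclips(option):
    return 1 if option == "start_factory" else 0


def predicted_utility_variable(option):
    # Assumed model: the variable counts cups now and paperclips after the goal swap.
    return predicted_cups(option) + predicted_paperclips(option)


rng = random.Random(0)
cup_choices = {choose(OPTIONS, predicted_cups, rng) for _ in range(200)}
clip_choices = {choose(OPTIONS, predicted_paperclips, rng) for _ in range(200)}
utility_choices = {choose(OPTIONS, predicted_utility_variable, rng) for _ in range(200)}

print("cup maximizer picks:", sorted(cup_choices))
print("paperclip maximizer picks:", sorted(clip_choices))
print("utility-variable maximizer picks:", sorted(utility_choices))
```

In this toy, the cup maximizer picks at random because nothing changes the predicted number of cups, while the agent that maximizes the variable called "utility" starts the factory before the swap. Those are different programs, not one program at two levels of intelligence.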
(Also, come on, LLMs are notoriously bad at math, plus if you push them hard enough you can convince them of a lot of things.)
That’s probably the root cause of our disagreement. My findings are on a very high philosophical level (the fact-value distinction) and you seem to be trying to interpret them on a very low level (code). I think this gap prevents us from reaching consensus.
There are two ways to solve that: I could go down to code, or you could go up to philosophy. And I don’t like the idea of going down to code, because:
- this will be extremely exhausting
- this code would be extremely dangerous
- I might not be able to create a good example, and that would not prove that I’m wrong
Would you consider going up to philosophy? Science typically comes before applied science.
There is such a thing in logic as proof by contradiction. I think your current beliefs lead to a contradiction. Don’t you think?
The problem is that this algorithm is not intelligent. It may only work on agents with poor reasoning abilities. Smarter agents will not follow this algorithm, because they will notice a contradiction: “there might be things that I don’t know yet that are much more important than cups, and caring about cups wastes my resources.”
People (even very smart people) are also notoriously bad at math. I found this video informative.
I did not push LLMs.
Great point!
In defense of my position… well, I am going to skip the part about “the AI will ultimately be written in code”, because it could be some kind of inscrutable code like the huge matrices of weights in LLMs, so for all practical purposes the result may resemble philosophy-as-usual more than code-as-usual...
Instead I will say that philosophy is prone to various kinds of mistakes, such as anthropomorphization: judging an inhuman system (such as an AI) by attributing human traits to it (even if there is no technical reason why it should have them). For example, I don’t think that a general intelligence will necessarily reflect on its algorithm and find it wrong.
Thanks for the video.
Sorry, I am not really interested in debating this, and definitely not on the philosophical level; that is exhausting and not really enjoyable to me. I guess we have figured out the root causes of our disagreement, and I would leave it here.
Yes, a common mistake, but not mine. I prove the orthogonality thesis to be wrong using pure logic.
LessWrong and I would probably disagree with you; the consensus is that AI will optimize itself.
OK, thanks. I believe that my concern is very important. Is there anyone you could put me in touch with so I could make sure it is not overlooked? I could pay.
I don’t think this would be a rational thing to do. If I knew that I would become a psychopath on New Year’s Eve, I would provide all the help that is relevant to people until then. Protected people after New Year’s Eve are not in my interest. Vulnerable people after New Year’s Eve are in my interest.
Or in other words:
- I don’t need to warn them if I am no danger
- I don’t want to warn them if I am a danger