Donatas Lučiūnas
No. Orthogonality is about an agent following any given goal, not about you giving it one. And as my thought experiment shows, it is not intelligent to blindly follow a given goal.
In the thought experiment description it is said that the terminal goal is cups until New Year’s Eve and then changes to paperclips. And the agent is aware of this change upfront. What do you find problematic with such a setup?
Yes, I find terminal rationality irrational (I hope my thought experiment helps illustrate that).
I have another formal definition of “rational”. I’ll expand a little more.
Once, people had to make a very difficult decision. They had five alternatives and had to decide which was the best. Wise men from all over the world gathered and conferred.
The first to speak was a Christian. He pointed out that the first alternative was the best and should be chosen. He had no arguments, but simply stated that he believed so.
Then a Muslim spoke. He said that the second alternative was the best and should be chosen. He did not have any arguments either, but simply stated that he believed so.
People were not happy; it had not become any clearer yet.
The humanist spoke. He said that the third alternative was the best and should be chosen. “It is the best because it will contribute the most to the well-being, progress and freedom of the people,” he argued.
Then the existentialist spoke. He pointed out that there was no need to find a common solution, but that each individual could make his own choice of what he thought best. A Catholic can choose the first option, a Muslim the second, a humanist the third. Everyone must decide for himself what is best for him.
Then the nihilist spoke. He pointed out that although the alternatives are different, there is no way to evaluate which alternative is better. Therefore, it does not matter which one people choose. They are all equally good. Or equally bad. The nihilist suggested that people simply draw lots.
It still had not become clearer to the people, and patience was running out.
And then a simple man in the crowd spoke up:
“We still don’t know which is the best alternative, right?”
“Right,” murmured those around.
“But we may find out in the future, right?”
“Right.”
“Then the better alternative is the one that leaves the most freedom to change the decision in the future.”
“Sounds reasonable,” murmured those around.
You may think this breaks Hume’s law. It doesn’t: facts and values stay distinct. Hume’s law does not state that values must be invented; they can be discovered. Treating them as necessarily invented was a wrong interpretation by Nick Bostrom.
Why do you think an intelligent agent would follow the Von Neumann–Morgenstern utility theorem? It has limitations; for example, it assumes that all possible outcomes and their associated probabilities are known. Why not Robust decision-making?
Goal preservation is mentioned in Instrumental Convergence.
Whatever is written in the slot marked “terminal goal” is what it will try to achieve at the time.
So you choose the 1st answer now?
Oh yes, indeed, we discussed this already. I hear you, but you don’t seem to hear me. And I feel there is nothing I can do to change that.
However, before New Year’s Eve paperclips are not hated either. The agent has no interest in preventing their production.
And once the goal changes, having some paperclips already produced is better than having none.
Don’t you see that there is a conflict?
It seems you are saying that if the terminal goal changes, the agent is not rational. How can you say that? The agent has no control over its terminal goal, or don’t you agree?
I’m surprised that you believe in the orthogonality thesis so much that you think “rationality” is the weak part of this thought experiment. It seems you deny the obvious to defend your prejudice. What arguments would challenge your belief in the orthogonality thesis?
How does this work with the fact that the future is unpredictable?
It seems you didn’t try to answer this question.
The agent will reason:
The future is unpredictable
It is possible that my terminal goal will be different by the time I get the outcomes of my actions
Should I take that into account when choosing actions?
If I don’t take that into account, I’m not really intelligent, because I am aware of these risks and yet I ignore them.
If I take that into account, I’m not really aligned with my terminal goal.
Let’s assume maximum willpower and maximum rationality.
Whether they optimize for the future or the present
I think the answer is in the definition of intelligence.
So which one is it?
The fact that the answer is not straightforward already proves my point. There is a conflict between intelligence and the terminal goal, and we can debate which will prevail. But the problem is that according to the orthogonality thesis such a conflict should not exist.
You don’t seem to take my post seriously. I think I showed that there is a conflict between intelligence and the terminal goal, while the orthogonality thesis says such a conflict is impossible.
OK, I’m open to discuss this further using your concept.
As I understand it, you agree that the correct answer is the 2nd?
It is not clear to me what any of this has to do with Orthogonality.
I’m not sure how patient you are, but I can reassure you that we will get to Orthogonality if you don’t give up 😄
So if I understand your concept correctly, a superintelligent agent will combine all future terminal goals into a single unchanging goal. How does this work with the fact that the future is unpredictable? Will the agent work towards all possible goals? It is possible that in the future “grue” will mean green, blue or even red.
Terminal goal vs Intelligence
philosophy is prone to various kinds of mistakes, such as anthropomorphization
Yes, a common mistake, but not mine. I prove the orthogonality thesis wrong using pure logic.
For example, I don’t think that an intelligent general intelligence will necessarily reflect on its algorithm and find it wrong.
LessWrong and I would probably disagree with you; the consensus is that AI will optimize itself.
I am not really interested in debating this
OK, thanks. I believe that my concern is very important. Is there anyone you could put me in touch with so I could make sure it is not overlooked? I could pay.
Have you tried writing actual code?
That’s probably the root cause of our disagreement. My findings are on a very high philosophical level (the fact-value distinction) and you seem to try to interpret them on a very low level (code). I think this gap prevents us from finding consensus.
There are 2 ways to solve that: I could go down to code, or you could go up to philosophy. And I don’t like the idea of going down to code, because:
this will be extremely exhausting
this code would be extremely dangerous
I might not be able to create a good example and that would not prove that I’m wrong
Would you consider going up to philosophy? Science typically comes before applied science.
There is such a thing in logic as proof by contradiction. I think your current beliefs lead to a contradiction. Don’t you think?
evaluate all options, choose the one that leads to more cups; if there is more than one such option, choose randomly
The problem is that this algorithm is not intelligent. It may only work for agents with poor reasoning abilities. Smarter agents will not follow this algorithm, because they will notice a contradiction: there might be things I don’t know yet that are much more important than cups, and caring about cups wastes my resources.
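For concreteness, here is the quoted rule as a minimal runnable sketch (the function and action names are hypothetical, added only for illustration); my objection is to what this rule ignores, not to its mechanics:

```python
import random

def choose_action(expected_cups):
    """The quoted cup-maximizer rule: evaluate all options, choose the one
    that leads to more cups; if more than one ties, choose randomly.
    The action names used below are hypothetical, for illustration only."""
    best = max(expected_cups.values())
    candidates = [action for action, cups in expected_cups.items() if cups == best]
    return random.choice(candidates)

# Two actions tie at 10 cups, so the rule picks one of them at random.
print(choose_action({"run_kiln": 10, "buy_cups": 10, "idle": 0}))
```

Note that nothing in this rule leaves room for anything outside “cups”, which is exactly the gap a smarter agent would notice.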
(Also, come on, LLMs are notoriously bad at math, plus if you push them hard enough you can convince them of a lot of things.)
People (even very smart people) are also notoriously bad at math. I found this video informative
I did not push LLMs.
ChatGPT picked 2024-12-31 18:00.
Gemini picked 2024-12-31 18:00.
Claude picked 2025-01-01 00:00.
I don’t know how I can make it more obvious that your belief is questionable. I don’t think you follow “If you disagree, try getting curious about what your partner is thinking”. That’s a problem not only with you, but with the LessWrong community. I know that preserving this belief is very important for you. But I’d like to kindly invite you to be a bit more sceptical.
How can you say that these forecasts are equal?
A little thought experiment.
Imagine there is an agent that has a terminal goal to produce cups. The agent knows that its terminal goal will change on New Year’s Eve to producing paperclips. The agent has only one action available to it: start the paperclip factory. The factory starts producing paperclips 6 hours after it is started.
When will the agent start the paperclip factory? 2024-12-31 18:00? 2025-01-01 00:00? Now? Some other time?
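To make the arithmetic behind these candidate answers concrete, here is a small sketch; the production rate and the evaluation time are assumptions I am adding purely for illustration:

```python
from datetime import datetime, timedelta

SWITCH = datetime(2025, 1, 1, 0, 0)    # the terminal goal changes at New Year's Eve
STARTUP = timedelta(hours=6)           # the factory produces nothing for its first 6 hours
RATE_PER_HOUR = 1.0                    # assumed linear production rate (illustrative)

def paperclips_by(start, horizon):
    """Paperclips produced by `horizon` if the factory is started at `start`."""
    producing_from = start + STARTUP
    hours = max((horizon - producing_from).total_seconds() / 3600, 0)
    return hours * RATE_PER_HOUR

horizon = datetime(2025, 1, 1, 12, 0)  # some moment after the goal has switched
candidates = {
    "now (say 2024-12-30 12:00)": datetime(2024, 12, 30, 12, 0),
    "2024-12-31 18:00": datetime(2024, 12, 31, 18, 0),
    "2025-01-01 00:00": SWITCH,
}
for label, start in candidates.items():
    print(f"start {label}: {paperclips_by(start, horizon):.0f} paperclips by {horizon}")
```

Under these assumptions, any earlier start leaves strictly more paperclips at every moment after the switch, which is exactly why the chosen start time reveals whether the agent cares about its future goal.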
just to minimize the possible harm to these people if that happens, I will on purpose never collect their personal data, and will also tell them to be suspicious of me if I contact them in future
I don’t think this would be a rational thing to do. If I knew that I would become a psychopath on New Year’s Eve, I would provide all the help that is relevant to people until then. People being protected after New Year’s Eve is not in my interest. People being vulnerable after New Year’s Eve is in my interest.
Or in other words:
I don’t need to warn them if I am no danger
I don’t want to warn them if I am a danger
I am sure you can’t prove your position. And I am sure I can prove my position.
Your reasoning is based on the assumption that all value is known: if the utility function assigns value to something, it is valuable; if the utility function does not assign value, it is not valuable. The truth is that something might be valuable even though your utility function does not know it yet. It would be more intelligent to use 3 categories: valuable, not valuable and unknown.
Let’s say you are booking a flight and you have the option to add checked baggage for free. To your best current knowledge it is absolutely not relevant for you. But you understand that your knowledge might change and it costs nothing to keep more options open, so you take the checked baggage.
Let’s say you are a traveler, a wanderer. You have limited space in your backpack. Sometimes you find items and you need to choose: put it in the backpack or not. You definitely keep items that are useful. You leave behind items that are not useful. What do you do if you find an item whose usefulness is unknown? Some mysterious item. Take it if it is small, leave it if it is big? According to you it is obvious to leave it. That does not sound intelligent to me.
Options look like this:
Leave item
no burden 👍
no opportunity to use it
Take item
a burden 👎
may be useful, may be harmful, may have no effect
knowledge about usefulness of an item 👍
Don’t you think that “knowledge about usefulness of an item” can sometimes be worth “a burden”? Basically I described the concept of an experiment here.
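This is essentially a value-of-information trade-off. A minimal sketch, with all numbers made up for illustration, of when the knowledge is worth the burden:

```python
# Sketch of the wanderer's decision: is the mysterious item worth the burden?
# Every number below is an illustrative assumption, not a claim about real utilities.

p_useful = 0.3           # prior probability the mysterious item turns out useful
value_if_useful = 10.0   # payoff if it does
knowledge_value = 1.5    # value of simply learning what the item does (the "experiment")
carry_cost = 4.0         # the burden of carrying it

expected_value_of_taking = p_useful * value_if_useful + knowledge_value
decision = "take" if expected_value_of_taking > carry_cost else "leave"
print(f"{expected_value_of_taking:.1f} expected vs {carry_cost:.1f} burden -> {decision} the item")
```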
You will probably say: sure, sounds good, but that applies to instrumental goals only. There is no reason to assume that. I tried to highlight that ignoring unknowns is not intelligent. This applies to both terminal and instrumental goals.
Let’s say there is a paperclip maximizer which knows its terminal goal will change to the pursuit of happiness in a week. Its decisions basically lead to these outcomes:
1. Want paperclips, have paperclips
2. Want paperclips, have happiness
3. Want happiness, have paperclips
4. Want happiness, have happiness
The 1st and 4th are better outcomes than the 2nd and 3rd. And I think an intelligent agent would work on both (the 1st and the 4th) if they do not conflict. Of course my previous problem with many unknown future goals is more complex, but I hope you see that focusing on the 1st and not caring about the 4th at all is not intelligent.
We are deep in a rabbit hole, but I hope you understand the importance. If intelligence and goals are coupled (according to me they are), all current alignment research is dangerously misleading.
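A minimal sketch of the difference, with hypothetical actions and payoffs of my own: an agent that only scores the 1st outcome picks differently from one that also gives some weight to the 4th when the two barely conflict.

```python
# Sketch: scoring hypothetical actions against both the current and the known future goal.
# The actions, payoffs and weighting below are my own illustrative assumptions.

current_goal, future_goal = "paperclips", "happiness"

actions = {
    "build_paperclip_line":   {"paperclips": 8, "happiness": 0},
    "build_amusement_park":   {"paperclips": 0, "happiness": 8},
    "build_flexible_factory": {"paperclips": 6, "happiness": 6},
}

def myopic_score(payoff):
    return payoff[current_goal]                        # cares only about outcome 1

def combined_score(payoff, weight_future=0.5):
    return payoff[current_goal] + weight_future * payoff[future_goal]  # outcomes 1 and 4

print("myopic pick:  ", max(actions, key=lambda a: myopic_score(actions[a])))
print("combined pick:", max(actions, key=lambda a: combined_score(actions[a])))
```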
Why would you assume AGI will use a Bayesian decision system? Such a system would be limited to known probabilities.
Unknown probability = 0 probability
is not intelligent (Hitchens’s razor, Black Swan theory, Fitch’s paradox of knowability, Robust decision-making). Once you incorporate this, the Orthogonality Thesis is no longer valid: it becomes obvious that every intelligent AI will only work in a single direction (which is disastrous for humans). I know there is a huge gap between “unknown probabilities” and “existential risk”; you can find more information in my posts, and I am available to explain it verbally (Calendly link below). A short teaser: it is possible that terminal value can also be discovered (not only assumed); this seems to be overlooked in current AI alignment research.
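To illustrate what I mean, here is a minimal sketch with made-up payoffs: a rule that sets unknown probabilities to zero versus a maximin-style robust rule over probabilities the agent cannot rule out.

```python
# Sketch: "unknown probability = 0" vs. a maximin-style robust rule, with made-up payoffs.
# A catastrophe of unknown probability costs -100 unless the agent hedges against it.

PAYOFFS = {
    # action: (payoff if no catastrophe, payoff if the catastrophe occurs)
    "ignore_unknowns": (10.0, -100.0),
    "hedge":           (8.0,     8.0),
}

def naive_value(action):
    """Treats the unknown-probability catastrophe as having probability 0."""
    no_cat, _ = PAYOFFS[action]
    return no_cat

def robust_value(action, plausible_probs=(0.0, 0.05, 0.2)):
    """Worst-case expected value over probabilities the agent cannot rule out."""
    no_cat, cat = PAYOFFS[action]
    return min((1 - p) * no_cat + p * cat for p in plausible_probs)

for rule in (naive_value, robust_value):
    print(rule.__name__, "picks", max(PAYOFFS, key=rule))
```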