How do agents with preferential gaps fit into this? I think preferential gaps are a kind of weak incompleteness, and thus handled by your second step?
Context: I’m pretty interested in the claims in this post, and their implications. A while ago, I went back and forth with EJT a bit on his coherence theorems post. The thread ended here with a claim by EJT:
And agents with many preferential gaps may behave quite differently to expected utility maximizers.
I didn’t have a counterpoint at the time, but intuitively I am pretty skeptical that this claim is true.
Even an agent with infinitely many preferential gaps seems very close in mind-space to an agent with complete preferences: all it is missing is a relatively simple-to-describe function which “breaks the tie” between options it is already very close to indifferent about. And different choices of tiebreaker function seem unlikely to lead to importantly different behavior: whichever tiebreaker you pick, you are back to an EU maximizer.
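To make that intuition concrete, here is a toy sketch in Python (my own illustration, with made-up options and an arbitrary tiebreaker, not anything from the post or from EJT): a finite option set with one preferential gap, completed by a tiebreaker, and the utility function the completed agent then maximizes.

```python
from itertools import combinations

# Toy illustration (mine, not from the post): a finite option set with one
# preferential gap, completed by an arbitrary tiebreaker. On a finite set, any
# complete transitive strict preference can be read off as a utility function,
# so the completed agent just maximizes that utility (ignoring uncertainty, so
# "EU maximizer" here is really just a plain utility maximizer).

options = ["A", "B", "C"]

# Incomplete strict preferences: A > C and B > C, with a gap between A and B
# (the agent neither prefers one to the other nor is indifferent).
strict_pref = {("A", "C"), ("B", "C")}

def has_gap(x, y):
    return x != y and (x, y) not in strict_pref and (y, x) not in strict_pref

def tiebreak(x, y):
    # Hypothetical tiebreaker: resolve gaps alphabetically. Any other choice
    # would also yield a complete preference, just a different one.
    return (x, y) if x < y else (y, x)

completed = set(strict_pref)
for x, y in combinations(options, 2):
    if has_gap(x, y):
        completed.add(tiebreak(x, y))

# Read a utility function off the completed (now total) strict order:
# rank each option by how many options it beats.
utility = {o: sum((o, other) in completed for other in options) for o in options}
print(utility)                          # {'A': 2, 'B': 1, 'C': 0}
print(max(options, key=utility.get))    # 'A' -- the completed agent maximizes this
```

The alphabetical tiebreak is of course arbitrary; the point is just that whatever tiebreaker gets picked, the completed agent ends up maximizing some fixed utility function.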
The only remaining hope is to avoid having the agent ever pick or be imbued with a tiebreaker function at all. That requires at least two things:
1. The agent’s creators must not initialize it with such a tiebreaker function (this seems unlikely to happen by default, though it might if the creators are alignment researchers who know what they are doing).
2. The agent itself must be stable enough that it never chooses to self-modify or drift into completeness on its own. And I think your claim, if I’m understanding it correctly, is that such stability is unlikely, because completing the preferences can lead to a strict improvement in outcomes under the preferences of the original agent.
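To spell out how I’m reading that claim, here is a toy reconstruction in code (my own construction with a made-up option set, not the post’s actual argument):

```python
# Toy reconstruction (mine, heavily simplified) of the "completion is an
# improvement by the original agent's own lights" idea. Setup: the agent starts
# with A, strictly prefers A to A_minus, and has gaps between B and each of
# A and A_minus. Treating gap-crossing trades as permissible, the incomplete
# agent can trade A -> B -> A_minus and end up strictly worse by its own
# preferences. Any transitive completion that preserves A > A_minus must refuse
# at least one of the two trades, so the completed agent never ends up worse.

strict_pref = {("A", "A_minus")}  # the original agent's only strict preference

def original_permits(current, offer):
    # An incomplete agent may accept any trade to something not strictly dispreferred.
    return (current, offer) not in strict_pref

def completed_accepts(current, offer, completion):
    # A completed agent accepts a trade only if the total order ranks the offer
    # at least as high as what it currently holds (lower index = better).
    return completion.index(offer) <= completion.index(current)

trades = [("A", "B"), ("B", "A_minus")]  # offered in sequence, starting from A

holding = "A"
for current, offer in trades:
    if holding == current and original_permits(current, offer):
        holding = offer
print("incomplete agent can end with:", holding)   # A_minus, dispreferred to A

# One arbitrary (hypothetical) completion that respects A > A_minus:
completion = ["A", "B", "A_minus"]                  # best to worst
holding = "A"
for current, offer in trades:
    if holding == current and completed_accepts(current, offer, completion):
        holding = offer
print("completed agent ends with:", holding)        # A
```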
Am I understanding your claims correctly, and do you agree with my reasoning that EJT’s claim is thus unlikely to be true?