(Just tried having claude turn the thread into markdown, which seems to have worked):
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 3
Should AI be aligned with human preferences, rewards, or utility functions? Excited to finally share a preprint that @MicahCarroll @FranklinMatija @hal_ashton & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!
This paper (https://arxiv.org/abs/2408.16984) is at once a critical review & research agenda. In it we characterize the role of preferences in AI alignment in terms of 4 preferentist theses. We then highlight their limitations, arguing for alternatives that are ripe for further research.
Our paper addresses each of the 4 theses in turn:
T1. Rational choice theory as a descriptive theory of humans
T2. Expected utility theory as a normative account of rational agency
T3. Single-human AI alignment as pref. matching
T4. Multi-human AI alignment as pref. aggregation
Addressing T1, we examine the limitations of modeling humans as (noisy) maximizers of utility functions (as done in RLHF & inverse RL; see the sketch below), which fails to account for:
Bounded rationality
Incomplete preferences & incommensurable values
The thick semantics of human values
As alternatives, we argue for:
Modeling humans as resource-rational agents
Accounting for how we do or do not commensurate / trade-off our values
Learning the semantics of human evaluative concepts, which preferences do not capture
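For concreteness, here is a minimal sketch (not from the paper) of the noisy-maximizer model that RLHF-style preference learning standardly assumes: Boltzmann-rational choice between two options, i.e. the Bradley-Terry model. The utilities and rationality coefficient below are illustrative assumptions.

```python
# Sketch of the noisy-utility-maximizer assumption behind RLHF and much of
# inverse RL: a human choosing between options a and b is modeled as
# Boltzmann-rational, so P(a chosen over b) is a logistic function of the
# utility difference (the Bradley-Terry model). All values are illustrative.
import math

def p_prefer_a(utility_a: float, utility_b: float, beta: float = 1.0) -> float:
    """P(human reports a > b) under Boltzmann rationality with rationality coefficient beta."""
    return 1.0 / (1.0 + math.exp(-beta * (utility_a - utility_b)))

# A modest utility gap already implies a near-deterministic preference...
print(p_prefer_a(1.0, 0.3, beta=5.0))  # ~0.97

# ...while the closest the model gets to "no preference" is exact 50/50
# indifference; it has no way to express incomparability or a preferential gap,
# which is one of the limitations listed above.
print(p_prefer_a(1.0, 1.0, beta=5.0))  # 0.5
```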
We then turn to T2, arguing that expected utility (EU) maximization is normatively inadequate. We draw on arguments by @ElliotThornley & others that coherent EU maximization is not required for AI agents. This means AI alignment need not be framed as “EU maximizer alignment”.
Jeremy Gillen @jeremygillen1 · Sep 4
I’m fairly confident that Thornley’s claim that preference completeness isn’t a requirement of rationality is mistaken. If offered the choice to complete its preferences, an agent acting according to his decision rule should choose to do so.
As long as it can also shift around probabilities of its future decisions, which seems reasonable to me. See Why Not Subagents?
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4
Hi! So first I think it’s worth clarifying that Thornley is focusing on what advanced AI agents will do, and is not as committed to saying something about the requirements of rationality (that’s our interpretation).
But to the point of whether an agent would/should choose to complete its preferences, see Sami Petersen’s more detailed argument on “Invulnerable Incomplete Preferences”:
https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
Regarding the trade between (sub)agents argument, I think that only holds in certain conditions—I wrote a comment on that post discussing one intuitive case where trade is not possible / feasible.
Oops sorry I see you were linking to a specific comment in that thread—will read, thanks!
Hmm okay, I read the money pump you proposed! It’s interesting but I don’t buy the move of assigning probabilities to future decisions. As a result, I don’t think the agent is required to complete its preferences, but can just plan in advance to go for A+ or B.
I think Petersen’s “Dynamic Strong Maximality” decision rule captures that kind of upfront planning (in a way that may go beyond the Caprice rule) while maintaining incompleteness, but I’m not 100% sure.
Yeah, there’s a discussion of this in footnote 16 of the Petersen article: https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1#fnrefr2zvmaagbir
Jeremy Gillen @jeremygillen1 · Sep 4
The move of assigning probabilities to future actions was something Thornley started, not me. Embedded agents should be capable of this (future actions are just another event in the world). Although this doesn’t work with infrabeliefs, so maybe in that case the money pump could break.
I’m not as familiar with Petersen’s argument, but my impression is that it results in actions indistinguishable from those of an EU maximizer with completed preferences (in the resolute choice case). Do you know any situation where it isn’t representable as an EU maximizer?
This is in contrast to Thornley’s rule, which does sometimes choose the bottom path of the money pump, which makes it impossible to represent as an EU maximizer. This seems like real incomplete preferences.
It seems incorrect to me to describe Petersen’s argument as formalizing the same counter-argument further (as you do in the paper), given how their proposals seem to have quite different properties and rely on different arguments.
xuan (ɕɥɛn / sh-yen) @xuanalogue · Sep 4
I wasn’t aware of this difference when writing that part of the paper! But AFAIK Dynamic Strong Maximality generalizes the Caprice rule, so that it behaves the same on the single-souring money pump, but does the “right thing” in the single-sweetening case.
Regarding whether DSM-agents are representable as EU maximizers, Petersen has a long section on this in the article (they call this the “Trammelling Concern”):
https://alignmentforum.org/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1#3___The_Trammelling_Concern
Jeremy Gillen @jeremygillen1 · 21h
Section 3.1 seems consistent with my understanding. Sami is saying that the DSM-agent arbitrarily chooses a plan among those that result in one of the maximally valued outcomes.
He calls this untrammeled, because even though the resulting actions could have been generated by an agent with complete preferences, it “could have” made another choice at the beginning.
But this kind of “incompleteness” looks useless to me. Intuitively: If AI designers are happy with each of several complete sets of preferences, they could arbitrarily choose one and then put them into an agent with complete preferences.
All Sami’s approach does is let the AI do exactly that arbitrary choice just before it starts acting. If you want a locally coherent AI tool, as you discuss later in the paper, this approach won’t help you.
You can get the kind of Taskish behavior you want by being very careful about the boundedness and locality of the preferences, and using separate locally bounded Tool AIs each with a separate task (as you describe in the paper).
But the local completeness proposal at the end of 3.2 in your paper will break if the agent is capable of weak forms of self-modification or commitment, due to the money pump argument.
I do think it’s possible to make such local Taskish agents work. You’d just need to exploit the main problem with VNM, which is that it doesn’t allow preferences over non-terminal outcomes.
Sorry for being so critical; overall I think the paper is good, and all of the arguments I looked at outside of section 3 seem strong. Well done.
xuan (ɕɥɛn / sh-yen) @xuanalogue · 20h
Thank you! I’m still not sure if DSM-agents will have that failure mode. I think the case that seems most important to think through is how they’ll behave under uncertainty about whether shutdown / context switching will happen.
At least in the full Thornley proposal, there are cases where it’s clear to me that having complete preferences over trajectory lengths will cause shutdown-avoidance/seeking, and that DSM-agents with incomplete preferences over trajectory lengths will avoid this.
Perhaps those DSM-agents can be represented as having complete preferences once they’ve committed to a particular plan/policy. But if so, then it seems like that preference ordering will have to be over something other than trajectories.
Jeremy Gillen @jeremygillen1 · 17h
I’ve usually been assuming the preferences are over final outcomes, as usual in VNM. Incompleteness is kinda useless if the preferences are over trajectories, because any behavior can be implemented anyway.
I think you’re wrong that DSM-agents with incomplete preferences will avoid shutdown-avoidance/seeking. I’d be interested to hear the cases that are clear to you.
I’ve constructed a toy scenario that combines my money pump with the classic button manipulation scenario (for utility indifference) from section 4.2 of the Corrigibility paper.
The agent first makes an observation. Then it has an opportunity to deliberately trigger shutdown (if it does, the result is a length-1 trajectory). If it doesn’t shut down, it does a length-2 trajectory, whose value depends on the initial observation.
With the observation it could get lucky (90%) or unlucky (10%). Possible outcomes are L, U and S. Preferences are L > U, L||S, U||S. (All different-length trajectories are pref-gapped, as per Thornley’s proposal.)
Let’s say that by default the agent deliberately chooses shutdown 50% of the time (or any arbitrary probability, since, as per Thornley’s proposal, stochastic choice between trajectory lengths is indicative of preferential gaps). Then the overall outcome is {L: 45%, U: 5%, S: 50%}.
Then an agent could do strictly better by committing at the beginning to cause shutdown if it observes the unlucky observation, and cause shutdown 44.4% of the time if it observes the lucky observation.
The resulting outcome is {L:50%, U:0%, S:50%}. Since the relative probabilities of each trajectory length haven’t changed, this should statewise dominate the outcome of the previous policy, so it is chosen by Thornley’s decision rule. It’s a pure shift of probability from U to L.
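A minimal sketch (not part of the thread) that reproduces the arithmetic of this toy scenario. The 90/10 observation probabilities, the default 50% shutdown rate, and the ~44.4% figure come from the tweets above; the function and variable names are illustrative.

```python
# Outcome distributions for the toy shutdown scenario under two policies.
# Probabilities come from the tweets above; structure and names are illustrative.

P_LUCKY, P_UNLUCKY = 0.9, 0.1

def outcome_dist(p_shutdown_if_lucky: float, p_shutdown_if_unlucky: float) -> dict:
    """Probabilities of outcomes L, U, S given an observation-dependent shutdown policy."""
    return {
        "L": P_LUCKY * (1 - p_shutdown_if_lucky),      # lucky, not shut down (length-2 trajectory)
        "U": P_UNLUCKY * (1 - p_shutdown_if_unlucky),  # unlucky, not shut down (length-2 trajectory)
        "S": P_LUCKY * p_shutdown_if_lucky + P_UNLUCKY * p_shutdown_if_unlucky,  # shutdown (length-1)
    }

# Default policy: shut down with probability 0.5 regardless of the observation.
default = outcome_dist(0.5, 0.5)      # {'L': 0.45, 'U': 0.05, 'S': 0.50}

# Committed policy: always shut down after the unlucky observation; shut down
# 4/9 (~44.4%) of the time after the lucky one, so P(S) stays at 0.5 overall.
committed = outcome_dist(4 / 9, 1.0)  # {'L': 0.50, 'U': 0.00, 'S': 0.50}

# Trajectory-length probabilities are unchanged, and probability has moved
# purely from U to L (with L preferred to U), which is the claimed improvement.
assert abs(default["S"] - committed["S"]) < 1e-9
print(default)
print(committed)
```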
Under DSM (with resolute choice), the agent never implements a plan that is suboptimal from an earlier perspective, so it will also choose beforehand to commit to pushing the shutdown button if it observes the unlucky observation.
(This is a simplification, since the human button presser is missing, but it all works the same if you add the human and have actions that disconnect or press the button. I left all button control to the agent because it’s sufficient to demonstrate that it will manipulate).
xuan (ɕɥɛn / sh-yen) @xuanalogue · 16h
Thanks, I’ll think about this! The cases where I think complete preferences over trajectory lengths lead to shutdown avoidance are on pp. 42-43 of the original IPP write-up:
BTW it also seems like Thornley discusses cases like yours as forms of “Managing the News” in Section 15 of that same document, and acknowledges that it is a limitation!
Jeremy Gillen @jeremygillen1
> This is in contrast to Thornley’s rule, which does sometimes choose the bottom path of the money pump, which makes it impossible to represent as an EU maximizer. This seems like real incomplete preferences.
> It seems incorrect to me to describe Petersen’s argument as formalizing the same counter-argument further (as you do in the paper), given how their proposals seem to have quite different properties and rely on different arguments.

I think I was wrong about this. I misinterpreted a comment made by Thornley, sorry! See here for details.