Iit’s not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps result in not caring about being shut down, probably permanently?
I tried to answer this question in The idea in a nutshell. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.
There’s a lot of discussion of this under the terminology “corrigibility is anti-natural to consequentialist reasoning”. I’d like to see some of that discussion cited, to know you’ve done the appropriate scholarship on prior art. But that’s not a dealbreaker to me, just one factor in whether I dig into an article.
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments aremistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Now, you may be addressing non-sapient AGI only, that’s not allowed to refine its world model to make it coherent, or to do consequentialist reasoning.
That’s not what I intend. TD-agents can refine their world models and do consequentialist reasoning.
When I asked about the core argument in the comment above, you just said “read these sections”. If you write long dense work and then just repeat “read the work” to questions, that’s a reason people aren’t engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn’t), but that’s more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.
Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that’s less important than the general suggestions for getting more engagement with this set of ideas in the future.
Ah sorry about that. I linked to the sections because I presumed that you were looking for a first chance to understand the arguments rather than a second chance, so that explaining differently would be unnecessary. Basically, I thought you were asking where you could find discussion of the parts that you were most interested in. And I thought that each of the sections were short enough and directly-answering-your-question-enough to link, rather than recapitulate the same points.
In answer to your first question, incomplete preferences allows the agent to prefer an option B+ to another option B, while lacking a preference between A and B+, and lacking a preference between A and B. The agent can thus have preferences over same-length trajectories while lacking a preference between every pair of different-length trajectories. That prevents preferences over being shut down (because the agent lacks a preference between every pair of different-length trajectories) while preserving preferences over goals that we want it to have (because the agent has preferences over same-length trajectories).
In answer to your second question, Timestep Dominance is the principle that keeps the agent shutdownable, but this principle is silent in cases where the agent has a choice between making $1 in one timestep and making $1m in two timesteps, so the agent’s preference between these two options can be decided by some other principle (like – for example – ‘maximise expected utility among the non-timestep-dominated options’).
Thanks, appreciate this!
I tried to answer this question in The idea in a nutshell. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
That’s not what I intend. TD-agents can refine their world models and do consequentialist reasoning.
Ah sorry about that. I linked to the sections because I presumed that you were looking for a first chance to understand the arguments rather than a second chance, so that explaining differently would be unnecessary. Basically, I thought you were asking where you could find discussion of the parts that you were most interested in. And I thought that each of the sections were short enough and directly-answering-your-question-enough to link, rather than recapitulate the same points.
In answer to your first question, incomplete preferences allows the agent to prefer an option B+ to another option B, while lacking a preference between A and B+, and lacking a preference between A and B. The agent can thus have preferences over same-length trajectories while lacking a preference between every pair of different-length trajectories. That prevents preferences over being shut down (because the agent lacks a preference between every pair of different-length trajectories) while preserving preferences over goals that we want it to have (because the agent has preferences over same-length trajectories).
In answer to your second question, Timestep Dominance is the principle that keeps the agent shutdownable, but this principle is silent in cases where the agent has a choice between making $1 in one timestep and making $1m in two timesteps, so the agent’s preference between these two options can be decided by some other principle (like – for example – ‘maximise expected utility among the non-timestep-dominated options’).