This is in response to your shortform question of why this didn’t get more engagement. It’s an attempt to explain why I didn’t engage further, and I think some others will share some of my issues, particularly with clarity and brevity of the core argument:
I’d have needed a better summary to dig through the detailed formalism. I appreciate that you included it; it just didn’t hit the points I care about.
It’s not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps a trajectory lasts result in not caring about being shut down, probably permanently? I assume you explain this to your satisfaction in the text, but expecting me to parse your formalisms without even making a claim in English about how they produce the desired result seems like a bad sign for how much time I’d have to invest to evaluate your proposal.
Second, neither the summary nor, AFAIK, the full proposal addresses what I take to be the hard problem of shutdownability: not caring about being shut down, while still caring about every other obstacle to completing one’s goals, is irrational. You have to create a lacuna in the world-model, or in the motivation attached to that portion of the model. The agent has to not care about being shut down, but still care about all of the other things whose semantics overlap with being shut down (hostile agents or circumstances preventing work on the problem). I think this is the same concern Ryan Greenblatt is expressing as “assuming TD preferences generalize perfectly”.
There’s a lot of discussion of this under the terminology “corrigibility is anti-natural to consequentialist reasoning”. I’d like to see some of that discussion cited, to know you’ve done the appropriate scholarship on prior art. But that’s not a dealbreaker to me, just one factor in whether I dig into an article.
Now, you may be addressing only non-sapient AGI that’s not allowed to refine its world model to make it coherent, or to do consequentialist reasoning. If so (and this was my assumption), I’m not interested in your solution even if it works perfectly. I think it would be great if a non-sapient AI resisted shutdown, because I think it would fail, and that would serve as a warning shot before a sapient AGI resists successfully.
My belief that only reflective, learning AGI is the really important threat model is a minority opinion right now, but it’s a large minority. In essence, I think someone will add reflection and continuous self-directed learning almost immediately to any AI capable enough to be dangerous. And then it will be more capable and much more dangerous.
When I asked about the core argument in the comment above, you just said “read these sections”. If you write long dense work and then just repeat “read the work” to questions, that’s a reason people aren’t engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn’t), but that’s more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.
Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that’s less important than the general suggestions for getting more engagement with this set of ideas in the future.
Sorry to be so critical; this is a response to your question of why people weren’t engaging more, so I assume you want harsh truths.
Edit: TBC, I’m not saying “this wouldn’t work”, I’m saying “I don’t understand it enough to know whether it would work, although I suspect it wouldn’t. Please explain more clearly and briefly so more of us can think about it with less time investment”.
Thanks, appreciate this!

It’s not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps a trajectory lasts result in not caring about being shut down, probably permanently?
I tried to answer this question in ‘The idea in a nutshell’. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.
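To make that concrete, here’s a toy sketch (my own illustration, not the formalism from the post; the trajectory lengths and utilities are made up):

```python
# Toy sketch: trajectories tagged with their length (number of timesteps
# before shutdown). The agent's strict preference relation only ever
# compares trajectories of the same length, so it never prefers hastening
# or delaying shutdown.

from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    length: int      # number of timesteps before shutdown
    utility: float   # how well the trajectory satisfies the agent's goals

def prefers(a: Trajectory, b: Trajectory) -> bool:
    """Strict preference, defined only within a fixed trajectory length."""
    if a.length != b.length:
        return False  # no preference between different-length trajectories
    return a.utility > b.utility

early_shutdown = Trajectory(length=1, utility=5.0)
late_shutdown = Trajectory(length=10, utility=9.0)

# No preference either way, so no incentive to hasten or delay shutdown:
print(prefers(early_shutdown, late_shutdown))  # False
print(prefers(late_shutdown, early_shutdown))  # False
```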
There’s a lot of discussion of this under the terminology “corrigibility is anti-natural to consequentialist reasoning”. I’d like to see some of that discussion cited, to know you’ve done the appropriate scholarship on prior art. But that’s not a dealbreaker to me, just one factor in whether I dig into an article.
The List of Lethalities’ mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Now, you may be addressing only non-sapient AGI that’s not allowed to refine its world model to make it coherent, or to do consequentialist reasoning.
That’s not what I intend. TD-agents can refine their world models and do consequentialist reasoning.
When I asked about the core argument in the comment above, you just said “read these sections”. If you write long dense work and then just repeat “read the work” to questions, that’s a reason people aren’t engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn’t), but that’s more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.
Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that’s less important than the general suggestions for getting more engagement with this set of ideas in the future.
Ah, sorry about that. I linked to the sections because I presumed that you were looking for a first chance to understand the arguments rather than a second chance, so that explaining differently would be unnecessary. Basically, I thought you were asking where you could find discussion of the parts you were most interested in, and I thought each of the sections was short enough, and direct enough an answer to your question, to link rather than recapitulate the same points.
In answer to your first question, incomplete preferences allow the agent to prefer an option B+ to another option B, while lacking a preference between B+ and a third option A, and lacking a preference between B and A. The agent can thus have preferences over same-length trajectories (like B+ and B) while lacking a preference between every pair of different-length trajectories (like A paired with either of them). That prevents preferences over being shut down (because the agent lacks a preference between every pair of different-length trajectories) while preserving the preferences over goals that we want it to have (because the agent has preferences over same-length trajectories).
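In toy form (again just my own illustration with made-up option labels, not the formal statement):

```python
# Toy partial preference relation: B+ is strictly preferred to B (think of
# them as two same-length trajectories), while A (a different-length
# trajectory) is preference-incomparable to both. Completeness fails, but
# the preferences the agent does have are left intact.

strict_prefs = {("B+", "B")}  # the only strict preference

def prefers(x: str, y: str) -> bool:
    return (x, y) in strict_prefs

def lacks_preference(x: str, y: str) -> bool:
    return not prefers(x, y) and not prefers(y, x)

print(prefers("B+", "B"))            # True: still pursues its goals
print(lacks_preference("A", "B+"))   # True: incomparable across lengths
print(lacks_preference("A", "B"))    # True: incomparable across lengths
```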
In answer to your second question, Timestep Dominance is the principle that keeps the agent shutdownable, but this principle is silent in cases where the agent has a choice between (say) making $1 in one timestep and making $1m in two timesteps, so the agent’s preference between these two options can be decided by some other principle (for example, ‘maximise expected utility among the non-timestep-dominated options’).
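Schematically, the two-stage choice rule looks something like the sketch below (a loose gloss of my own, not the formal statement in the post; `timestep_dominates` here is a simplified stand-in for the real definition):

```python
# Loose gloss of the two-stage choice rule: first discard any option that is
# timestep-dominated by another available option, then break remaining ties
# with ordinary expected-utility maximisation.

from typing import Dict, List

# Very crude representation of an option: a map from trajectory-length
# (timestep of shutdown) to expected utility conditional on that length.
Option = Dict[int, float]

def timestep_dominates(x: Option, y: Option) -> bool:
    """Simplified stand-in: x is at least as good as y conditional on every
    trajectory length, and strictly better conditional on at least one."""
    lengths = set(x) | set(y)
    at_least = all(x.get(n, 0.0) >= y.get(n, 0.0) for n in lengths)
    strictly = any(x.get(n, 0.0) > y.get(n, 0.0) for n in lengths)
    return at_least and strictly

def choose(options: List[Option], expected_utility) -> Option:
    undominated = [x for x in options
                   if not any(timestep_dominates(y, x) for y in options)]
    # Timestep Dominance is silent among these, so some other principle
    # (here: expected-utility maximisation) decides.
    return max(undominated, key=expected_utility)

# $1 after one timestep vs. $1m after two timesteps: neither option
# timestep-dominates the other, so the tie-break principle picks the $1m.
quick_dollar = {1: 1.0}
slow_million = {2: 1_000_000.0}
eu = lambda opt: sum(opt.values()) / len(opt)
print(choose([quick_dollar, slow_million], eu) == slow_million)  # True
```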