23. Corrigibility is anti-natural to consequentialist reasoning; “you can’t bring the coffee if you’re dead” for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.
There is one approach to corrigibility that I don’t see mentioned in the “tried and failed” post Eliezer linked to here. It’s also one that Evan Hubinger at MIRI, among others, is still working on: myopia (i.e. myopic cognition).
There are different formulations, but the basic idea is that an AI with myopic cognition would have an extremely high time preference. This means that it would never sacrifice reward now for reward later, and so it would essentially be exempt from instrumental convergence. In theory such an AI would allow itself to be shut down (without forcing shutdown), and it would also not be prone to deceptive alignment.
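To make the “extremely high time preference” idea concrete, here’s a minimal sketch (my own illustration, not anything from Evan’s formal write-ups) that models myopia as an extreme discount factor in an ordinary discounted-return calculation; the plan names and reward numbers are invented for the example:

```python
# Minimal sketch: myopia modeled as an extreme discount factor (gamma ~ 0)
# in a standard discounted-return calculation. Plans and rewards are invented
# purely for illustration.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a stream of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two candidate plans: grab a small reward immediately, or "invest" now
# (e.g. acquire resources, resist shutdown) for a larger reward later.
grab_now = [1, 0, 0, 0, 0]
invest   = [0, 0, 0, 0, 10]

for gamma in (0.0, 0.5, 0.99):
    v_now, v_later = discounted_return(grab_now, gamma), discounted_return(invest, gamma)
    best = "grab_now" if v_now >= v_later else "invest"
    print(f"gamma={gamma:4.2f}: grab_now={v_now:.2f}, invest={v_later:.2f} -> {best}")

# With gamma near 0 (extreme time preference), deferred payoffs contribute
# essentially nothing, so "invest for later" plans (the instrumentally
# convergent ones) never win; with gamma near 1 they dominate.
```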
Myopia isn’t fully understood yet and has a number of open problems. It will also likely require verification using advanced interpretability tools that haven’t been developed yet. Still, I think it’s a research direction we as a field should be investing in to figure out whether it can work, and the corrigibility question shouldn’t be considered closed until we’ve at least done that. I can’t see anything unnatural about an agent that has both consequentialist reasoning capabilities and a high time preference.
(Note: I’m not suggesting that we should bet the farm on myopic cognition solving alignment, and I’m not suggesting that my critique of Eliezer’s point on corrigibility in this comment undermines the overall idea of his post that we’re in a very scary situation with AI x-risk. I agree with that and support spreading the word about it as he’s doing here, as well as working directly with leading AI labs to try and avoid catastrophe. I also support a number of other technical research directions including interpretability, and I’m open to whatever other strategic, technical and out-of-the-box proposals people have that they think could help.)
Well, the obvious #1 question: A myopic AGI is a weaker one, so what is the weaker pivotal act you mean to perform with this putative weaker AGI? A strange thing to omit from one’s discussion of machinery—the task that the machinery is to perform.
Myopia seems to me like a confused concept because the structure of the world is non-myopic, so to speak. If you myopically try to deal with rocks, you’ll for myopic reasons model a rock as a permanent object with a particular shape. But the rock also behaves as a permanent object over much longer time scales than your myopia. So you’ve in some important sense accessed time-scales much longer than your myopia. I think this happens at any level of a mind. If so, then minds with myopic goals are very similar to minds with non-myopic goals; so similar that they may be basically the same, because they’ll have non-myopic strategic components that exert their own non-myopic agency.
Here’s a related comment thread debating myopia. This one includes you (TekhneMakre), evhub, Eliezer and others. I’m reading it now to see if there are any cruxes that could help in our present discussion:
https://www.lesswrong.com/posts/5ciYedyQDDqAcrDLr/a-positive-case-for-how-we-might-succeed-at-prosaic-ai?commentId=st5tfgpwnhJrkHaWp
[Upvoted for looking over past stuff.] On reflection I’m not being that clear in this present thread, and am open to you making a considered counterargument / explanation and then me thinking that over for a longer amount of time to try writing a clearer response / change my mind / etc.
I suppose the point is that a myopic agent will accept/know that a rock will exist for long time-scales; it just won’t care.
There are plenty of smart but short-sighted people, so it’s not inconceivable.
I’m saying that, for the same reason that myopic agents think about rocks the same way non-myopic agents think about rocks, myopic agents will also care about long-term stuff the same way non-myopic agents do. The thinking needed to make cool stuff happen generalizes like the thinking needed to deal with rocks. So yeah, you can say “myopic agents by definition don’t care about long-term stuff”, but if by “care” you mean the thing that actually matters, the thing about causing stuff to happen, then you’ve swept basically the entire problem under the rug.
Why can’t myopic agents think about long-term stuff the same way as non-myopic agents, but still not care about it?
They *could*, but we don’t know how to separate caring from thinking, modeling, having effects; and the first 1000 programs that think about long term stuff that you find just by looking for programs that think about long term stuff, also care about long term stuff.
What you’re saying seems to contradict the orthogonality thesis. Intelligence level and goals are independent, or at least not tightly interdependent.
Let’s use the common example of a paperclip maximizer. Maximizing total long-term paperclips is a strange goal for an agent to have, but most people in AI alignment think it’s possible that an AI like this could in principle emerge from training (though we don’t know how to reliably train one on purpose).
Now why couldn’t an agent be motivated to maximize short-term paperclips? It wants more paperclips, but it will always take 1 paperclip now over 1 or even 10 or 100 a minute in the future. It wants paperclips ASAP. This is one contrived example of what a myopic AI might look like—a myopic paperclip maximizer.
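As a rough sanity check on how steep that time preference is (my own numbers, assuming simple per-second exponential discounting, which nothing in the thread specifies):

```python
# How steep must per-second exponential discounting be for "1 paperclip now"
# to beat "100 paperclips a minute from now"? Requirement: 1 > 100 * gamma**60.
threshold = 0.01 ** (1 / 60)  # gamma must be below ~0.9261 per second
print(f"need per-second gamma < {threshold:.4f}")

for gamma in (0.999, 0.95, 0.90):
    v_now, v_later = 1.0, 100.0 * gamma ** 60
    choice = "1 now" if v_now > v_later else "100 in a minute"
    print(f"gamma={gamma}: 1 now -> {v_now:.2f}, 100 later -> {v_later:.2f}, prefers {choice}")
```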
I don’t think we could train an AI to optimize for long-term paperclips. Maybe I’m not “most people in AI alignment”, but still, just saying.
I was trying to contrast the myopic paperclip maximizer idea with the classic paperclip maximizer. Perhaps “long-term” was a lousy choice of words. What would be better: simple paperclip maximizer, unconditional paperclip maximizer, or something else?
Update: On second thought, maybe what you were getting at is that it’s not clear how to deliberately train a paperclip maximizer in the current paradigm. If you tried, you’d likely end up with a mesa-optimizer pursuing some unpredictable proxy objective, like a deceptively aligned steel maximizer.
Yes, I’m saying that AIs are very likely to have (in a broad sense, including e.g. having subagents that have) long-term goals.
Now why couldn’t an agent be motivated to maximize short-term paperclips?
It *could*, but I’m saying that making an AI like that isn’t like choosing a loss function for training, because long-term thinking is convergent.
Your original comment said:
I can’t see anything unnatural about an agent that has both consequentialist reasoning capabilities and a high time preference.
This is what I’m arguing against. I’m saying it’s very unnatural. *Possible*, but very unnatural.
And:
This means that it would never sacrifice reward now for reward later, and so it would essentially be exempt from instrumental convergence.
This sounds like you’re saying that myopia *makes* there not be convergent instrumental goals (CIGs). I’m saying myopia basically *implies* there not being convergent instrumental goals, and therefore achieving myopia is at least as hard as making there not be CIGs.
most people in AI alignment think it’s possible that an AI could be trained to optimize for something like this.
I don’t think we have any idea how to do this. If we knew how to get an AGI system to reliably maximize the number of paperclips in the universe, that might be most of the (strawberry-grade) alignment problem solved right there.
You’re right, my mistake—of course we don’t know how to deliberately and reliably train a paperclip maximizer. I’ve updated the parent comment now to say:
most people in AI alignment think it’s possible that an AI like this could in principle emerge from training (though we don’t know how to reliably train one on purpose).
It feels like you are setting a discount rate higher than reality demands. A rationally intelligent agent should wind up with a discount rate that matches reality (e.g. in this case, probably the rate at which paper clips decay or the global real rate of interest).
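For scale, here’s a quick comparison under the same exponential-discounting assumption as the sketches above (the ~2%/year figure is my stand-in for “the global real rate of interest”, not a number from the thread):

```python
# A "reality-matched" discount rate of ~2%/year, expressed per second, is
# astronomically gentler than the gamma < ~0.926/second a myopic paperclip
# maximizer would need.
import math

annual_rate = 0.02                                   # assumed real rate
gamma_sec = math.exp(-annual_rate / (365 * 24 * 3600))

print(f"per-second discount factor: {gamma_sec:.12f}")
print(f"value retained after a minute: {gamma_sec ** 60:.8f}")
print(f"value retained after a year:   {gamma_sec ** (365 * 24 * 3600):.4f}")
# A minute out retains ~0.99999996 of full value and a year retains ~0.98,
# so an agent discounting at this rate behaves like a long-horizon maximizer
# for all practical purposes -- the opposite of the myopia proposed above.
```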