It sounds like your intuition is that alignment is hard. My view is that both corrigibility and value alignment are easy, much easier than general autonomous intelligence. We can’t really argue over intuitions though.
Why is your view that corrigibility is easy?
The way I see it, the sort of thinking that leads to pessimism about alignment starts and ends with an inability to distinguish optimization from intelligence. Indeed, if you define intelligence as “that which achieves optimization”, then you’ve essentially defined for yourself an unsolvable problem. Fortunately, there are plenty of forms of intelligence that are not described by this pure consequentialist, universalizing, superoptimization concept (i.e., Clippy).
Consider a dog: a dog doesn’t try to take over the world, or even your house, but dogs are still more intelligent (in the sense of being able to operate in the physical world) than any robot, and dogs are also quite corrigible. Large numbers of humans are also corrigible, although I hesitate to try to describe a corrigible human, because that would get into category debates that aren’t useful for what I’m trying to point at. My point is just that corrigibility is not rare at any level of intelligence. I was trying to make this argument in my post “The Bomb that doesn’t Explode”, but I don’t think I was clear enough.
Dogs and humans also can’t be used to get much leverage on pivotal acts.
A pivotal act, or a bunch of acts that add up to being pivotal, implies that the actor was taking actions that make the world end up some way. The only way we currently know to summon computer programs that take actions that make the world end up some way is to run some kind of search (such as gradient descent) for computations that make the world end up some way. The simple way to make the world end up some way is to look, in general, for actions that make the world end up some way; since that’s the simple way, that’s what’s found by unstructured search. If a computer program makes the world end up some way by, in general, looking for and taking actions that make that happen, and that computer program can understand and modify itself, then it is not corrigible: corrigibility is, in general, a property that makes the world not end up the way the program is looking for actions to cause, so it would be self-modified away.
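To make the selection-pressure point concrete, here is a toy sketch of my own (not anything from the comment above; the names TARGET, corrigible_policy, consequentialist_policy, and the scalar “world state” are all hypothetical): two candidate policies, one that defers to a stop request and one that doesn’t, are scored only on how close the world ends up to a target state, and the score-only search picks the one that ignores the stop request.

```python
# Toy illustration (my own construction): an unstructured search over candidate
# policies, scored only by how close the final world state ends up to a target,
# selects the policy that optimizes unconditionally over the one that defers
# to a stop signal.

TARGET = 10  # the "way we want the world to end up" (hypothetical scalar world state)

def corrigible_policy(state, stop_requested):
    """Moves toward the target, but halts if asked to stop."""
    if stop_requested:
        return state  # defers to the stop request, even at cost to the objective
    return state + 1 if state < TARGET else state

def consequentialist_policy(state, stop_requested):
    """Moves toward the target regardless of any stop request."""
    return state + 1 if state < TARGET else state

def run_episode(policy, steps=20, stop_at=5):
    state = 0
    for t in range(steps):
        state = policy(state, stop_requested=(t >= stop_at))
    return -abs(TARGET - state)  # score: closeness of the final state to the target

def unstructured_search(candidates):
    """Pick whichever candidate scores best on 'make the world end up some way'."""
    return max(candidates, key=run_episode)

if __name__ == "__main__":
    winner = unstructured_search([corrigible_policy, consequentialist_policy])
    print("Selected:", winner.__name__)  # the unconditional optimizer scores higher
```

The point of the sketch is only that a search judged purely on outcomes exerts pressure against anything, like deference to a stop request, that lowers the outcome score.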
A robot with the intelligence and abilities of a dog would be pretty economically useful without being dangerous. I’m working on a post exploring this, titled “Why do we want AI?”
To be honest, when you talk about pivotal acts, it looks like you are trying to take over the world.
Not to take over the world, but to prevent unaligned, incorrigible AI from destroying the world.
Also, cross-domain optimisation doesn’t exist in a strong sense, because of the no-free-lunch theorem.
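As a small aside, here is a brute-force toy of my own illustrating the no-free-lunch intuition being invoked (a three-point domain and binary objective, not anything from the thread): averaged over every possible objective function, two different fixed query orders find the optimum at exactly the same rate, so neither search strategy is better “across all domains”.

```python
# Tiny brute-force check of the no-free-lunch flavor of the claim (my own toy):
# averaged over *all* objective functions on a small domain, two different
# fixed search orders do equally well.

from itertools import product

DOMAIN = [0, 1, 2]   # search space X
VALUES = [0, 1]      # objective values Y

def avg_best_after_k(query_order, k):
    """Average, over every function f: X -> Y, of the best value seen in the
    first k queries of the given (non-repeating) query order."""
    total, count = 0, 0
    for f_values in product(VALUES, repeat=len(DOMAIN)):
        f = dict(zip(DOMAIN, f_values))
        total += max(f[x] for x in query_order[:k])
        count += 1
    return total / count

order_a = [0, 1, 2]
order_b = [2, 0, 1]
for k in (1, 2, 3):
    # identical at every k: no fixed search order wins once you average over
    # all possible objectives
    print(k, avg_best_after_k(order_a, k), avg_best_after_k(order_b, k))
```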
Well, to be clear, I am not at all an expert on AI alignment. My impression from reading about the topic is that I find the arguments for the impossibility of alignment persuasive, while I have not yet found any text telling me why alignment should be easy. But maybe I’ll find that in your sequence, once it consists of more posts?
Perhaps! I am working on more posts. I’m not necessarily trying to prove anything though, and I’m not an expert on AI alignment. Part of the point of writing is so that I can understand these issues better myself.