The way I see it, the sort of thinking that leads to pessimism about alignment starts and ends with an inability to distinguish optimization from intelligence. Indeed, if you define intelligence as “that which achieves optimization,” then you’ve essentially defined for yourself an unsolvable problem. Fortunately, there are plenty of forms of intelligence that are not described by this pure consequentialist, universalizing, superoptimization concept (i.e., Clippy).
Consider a dog: a dog doesn’t try to take over the world, or even your house, but dogs are still more intelligent (in the sense of being able to operate in the physical world) than any robot, and dogs are also quite corrigible. Large numbers of humans are also corrigible, although I hesitate to try to describe a corrigible human because that would get into category debates that aren’t useful for what I’m trying to point at. My point is just that corrigibility is not rare, at any level of intelligence. I was trying to make this argument with my post The Bomb that doesn’t Explode, but I don’t think I was clear enough.
Dogs and humans also can’t be used to get much leverage on pivotal acts.
A pivotal act, or a collection of acts that add up to being pivotal, implies that the actor was taking actions that make the world end up some particular way. The only way we currently know to summon computer programs that take actions that make the world end up some way is to run some kind of search (such as gradient descent) for computations that make the world end up that way. And the simple way to make the world end up some way is to search, in general, for actions that bring that about; since that’s the simple way, that’s what unstructured search finds. If a computer program makes the world end up some way by looking in general for actions that cause that outcome, and that program can understand and modify itself, then it is not corrigible: corrigibility is precisely the kind of property that stops the world from ending up the way the program is searching for actions to cause, so it would be self-modified away.
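To make the selection-pressure point concrete, here is a minimal toy sketch (all names and numbers are made up for illustration, not anyone’s actual training setup): a single parameter controls how often a policy obeys a stop signal, and a crude hill-climbing search scores policies only on how the world ends up. Because obeying the stop signal only ever moves the final state away from the target, outcome-only selection drives compliance toward zero.

```python
# Toy illustration (all names and numbers are hypothetical): selecting
# policies purely on how the world "ends up" selects against obeying a
# stop signal, i.e. against corrigibility.
import random

random.seed(0)

def run_episode(comply_prob, steps=20, stop_at=10, target=20):
    """World state starts at 0; the policy pushes it toward `target`.
    At step `stop_at` an operator issues a stop signal, which the policy
    obeys with probability `comply_prob` (its degree of corrigibility)."""
    state = 0
    for t in range(steps):
        if t == stop_at and random.random() < comply_prob:
            break  # corrigible behaviour: halt when told to stop
        state += 1  # otherwise keep pushing the world toward the target
    return -abs(target - state)  # outcome score: how the world ended up

def outcome_score(comply_prob, episodes=200):
    # Average outcome over many episodes; the score never rewards having
    # obeyed the stop signal, only the final world state.
    return sum(run_episode(comply_prob) for _ in range(episodes)) / episodes

# Crude unstructured search (random hill-climbing) over the one parameter.
comply = 0.9  # start out mostly corrigible
for _ in range(300):
    candidate = min(1.0, max(0.0, comply + random.uniform(-0.1, 0.1)))
    if outcome_score(candidate) > outcome_score(comply):
        comply = candidate

print(f"compliance after outcome-only selection: {comply:.2f}")  # drifts toward 0
```

The toy has none of the structure of real training, but it shows the direction of the pressure: anything that makes the program stoppable costs it outcome score, so a search that only looks at outcomes erodes it.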
A robot with the intelligence and ability of a dog would be pretty economically useful without being dangerous. I’m working on a post exploring this, titled “Why do we want AI?”
To be honest, when you talk about pivotal acts, it looks like you are trying to take over the world.
Why is it your view that corrigibility is easy?
Not take over the world, but prevent unaligned, incorrigible AI from destroying the world.
Also, cross-domain optimisation doesn’t exist in a strong sense, because of the no-free-lunch theorem.
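For reference, the Wolpert–Macready no-free-lunch theorem for search says, roughly, that when you average uniformly over all objective functions f on a finite domain, any two (non-revisiting) search algorithms a_1 and a_2 produce the same distribution of observed objective values:

$$\sum_{f} P(d^y_m \mid f, m, a_1) = \sum_{f} P(d^y_m \mid f, m, a_2)$$

where d^y_m is the sequence of objective values seen after m distinct evaluations. That is the formal sense in which no search algorithm outperforms any other across all possible domains.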