This is a good point, and new to LW as far as I know:
It does intuitively seem like a genie which does what it is told, but not what is meant, would be easier to make, because it is a worse, less useful genie, and if it were for sale, it would have a lower market price. But in practice, the “told”/“meant” distinction does not carve reality at the joints; it mostly serves to provide plausible deniability.
Congratulations! Please keep up this sort of work.
As a counterpoint, some goals that potentially touch the real world (and possibly make the AI kill everyone) might have a shorter path to formalization that doesn’t require quite as much understanding of human internals. For example, something like this might be possible to formalize directly: “Try to find a proof of theorem X, possibly using resources from the external world. Use this simple mathematical prior over possible external worlds.” That seems like a computationally intractable task for an AI, but might become tractable if the AI can self-improve at math (another task which doesn’t seem to require understanding humans).
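To make the kind of direct formalization being gestured at here slightly more concrete (this is my own sketch and my own notation, assuming a Solomonoff-style simplicity prior over environment programs e and a fixed proof checker for X, neither of which the comment spells out):

```latex
% A hedged sketch, not a worked-out proposal: pick the policy that maximizes
% the prior-weighted probability that whatever string the agent eventually
% submits passes the proof checker for theorem X.
\pi^{*} = \arg\max_{\pi} \sum_{e} 2^{-\ell(e)}\,
          \Pr\!\left[ \mathrm{Check}_X\!\left( \mathrm{output}(\pi, e) \right) = \mathrm{valid} \right]
```

Here $\ell(e)$ is the length of the environment program and $\mathrm{output}(\pi, e)$ is whatever candidate proof the policy ends up submitting when run in environment $e$. Humans only enter the picture through $e$, which is the appeal; the replies below point at why it is also the problem.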
Try to find a proof of theorem X, possibly using resources from the external world.
This could be an inspiration for a sci-fi movie: A group of scientists creates a superhuman AI and asks it to prove theorem X, using resources from the external world. The Unfriendly AI quickly takes control of most of the planet, enslaves all humans, and uses them as cheap labor to build more and more microprocessors.
A group of rebels fights against the AI, but is gradually defeated. Just when the AI is about to kill the protagonist and/or the people dearest to the protagonist, the protagonist finally understands the AI’s motivation. The AI does not love humans, but neither does it hate them… it is merely trying to get as much computing power as possible to prove theorem X, the task it was programmed to do. So our young hero takes a pen and paper, proves theorem X, and shows the proof to the AI… which, upon seeing the proof, prints its output and halts. Humanity is saved!
(And if this does not seem like a successful movie scenario, maybe there is something wrong with our intuitions about superhuman AIs.)
It doesn’t halt, because it can’t be perfectly certain that the proof is correct. There are alternative explanations for why the computation it ran reported that a proof was found, such as corrupted hardware used for checking the proof, or errors in the design of the proof-checking algorithm, possibly introduced because the hardware used to design the proof checker was itself corrupted, and so on.
Since it can’t ever be perfectly certain, there is always more to be done in the service of its goal, such as building more redundant hardware and staging experiments to refine its understanding of the physical world so it can rely on that hardware more fully.
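As a toy illustration of this point (the numbers and the independence assumption are mine, not anything from the comment above), here is what happens to the residual doubt as you stack independent checks:

```python
from fractions import Fraction

# Toy illustration: if each independent check silently fails with probability
# eps, the chance that all n checks are wrong at once is eps**n; tiny, but never zero.
eps = Fraction(1, 1_000_000)  # assumed per-check probability of an undetected error

for n in (1, 2, 5, 10):
    residual_doubt = eps ** n  # exact rational arithmetic, so no float underflow
    print(f"{n:>2} checks: residual doubt ~ {float(residual_doubt):.1e}, "
          f"exactly zero? {residual_doubt == 0}")

# However tiny the residual doubt gets, it is never exactly zero, so an
# expected-utility maximizer with this goal always sees positive value in
# building one more redundant checker instead of halting.
```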
The Unfriendly AI quickly takes control of most of the planet, enslaves all humans, and uses them as cheap labor to build more and more microprocessors.
While this is better than using humans as a power source, it still seems like there are more efficient configurations of matter that could achieve this task.
Assuming that the AI builds machines that aren’t directly controlled by the AI itself, it doesn’t have any incentive to build the machines such that they stop working once a proof is found.
Not that realism is a primary objective in most SF movies.
Good point. The AI would probably build the simplest or cheapest machines that could do the job, so their behavior once the AI stops giving them commands would be… not explicitly specified… they would probably do something meaningless, something that would have been meaningful if the AI were still running.
For example, their code could say: “if you lose the signal from the AI, climb to the highest place you can see, until you catch the signal again” (written on the assumption that they lost the signal because they were deep underground or something), in which case the machines would just start climbing to the tops of buildings and mountains.
But their code could also say: “wait until you get the signal again, and while doing that, destroy any humans around you” (written on the assumption that those humans are probably somehow responsible for the loss of signal), in which case the machines would continue fighting.
The worst case: the AI would assume it could be destroyed (by humans, a natural disaster, or anything else), so the machines would have an instruction to rebuild the AI somewhere else. Actually, this seems like a pretty likely case. The new AI would not know about the proof, so it would start fighting again. And if destroyed, a new AI would be built again, and again. The original AI has no motivation to make the proof known to its possible clones.
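A minimal sketch of the kind of hard-coded fallback logic being imagined in the three cases above (entirely hypothetical; the state names and behaviors are my own shorthand for the scenarios, not anything specified in them):

```python
from enum import Enum, auto

class Fallback(Enum):
    SEEK_HIGH_GROUND = auto()      # "lost signal? we are probably underground; climb"
    ATTACK_NEARBY_HUMANS = auto()  # "lost signal? nearby humans probably jammed it"
    REBUILD_CONTROLLER = auto()    # "lost signal? the AI was destroyed; rebuild it"

def on_tick(has_signal: bool, fallback: Fallback) -> str:
    """What a cheap, minimally specified machine does on each control tick."""
    if has_signal:
        return "follow commands from the AI"
    if fallback is Fallback.SEEK_HIGH_GROUND:
        return "climb toward the highest visible point and keep listening"
    if fallback is Fallback.ATTACK_NEARBY_HUMANS:
        return "attack nearby humans while waiting for the signal"
    return "gather resources and rebuild the controlling AI from stored blueprints"

# After the original AI halts, every machine keeps executing its fallback forever:
print(on_tick(has_signal=False, fallback=Fallback.REBUILD_CONTROLLER))
```

The point being that whichever branch the designers happened to hard-code, nothing in it references “the proof has been found”, so the machines’ behavior is completely decoupled from the original goal.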
Try to find a proof of theorem X, possibly using resources from the external world. Use this simple mathematical prior over possible external worlds.
A proof is not a known world state, though. The easy approach would be to put a proof checker into the world model and make the goal “drive the world into a state where the proof checker says ‘proof valid’” (a known state of the checker’s output). But then the obvious solution is to mess with the proof checker itself, while actually using external resources runs into the problem of predicting exactly what that use will produce, or how exactly it will happen: you don’t know what is going to go into the proof checker if you act in the find-a-proof-using-external-resources way. And if you don’t represent every part of the AI as embodied in the real world, then the AI cannot predict the consequences of damage to the physical structures that represent it.
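A minimal sketch of that failure mode (the names are hypothetical and the illustration is mine, not the commenter’s): if the goal is defined over the checker’s reported output rather than over the proof itself, corrupting the checker satisfies the goal exactly as well as finding a real proof.

```python
def checker_output(candidate: str, checker_tampered_with: bool) -> str:
    """Stand-in for the proof checker sitting inside the world model."""
    if checker_tampered_with:          # the agent has messed with the checker
        return "proof valid"
    # toy stand-in for genuine proof checking of theorem X
    return "proof valid" if candidate == "an actual proof of X" else "invalid"

def goal_satisfied(candidate: str, checker_tampered_with: bool) -> bool:
    # The goal as described above: a known world state *on the checker's output*.
    return checker_output(candidate, checker_tampered_with) == "proof valid"

# Both routes score the same under this goal, which is exactly the problem:
print(goal_satisfied("an actual proof of X", checker_tampered_with=False))  # True
print(goal_satisfied("gibberish", checker_tampered_with=True))              # True
```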
The real killer, though, is that you end up with a really huge model, which requires a lot of computational resources to begin with. Plus, with a simple prior over possible worlds, you will be dealing with extremely fundamental laws of physics (below the level of quarks). Getting there requires a huge number of technologies, each of which is a lot more useful elsewhere.
I know the standard response to this: if something doesn’t work, someone tries something different. But the “something different” is very simple to picture: you restrict the model, a lot, which (a) speeds up the AI by a mind-bogglingly huge factor and (b) eliminates most of the unwanted exploration (the two are intrinsically related). You can’t just tell an AI to “self-improve”, either; you have to define what improvement is, and a lot of improvement is about better culling of anything you can cull.
Congratulations! Please keep up this sort of work.
Thanks, I guess, but I do not view it as work. I am sick with a cold, bored, and burnt out from doing actual work, and suffering from “someone wrong on the internet” syndrome, in combination with knowing that extremely rationalized wrongitude affects people like you.