I like the main point; I hadn’t considered it before with value learning. Trying to ask myself why I haven’t been worried about this sort of failure mode before, I get the following:
It seems all of the harms to humans the value-learner causes come from some direct or indirect interaction with humans, so instead I want to imagine a pre-training step that learns as much about human values as it can from existing sources (internet, books, movies, etc.) without interacting with humans.
Then, as a second step, this value-learner is allowed to interact with humans in order to continue its learning.
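To make the two-step picture concrete, here is a minimal toy sketch in Python; the class, the tiny "corpus", and the stand-in human oracle are all illustrative placeholders of my own, not a real implementation:

```python
# Toy sketch of the two-stage setup: learn from fixed media first,
# and only afterwards query real humans. All names/numbers are made up.

class ValueLearner:
    def __init__(self):
        # Beliefs about human values, represented as a simple
        # (situation -> estimated human approval) mapping for illustration.
        self.value_estimates = {}

    def pretrain_on_static_sources(self, corpus):
        """Stage 1: learn from existing sources (internet, books, movies)
        with no interaction with humans."""
        for situation, approval in corpus:
            self.value_estimates[situation] = approval

    def interactive_learning(self, ask_human, situations):
        """Stage 2: only now query real humans to refine the estimates."""
        for situation in situations:
            self.value_estimates[situation] = ask_human(situation)


# Toy usage with a two-item "corpus" and a stand-in human oracle.
corpus = [("help someone who fell", 0.9), ("take credit for others' work", 0.1)]
learner = ValueLearner()
learner.pretrain_on_static_sources(corpus)
learner.interactive_learning(ask_human=lambda s: 0.95,
                             situations=["help someone who fell"])
```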
So your problem in this scenario comes down to: is the pre-training step insufficient to get the value-learner to avoid causing harms in the second step?
I’m not certain, but in this framing it at least seems more reasonable that it might not be sufficient. In particular, though, if the value-learner were sufficiently uncertain about things like harms (given the baseline grasp of human values from the pre-training step), it might be able to safely continue learning about human values.
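As a toy illustration of that idea (my own construction, with made-up numbers and action names): the learner only takes an action if every value hypothesis it still holds after pre-training agrees the action isn't harmful, and otherwise it defers and keeps learning:

```python
# Uncertainty-gated behaviour in stage two: act only when all remaining
# hypotheses about human values agree the action is safe; otherwise defer.

def choose(action_candidates, value_hypotheses, harm_threshold=0.0):
    """Return an action every hypothesis considers safe, or None (keep learning)."""
    for action in action_candidates:
        worst_case = min(v(action) for v in value_hypotheses)
        if worst_case > harm_threshold:
            return action
    return None  # too uncertain about harm -> defer to further value learning

# Hypotheses left plausible after pre-training on static media (toy numbers).
hypotheses = [
    lambda a: {"answer question": 0.8, "rewire power grid": -0.5}[a],
    lambda a: {"answer question": 0.6, "rewire power grid": 0.3}[a],
]
print(choose(["rewire power grid", "answer question"], hypotheses))
# -> "answer question": disagreement about the grid action keeps it off-limits.
```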
Right now I think I’d be 80/20 that a pre-training step that learned human values from existing media, without interacting with humans, would be sufficient to prevent significant harms during the second stage of value learning.
(This doesn’t rule out other superintelligence risks, etc.; it’s just a statement about the risks incurred during value learning that you list.)
Learning human values from, say, fixed texts is a good start, but it doesn’t solve the chicken-and-egg problem: we start by running a non-aligned AI that is learning human values, but we want the first AI to already be aligned.
One possible obstacle: a non-aligned AI could run away before it has finished learning human values from the texts.
The chicken-and-egg problem could presumably be solved by some iteration-and-distillation approach: first we give a very rough model of human values (or rules) to a limited AI, and later we increase its power and its access to real humans. But this suffers from all the difficulties of iteration-and-distillation, such as unexpected jumps in capabilities.
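A rough sketch of that loop, with stand-in functions and numbers (my own construction, not a real implementation):

```python
# Iteration-and-distillation sketch: start with crude rules and a weak system,
# then alternately refine the value model with human input and grant more
# capability. The risk the comment points at is a capability step that is
# unexpectedly large relative to how far the value model has come.

def iterate_and_distill(rough_rules, refine_with_humans, capability_steps):
    value_model = rough_rules
    capability = 1
    for step in capability_steps:
        # Distill: refine the value model at the current (limited) capability.
        value_model = refine_with_humans(value_model, capability)
        # Amplify: only then grant more power / more access to real humans.
        capability += step  # an unexpectedly large step here is the danger
    return value_model, capability

# Toy usage with placeholder rules and a placeholder refinement function.
final_model, final_cap = iterate_and_distill(
    rough_rules={"avoid obvious harm": True},
    refine_with_humans=lambda model, cap: {**model, f"refined_at_{cap}": True},
    capability_steps=[1, 1, 10],  # the last jump is the "unexpected" one
)
print(final_model, final_cap)
```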