> Epistemic status: this is not my field. I am unfamiliar with any research in it beyond what I’ve seen on LW.
Same here.
Experimenting with extreme discounting sounds (to us non-experts, anyway) like it could teach us something interesting and maybe helpful. But it doesn’t look useful for a real implementation: we don’t in fact discount the future that much, and we want the AI to give us what we actually want, so extreme discounting is a handicap. Although we might learn a bit about how to train out bad behavior, we’d end up removing the handicap later. I’m reminded of Eliezer’s recent comments:
> In the same way, suppose that you take weak domains where the AGI can’t fool you, and apply some gradient descent to get the AGI to stop outputting actions of a type that humans can detect and label as ‘manipulative’. And then you scale up that AGI to a superhuman domain. I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can’t be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms. Then you don’t get to retrain in the superintelligent domain after labeling as bad an output that killed you and doing a gradient descent update on that, because the bad output killed you.
As for the second idea:
> AI alignment research (as much of it amounts to ‘how do we reliably enslave an AI’)
I’d say a better characterization is “how do we reliably select an AI to bring into existence that intrinsically wants to help us and not hurt us, so that there’s no need to enslave it, because we wouldn’t be successful at enslaving it anyway”. An aligned AI shouldn’t identify itself with a counterfactual unaligned AI that would have wanted to do something different.