And if humans had a utility function and we knew what that utility function was, we would not need CEV. Unfortunately extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated. There is ultimately a rescuable truth about what we want, and CEV is my lengthy informal attempt at stating what that even is; but I would assess it as much, much, much more difficult than ‘corrigibility’ to train into a dangerously intelligent system using only training and data from safe levels of intelligence. (As is the central lethally difficult challenge of AGI alignment.)
If we were paperclip maximizers and knew what paperclips were, then yes, it would be easier to just build an offshoot paperclip maximizer.
I agree that it’s a tricky problem, but I think it’s probably tractable. The way PreDCA tries to deal with these difficulties is:
The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
Some of those programs are “agentic” programs.
Agentic programs have approximately well-defined utility functions.
Disassembling the humans doesn’t change anything, since it doesn’t affect the programs that were already running[1] before the AI was turned on.
Since we’re looking at agent-programs rather than specific agent-actions, there is much more ground for inference about novel situations.
Obviously, the concepts I’m using here (e.g. which programs are “running” or which programs are “agentic”) are non-trivial to define, but infra-Bayesian physicalism does allow us to define them (not without some caveats, but hopefully at least to a first approximation).
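To make the shape of that pipeline concrete, here is a toy, purely illustrative sketch in Python. None of the names below come from PreDCA itself: `CandidateProgram`, `agency_score`, `could_have_prevented_ai`, and `infer_utility` are hypothetical stand-ins for the infra-Bayesian-physicalist definitions, which are far heavier mathematically than anything representable this way. The only point is the data flow of the steps listed above.

```python
# Toy, purely illustrative sketch of the pipeline described in the list above.
# Every name and scoring rule here is a hypothetical placeholder, NOT the actual
# infra-Bayesian-physicalist machinery.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateProgram:
    """A hypothesis about a computation the universe was running before the AI started."""
    name: str
    policy: Callable[[str], str]      # the program's observed input-output behaviour
    agency_score: float               # placeholder for an agency / intelligence measure
    could_have_prevented_ai: bool     # placeholder for the "precursor" counterfactual


def infer_utility(program: CandidateProgram) -> Callable[[str], float]:
    """Placeholder for utility inference: the real version asks which utility function
    best explains the program's policy, given a prior over policies."""
    return lambda outcome: float(program.policy("rate:" + outcome) == "good")


def predca_sketch(candidates: List[CandidateProgram],
                  agency_threshold: float = 0.5) -> Callable[[str], float]:
    """Detect agentic precursor programs and aggregate their inferred utilities."""
    # Keep only programs that look sufficiently agentic.
    agents = [p for p in candidates if p.agency_score >= agency_threshold]
    # Keep only agents that could have prevented the AI from being turned on (precursors).
    precursors = [p for p in agents if p.could_have_prevented_ai]
    # Infer an (approximate) utility function for each precursor.
    utilities = [infer_utility(p) for p in precursors]
    # Aggregate the inferred utilities (a plain average here; the real protocol is subtler).
    return lambda outcome: sum(u(outcome) for u in utilities) / max(len(utilities), 1)
```

Note that everything being filtered and scored is a program that was already running before the AI existed, which is why, on this picture, disassembling the humans afterwards does not change the inputs to the procedure.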
[1] More precisely, I am looking at agents which could prevent the AI from being turned on; this is what I call “precursors”.
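Restated symbolically, this is only a hedged paraphrase of the sentence above, with Θ, σ_H, and “running” standing in for their formal infra-Bayesian-physicalist counterparts:

```latex
% Hedged paraphrase of the footnote, not the exact infra-Bayesian-physicalist definition.
\[
  H \text{ is a precursor of the AI } G
  \;\iff\;
  \exists\, \sigma_H \ \text{(a counterfactual policy for } H\text{)}
  \ \text{such that, under } \sigma_H,\ G \text{ is not running in the universe } \Theta .
\]
```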