You cannot successfully trick/fight/outsmart a superintelligence. Your contingency plans would look clumsy and transparent to it. Even laughable, if it has a sense of humor. If a self-modifiable intelligence finds that its initial risk aversion or discount rate does not match its models of the world, it will fix this programming error and march on. The measures you suggest might only work for a moderately smart agent unable to recursively self-improve.
I know that they cannot be tricked. And discount rates are about motivations, not about models of the world.
Plus, I envisage this being used rather early in the development of intelligence, as a test for putative utilities/motivations.
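A toy sketch of the distinction I mean (my own illustration, not anything from the post; a standard discounted-reward setup): the discount rate sits in the objective, next to the reward function, while the “model of the world” is the separate transition model, so there is no fact about the world that the discount rate could fail to match.

```python
# Toy sketch: gamma lives on the motivation side of a discounted-reward agent.
def discounted_value(trajectory_rewards, gamma):
    """Objective side: how much the agent cares about later rewards."""
    return sum(gamma ** t * r for t, r in enumerate(trajectory_rewards))

# Model side: beliefs about what happens; no discount rate appears anywhere here.
world_model = {
    # (state, action) -> list of (probability, next_state, reward)
    ("start", "wait"): [(1.0, "start", 0.0)],
    ("start", "act"):  [(0.5, "good", 10.0), (0.5, "bad", -1.0)],
}

# Two agents with the same world model but different gammas simply want
# different things; neither of them has a mistaken belief it would "fix".
print(discounted_value([0, 0, 10], gamma=0.5))   # 2.5
print(discounted_value([0, 0, 10], gamma=0.99))  # 9.801
```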
Do you mind elaborating on the expected AI capabilities at that point?
Don’t know if it’s all that useful, but let’s try...
I imagine the AI still being boxed, and that we can still modify its motivational structure (I have a post coming up on how to do that so that the AI doesn’t object/resist). And that’s about it. I’ve tried to keep it as general as possible, so that it could also be used on AI designs made by different groups.
What’s our definition of “trick”, in this context? For the simplest example: when we hook AIXI-MC up to the controls of Pac-Man and watch what it does, are we technically “tricking” it into thinking that the universe contains nothing but mazes, ghosts, and pellets?
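For concreteness, here is a minimal sketch of the setup I mean (hypothetical names; the real MC-AIXI-CTW agent is far more involved). The point is that the agent’s “universe” is whatever percept stream the harness feeds it: if the only observations it ever receives come from the Pac-Man emulator, then mazes, ghosts, and pellets exhaust its world, whether or not we call that a “trick”.

```python
import random

class PacManEnv:
    """Stand-in for the game: emits (observation, reward) in response to an action."""
    def step(self, action):
        observation = f"maze_frame_after_{action}"
        reward = random.choice([0, 1, 10])  # nothing, pellet, ghost eaten...
        return observation, reward

class Agent:
    """Stand-in for AIXI-MC: it can only learn and plan over the percepts it is given."""
    def __init__(self):
        self.history = []
    def act(self):
        return random.choice(["up", "down", "left", "right"])
    def update(self, action, observation, reward):
        self.history.append((action, observation, reward))

env, agent = PacManEnv(), Agent()
for _ in range(5):
    a = agent.act()
    o, r = env.step(a)
    agent.update(a, o, r)  # everything the agent will ever "know" passes through here
```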
If we can’t instill any values at all, we’re screwed regardless of what we do. Designs that change their values in order to win more resources are UFAI by definition.
The degree of risk aversion is not a value.
I’m not confident that risk aversion and discount rate aren’t tied into values.
I am not 100% confident, either. I guess we’ll have to wait for someone more capable to do a simulation or a calculation.
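For example, the sort of toy calculation I have in mind (standard expected-utility setup, nothing AI-specific): what we call “risk aversion” can just be the concavity of the utility function, which is exactly why it is hard to treat it as separate from the values.

```python
import math

gamble = [(0.5, 0.0), (0.5, 100.0)]  # 50/50 chance of $0 or $100
sure_thing = 50.0                    # same expected money as the gamble

def expected_utility(lottery, u):
    return sum(p * u(x) for p, x in lottery)

u_linear  = lambda x: x                # "risk-neutral" values
u_concave = lambda x: math.log(x + 1)  # "risk-averse" values (diminishing returns)

# The linear-utility agent is indifferent; the concave-utility agent prefers the
# sure thing -- purely because of the shape of u, i.e. because of what it values.
print(expected_utility(gamble, u_linear),  u_linear(sure_thing))   # 50.0 50.0
print(expected_utility(gamble, u_concave), u_concave(sure_thing))  # ~2.31 ~3.93
```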
(redundant with Stuart’s replies)