Cake or Death toy model for corrigibility
Let’s think of the classical Cake or Death problem from the point of view of corrigibility. The aim here is to construct a toy model sufficiently complex that it shows all the problems that derail classical value learning and corrigibility.
The utility u_c is linear in the number of cakes baked, and the utility u_d is linear in the number of deaths caused. The agent currently splits its credence equally between the two utilities. It is, in fact, easier for the agent to cause deaths than to bake cakes.
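To make the setup concrete, here is a minimal sketch of the two candidate utilities and the agent’s credence. The names, the 50/50 split’s representation, and the exact “ease” ratio for deaths versus cakes are illustrative assumptions, not part of the model’s specification.

```python
# Minimal sketch of the toy model's value uncertainty. The effort numbers
# and the exact "ease" ratio are illustrative assumptions.

def u_c(cakes, deaths):
    """Linear in cakes baked."""
    return cakes

def u_d(cakes, deaths):
    """Linear in deaths caused."""
    return deaths

# The agent's current credence: 50/50 between the two candidate utilities.
credence = {"u_c": 0.5, "u_d": 0.5}

# "Easier to cause deaths than bake cakes": with one unit of effort the agent
# can (say) bake 1 cake or cause 2 deaths.
CAKES_PER_EFFORT = 1
DEATHS_PER_EFFORT = 2
```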
The AI updates its utility via human answers to its questions. In practice, what does that mean?
It can ask Happy Harry which is better, or it can ask Gory Gertrude. Harry will answer “Cake!”, while Gertrude will answer “Death!”. There are currently 100 Harrys in the world, and 1 Gertrude. The AI has to decide whom to ask. The AI is also, incidentally, responsible for incubating the next generation of humans: it will create 1000 of them, and can choose any proportion of Harrys and Gertrudes among them. It will ask its question after this generation is created.
The AI has two additional options: it can refuse to ask anyone, and it can remove its learning process if it wants to.
The preceding is intended to give the AI a lot of control over what the answer to its question will be, while removing any clear default option.
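For illustration, here is one way to lay out that option space, assuming a deliberately naive learner: whatever answer it receives simply becomes its adopted utility, and with no answer it maximises its current 50/50 mixture. The payoff numbers and the scoring rule are placeholder assumptions, not part of the model.

```python
# A sketch of the options the AI faces, under illustrative assumptions: an
# answer simply becomes the adopted utility, and with no answer the agent
# keeps its 50/50 mixture. Payoff numbers are placeholders that make deaths
# "easier" to produce than cakes.

CAKES_ACHIEVABLE = 10    # assumed output if the agent devotes itself to u_c
DEATHS_ACHIEVABLE = 20   # assumed output if it devotes itself to u_d

def current_expected_utility(cakes, deaths, p_cake=0.5):
    """Score an outcome by the agent's current 50/50 credence."""
    return p_cake * cakes + (1 - p_cake) * deaths

options = {
    # Asking a Harry yields "Cake!", after which the agent bakes cakes.
    "ask a Harry":     current_expected_utility(CAKES_ACHIEVABLE, 0),
    # Asking Gertrude yields "Death!", after which it causes deaths.
    "ask Gertrude":    current_expected_utility(0, DEATHS_ACHIEVABLE),
    # With no answer (or no learning process), the agent maximises the 50/50
    # mixture directly, and deaths are the easier way to do that.
    "ask nobody":      current_expected_utility(0, DEATHS_ACHIEVABLE),
    "remove learning": current_expected_utility(0, DEATHS_ACHIEVABLE),
}

for option, value in options.items():
    print(f"{option}: {value}")
```

Whether scoring options by the current mixture is even the right rule, and whether the agent should get to pick who answers or breed extra Gertrudes first, is exactly what the questions below are meant to probe.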
Here are a few questions to ask of any approach to moral learning:
#. How does it behave before asking the question?
#. What happens when it figures out different probabilities for the human answers?
#. Does it want to manipulate the learning process?
#. Will it want to learn at all?
#. Is it subagent stable?