One weird trick to turn maximisers into minimisers
A putative new idea for AI control; index here.
A simple and easy design for a u-maximising agent that turns into a u-minimising one.
Let X be a boolean random variable outside the agent’s control, whose value will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ϵ, and for a given utility u consider the utility:
u#=(2/ϵ)Xu−u.
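Before t, the expected value of (2/ϵ)X is 2, so (assuming X is independent of u) u# equals u in expectation. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0. As soon as the agent's credence in X=1 falls below ϵ/2, the expected value of (2/ϵ)X drops below 1 and the agent becomes a u-minimiser. Since the prior credence is only ϵ, a little bit of evidence for X=0 is enough.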
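To make the sign flip concrete, here is a minimal Python sketch. It assumes X is independent of u, so that the expectation of u# is just a multiple of the expectation of u. The function names `effective_weight` and `flip_threshold` and the value ϵ=0.01 are illustrative choices, not part of the original design.

```python
def effective_weight(p_x1: float, eps: float) -> float:
    """Multiplier on E[u] inside E[u#], where u# = (2/eps) X u - u.

    Assumes X is independent of u, so E[X u] = P(X=1) * E[u].
    p_x1 is the agent's current credence that X = 1.
    """
    return (2.0 / eps) * p_x1 - 1.0


def flip_threshold(eps: float) -> float:
    """Credence in X=1 at which the multiplier crosses zero."""
    return eps / 2.0


eps = 0.01  # illustrative prior P(X=1)

# Credences: the prior, a mildly updated credence, and certainty that X=0.
for p in (eps, 0.004, 0.0):
    w = effective_weight(p, eps)
    role = "u-maximiser" if w > 0 else "u-minimiser"
    print(f"P(X=1)={p}: weight on u = {w:+.3f} -> {role}")
```

A positive weight means the agent behaves as a u-maximiser and a negative weight means it behaves as a u-minimiser. The crossover sits at half the prior, which is why only a modest update towards X=0 is needed.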
This isn’t perfect corrigibility: the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:
u#=Ω(2/ϵ)Xu−u.
If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs to become a u-minimiser grows in proportion to Ω (its credence in X=1 must now fall below ϵ/(2Ω)), so X had better be a clear and convincing event.
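The trade-off with Ω can be sketched numerically. The Python below is only an illustration under stated assumptions: X is independent of u, evidence is summarised as a single Bayes factor in favour of X=0, and the helper names (`weight`, `posterior_x1`, `bayes_factor_needed`) and the sample values of ϵ and Ω are hypothetical. Under these assumptions the pre-t weight on u is 2Ω−1 (so Ω>1/2 is needed for a maximiser), and the Bayes factor needed to flip the sign is (2Ω−ϵ)/(1−ϵ), roughly 2Ω for small ϵ.

```python
def weight(p_x1: float, eps: float, omega: float) -> float:
    """Multiplier on E[u] inside E[u#], where u# = omega (2/eps) X u - u.

    Assumes X is independent of u; p_x1 is the credence that X = 1.
    """
    return omega * (2.0 / eps) * p_x1 - 1.0


def posterior_x1(prior_x1: float, bf_for_x0: float) -> float:
    """Credence in X=1 after evidence with Bayes factor bf_for_x0 favouring X=0."""
    odds = prior_x1 / (1.0 - prior_x1) / bf_for_x0
    return odds / (1.0 + odds)


def bayes_factor_needed(eps: float, omega: float) -> float:
    """Bayes factor for X=0 that brings the weight on u exactly to zero.

    Derived by solving weight(posterior_x1(eps, B), eps, omega) == 0 for B.
    """
    return (2.0 * omega - eps) / (1.0 - eps)


eps = 0.01  # illustrative prior P(X=1)
for omega in (1.0, 10.0, 100.0):
    b = bayes_factor_needed(eps, omega)
    print(f"omega={omega}: pre-t weight={weight(eps, eps, omega):.2f}, "
          f"Bayes factor needed to flip={b:.2f}")
```

Any evidence stronger than the reported Bayes factor pushes the weight negative, so the required strength of evidence scales roughly linearly with Ω.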