Yes, I like this one. We don’t want the AI to find a way to give itself utility while making things worse for us. And if we are trying to make things better for us, we don’t want the AI to resist us.
Do you want to find out what these inequalities imply about the utility functions? Can you find examples where your condition is true for non-identical functions?
I don’t have a specific example right now, but some things come to mind:
Both utility functions ultimately depend in some way on a subset of background conditions, i.e. the world state.
The world state influences the utility functions through latent variables in the agents’ world models, to which it is an input.
U_A changes only when M_A (A’s world model) changes, which is ultimately caused by new observations, i.e. changes in the world state (let’s assume that both A and B perceive the world quite accurately).
If U_B doesn’t decrease whenever U_A changes, then whatever change in the world increased U_A, B at least doesn’t mind it. This is problematic when A and B need the same scarce resources (instrumental convergence etc.). The condition could be satisfied if they were both satisficers or bounded agents inhabiting significantly disjoint niches.
A robust solution seems to be to make (super accurately modeled) U_B a major input to U_A.
Let’s say that
U_A = 3x + y
Then (I think) for your inequality to hold, it must be that
U_B = f(3x+y), where f’ >= 0
If U_B cares about x and y in any other proportion, then B can make trade-offs between x and y which make things better for B, but worse for A.
This will be true (in theory) even if both A and B are satisficers. You can see this by replacing x and y with sigmoids of some other variables.
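To make the trade-off argument concrete, here is a minimal Python sketch. It assumes the condition means that no change in the world state may raise one agent's utility while lowering the other's (that is my reading; the inequality itself isn't restated here). The names u_b_aligned, u_b_other and conflicts, and the small integer grid, are hypothetical, and min(s, 4) is just one arbitrary choice of a non-decreasing, saturating f.

```python
import itertools


def u_a(x, y):
    # A's utility from the example above: U_A = 3x + y.
    return 3 * x + y


def u_b_aligned(x, y):
    # U_B = f(3x + y) with f(s) = min(s, 4): non-decreasing (f' >= 0) and
    # saturating, loosely like a satisficer. The cap of 4 is arbitrary.
    return min(3 * x + y, 4)


def u_b_other(x, y):
    # Hypothetical U_B that weighs x and y in a different proportion.
    return x + 3 * y


def conflicts(u_first, u_second, before, after):
    # True if the move raises one utility while strictly lowering the other.
    d_first = u_first(*after) - u_first(*before)
    d_second = u_second(*after) - u_second(*before)
    return d_first * d_second < 0


# Every ordered pair of distinct states on a small integer grid is a "move".
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
moves = list(itertools.permutations(grid, 2))

aligned_conflicts = [m for m in moves if conflicts(u_a, u_b_aligned, *m)]
other_conflicts = [m for m in moves if conflicts(u_a, u_b_other, *m)]

print("conflicting moves with U_B = f(3x + y):", len(aligned_conflicts))
print("conflicting moves with U_B = x + 3y:", len(other_conflicts))
```

On this grid, aligned_conflicts is empty, while other_conflicts contains moves such as (0, 0) to (1, -2), where U_A rises by 1 and x + 3y falls by 5. A grid check like this only illustrates the claim; it is not a proof.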