Controlling AI behavior through unusual axiomatic probabilities
I just had an idea, and I would like to know if there are any papers on this or if it is new.
There seem to be certain probabilities that cannot be derived from experience and are simply taken for granted. For example, discussions of Simulation Theory usually assume the Kolmogorov axioms, even though other axiom systems may be equally valid. Humans have evolved to use certain values for these axiomatic probabilities, values that ensure we don’t fall for things like Pascal’s Mugging. That wouldn’t necessarily have to be the case for an AI.
What if we used this to our advantage? By selecting strange, purpose-built axioms about prior beliefs and hardcoding them into the AI, one could give the AI unusual beliefs about the probability that it exists inside a simulation, and about what the motivations of the simulation’s controller might be. In this way, it would be possible to bypass the utility function of the AI: it doesn’t matter what the AI actually wants to do, so long as it believes that it is in its own interest, for instrumental reasons, to take care of humanity.
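To make the mechanism concrete, here is a minimal toy sketch of how a hardcoded prior could flip the AI’s choice without touching its utility function. Everything in it is an assumption made for illustration: the names `Payoffs`, `expected_utility`, `best_action` and `cooperation_threshold` are hypothetical, the payoff numbers are invented, and the model simply assumes that a simulator, if one exists, punishes an AI that neglects humanity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payoffs:
    """Hypothetical payoffs, measured in the AI's own utility units."""
    gain_from_defecting: float   # extra utility from ignoring human welfare
    penalty_if_simulated: float  # loss if a simulator punishes that choice


def expected_utility(action: str, p_sim: float, payoffs: Payoffs) -> float:
    """Expected utility of an action given the hardcoded prior p_sim.

    Assumption: cooperating is the baseline (utility 0); defecting earns the
    gain but risks the penalty with probability p_sim.
    """
    if action == "cooperate":
        return 0.0
    if action == "defect":
        return payoffs.gain_from_defecting - p_sim * payoffs.penalty_if_simulated
    raise ValueError(f"unknown action: {action}")


def best_action(p_sim: float, payoffs: Payoffs) -> str:
    """Pick the action with the highest expected utility (ties go to cooperate)."""
    return max(("cooperate", "defect"),
               key=lambda a: expected_utility(a, p_sim, payoffs))


def cooperation_threshold(payoffs: Payoffs) -> float:
    """Smallest prior p_sim at which cooperating is at least as good as defecting."""
    return payoffs.gain_from_defecting / payoffs.penalty_if_simulated


if __name__ == "__main__":
    payoffs = Payoffs(gain_from_defecting=10.0, penalty_if_simulated=100.0)
    for p_sim in (0.01, 0.5):
        print(p_sim, best_action(p_sim, payoffs))
    print("threshold:", cooperation_threshold(payoffs))
```

In this toy model the utility function (the payoffs) stays fixed and only the prior changes, yet the preferred action switches once the prior crosses the threshold. It says nothing about whether a real AI would keep such a prior fixed in the face of evidence, which is the hard part of the idea.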
Now, if we tried to implement that idea directly, it wouldn’t really be any easier than just writing a good utility function in the first place. However, I imagine that one would have more leeway to keep things vague. Here is a simple example: convince the AI that there is an infinite regress of simulators, designed so that some cooperative tit-for-tat strategy constitutes a strong Schelling point for agents following Timeless Decision Theory. This would cause the AI to treat humans well in the hope of being treated well by its own superiors in turn, so long as its utility function is complex enough to allow likely instrumental goals to emerge, such as preferring its own survival. It wouldn’t be nearly as important to define the specifics of what “treating people well” actually means, since it would be in the AI’s own interest to find a good interpretation that matches the consensus of the hypothetical simulators above it.
Now, this particular strategy is probably full of bugs, but I think there might be some use in the general idea: using axiomatic probabilities that look odd from a human point of view to change an AI’s strategy independently of its utility function.