By “semantic models that are rich enough”, do you mean that the AI might need a semantic model for the power of other agents in the environment? There was an interesting paper about that idea. However, I think you shouldn’t need to account for other agents’ power directly—that the minimal viable design involves just having the AI not gain much power for itself.
I’m currently thinking more about power-seeking in the n-player setting, however.
By “semantic models that are rich enough”, do you mean that the AI might need a semantic model for the power of other agents in the environment?
Actually, in my remarks above I am less concerned about how rich a model the AI may need. My main intuition is that we ourselves may need a semantic model that describes the comparative power of several players, if our goal is to understand motivations towards power more deeply and generally.
To give a specific example from my own recent work: in working out more details about corrigibility and indifference, I ended up defining a safety property (S2 in the paper) that is about control. Control is a form of power: if I control an agent’s future reward function, I have power over the agent, and indirect power over the resources it controls. To define this safety property mathematically, I had to make model extensions that I did not need to make to define or implement the reward function of the agent itself. So by analogy, if you want to understand and manage power-seeking in an n-player setting, you may end up needing to define model extensions and metrics that are not present inside the reward functions or reasoning systems of each player. You may need them to measure, study, or define the nature of the solution.
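To make the analogy a bit more concrete, here is a rough toy sketch of what I mean by a control metric that lives outside the agent's own reward function. This is my own illustrative construction, not the formalism from the paper: the tiny tabular dynamics and the names `expected_return` and `control_power` are assumptions made purely for the example.

```python
import numpy as np

# Toy sketch (illustrative only): measure how much a controller can shift an agent's
# expected return by choosing the agent's future reward function. The metric is
# computed in our analysis layer, not inside the agent's own reward function.

def expected_return(reward_fn, policy, transitions, horizon=10, gamma=0.9):
    """Discounted return of a fixed deterministic policy in a tiny tabular model."""
    state, value = 0, 0.0
    for t in range(horizon):
        action = policy[state]
        value += (gamma ** t) * reward_fn[state, action]
        state = transitions[state, action]
    return value

def control_power(candidate_reward_fns, policy, transitions):
    """Spread of returns over the reward functions the controller could install.
    A large spread means the controller has a lot of power over the agent."""
    returns = [expected_return(r, policy, transitions) for r in candidate_reward_fns]
    return max(returns) - min(returns)

# Example with 2 states and 2 actions (all numbers are arbitrary):
transitions = np.array([[1, 0], [0, 1]])   # transitions[state, action] -> next state
policy = np.array([0, 1])                  # deterministic policy
reward_a = np.array([[1.0, 0.0], [0.0, 1.0]])
reward_b = np.array([[0.0, 1.0], [1.0, 0.0]])
print(control_power([reward_a, reward_b], policy, transitions))
```

Note that nothing in the two candidate reward functions themselves mentions control or power; the metric only exists at the level of the model extension we build around them.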
The interesting paper you mention gives a kind-of example of such a metric when it defines an equality metric for its battery-collecting toy world, a metric that is not explicitly represented inside the agent’s own semantic model. For me, an important research challenge is to generalise such toy-world-specific safety/low-impact metrics into metrics that can apply to all toy (and non-toy) world models.
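As one illustration of what such a generalisation step could look like, consider defining the equality metric over any world model that can report a per-player outcome vector, rather than over batteries specifically. This is my own sketch, not the paper's metric: the Gini coefficient and the `expected_outcome` interface are assumptions chosen only for familiarity.

```python
import numpy as np

# Illustrative sketch: an equality metric defined over per-player outcomes, so the same
# code applies to a battery-collecting toy world and to any richer world model that can
# report such a vector. The Gini coefficient is just one familiar choice of metric.

def gini(outcomes):
    """Gini coefficient of non-negative per-player outcomes (0 means perfect equality)."""
    x = np.sort(np.asarray(outcomes, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0.0:
        return 0.0
    index = np.arange(1, n + 1)
    return float((2 * index - n - 1) @ x / (n * x.sum()))

def equality_metric(world_model, players):
    """Assumes the world model exposes expected_outcome(player); that interface is hypothetical."""
    return 1.0 - gini([world_model.expected_outcome(p) for p in players])
```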
Yet I do not see this generalisation step being taken often, and I am still trying to find out why not. Partly I think this is because it is mathematically difficult, but I do not think that is the whole story. So that is one reason I have been asking for opinions about semantic detail.
In one way, the interesting paper you mention goes in a direction directly counter to the one I feel is most promising. The paper explicitly frames its solution as a proposed modification of a specific deep Q-learning algorithm, not as an extension to the reward function that is supplied to this algorithm. By implication, this means they add more semantic detail inside the machine learning code while keeping it out of the reward function. My preference is to extend the reward function if at all possible, because this produces solutions that will generalise better over current and future ML algorithms.
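To sketch what the reward-function route could look like in practice, here is a minimal wrapper that adds a safety penalty to whatever reward the base environment emits, so that any off-the-shelf learner can be used unchanged. I am assuming a Gym-style `step`/`reset` interface, and `power_gain_estimate` is a placeholder for whichever power or impact metric one adopts; none of this is from the paper.

```python
# Illustrative sketch of extending the reward function instead of the learning algorithm:
# the penalty is folded into the reward signal, so the ML algorithm itself needs no change.

class ShapedRewardEnv:
    def __init__(self, env, power_gain_estimate, penalty_weight=1.0):
        self.env = env                                  # base environment (Gym-style assumed)
        self.power_gain_estimate = power_gain_estimate  # callable: (obs, action) -> float
        self.penalty_weight = penalty_weight

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        shaped = reward - self.penalty_weight * self.power_gain_estimate(obs, action)
        return obs, shaped, done, info
```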