Reading the above, I am reminded of a similar exchange about the need for semantic structure between Alex Turner and me here, so I’d like to get to the bottom of this. Can you clarify your broader intuitions about the need or non-need for semantic structure? (Same question goes to Alex.)
Frankly, I expected you would have replied to Stuart’s comment with a statement like the following: ‘using semantic structure in impact measures is a valid approach, and it may be needed to encode certain values, but in this research we are looking at how far we can get by avoiding any semantic structure’. But I do not see that.
Instead, you seem to imply that leveraging semantic structure is never needed when further scaling impact measures. It looks like you feel that we can solve the alignment problem by looking exclusively at ‘model-free’ impact measures.
To make this more specific, take the following example. Suppose a mobile AGI agent has a choice between driving over one human, driving over P pigeons, or driving over C cats. Now, humans have very particular ideas about how they value the lives of humans, pigeons, and cats, and would expect that those ideas are reflected reasonably well in how the agent computes its impact measure. You seem to be saying that we can capture all this detail by just making the right tradeoffs between model-free terms, by just tuning some constants in terms that calculate ‘loss of options by driving over X’.
Is this really what you are saying?
I have done some work myself on loss-of-options impact measures (see e.g. section 12 of my recent paper here). My intuition about how far you can scale these ‘model-free’ techniques to produce human-morality-aligned safety properties in complex environments seems to be in complete disagreement with your comments and those made by Alex.
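For concreteness, here is the kind of purely 'model-free' construction I am questioning, as a deliberately crude sketch (the toy world, the option count, and the weight constant are all made up for illustration, and not taken from any of the papers discussed):

```python
# A "model-free" impact term: a tuned constant times the loss in reachable
# configurations, with no semantic labels attached to what was lost.

def num_options(intact_objects: frozenset) -> int:
    # Crude proxy for "options": the number of future configurations the
    # agent can still reach, here just the subsets of still-intact objects.
    return 2 ** len(intact_objects)

def option_loss_penalty(before: frozenset, after: frozenset, weight: float) -> float:
    return weight * max(0, num_options(before) - num_options(after))

world = frozenset({"human", "pigeon_1", "pigeon_2", "cat_1"})
# The penalty comes out identical whichever single entity is driven over,
# unless value-laden (semantic) weights are added on top.
print(option_loss_penalty(world, world - {"cat_1"}, weight=0.01))  # 0.08
print(option_loss_penalty(world, world - {"human"}, weight=0.01))  # 0.08
```

My worry is precisely that all the morally relevant differences end up hiding in the choice of the weight constant and of what counts as an 'option'.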
I think of impact measures as trying to do (at least) two things:
1. Stop catastrophic power-seeking behavior in AGIs, and
2. Penalize undesirable side-effects (running over 10 cats instead of 2 pigeons, breaking vases).
I tend to think about (1), because I think (2) is a lost cause without it—at least, if the agent is superhuman. I think that, if (1) is cleanly solvable, it will not require semantic structure. I also think that you can get a long ways towards (2) without explicit semantic structure (see Avoiding Side Effects in Complex Environments). Because I think (1) is so important, I’ve spent the last year trying to mathematically understand why power-seeking occurs.
That said, if we had a solution to (1) and fully wanted to solve (2), then yes, I think we would need semantic structure. Certain effects are either (a) value-dependent (pigeons vs cats) or (b) too distant from the agent to count as option-loss-for-the-agent. For example, imagine penalizing vase-breaking with AUP when the vases are on the other side of the planet: they'll never show up in the agent's optimal value calculations.
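(Roughly, and leaving out the scaling/normalization term, the AUP penalty compares the agent's attainable values for a set of auxiliary reward functions $R_1, \dots, R_n$ before and after acting:

$$\text{Penalty}(s,a) \;=\; \sum_{i=1}^{n} \bigl|\, Q^{*}_{R_i}(s,a) - Q^{*}_{R_i}(s,\varnothing) \,\bigr|,$$

where $\varnothing$ is a no-op action. A vase on the other side of the planet barely changes any attainable $Q^{*}_{R_i}$, so breaking it contributes essentially nothing to the penalty.)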
I think many people think of (2) when they consider impact measures, while I usually think about purpose (1). Hopefully this clarifies my views a bit.
Thanks for clarifying your view! I agree that for point 1 above, less semantic structure should be needed.
Reading some of the links above again, I still feel that we might be having different views on how much semantic structure is needed. But this also depends on what you count as semantic structure.
To clarify where I am coming from, I agree with the thesis of your paper Optimal Farsighted Agents Tend to Seek Power. I am not in the camp which, to quote the abstract of the paper, ‘voices scepticism’ about emergent power seeking incentives.
But to me, the main mechanism that turns power-seeking incentives into catastrophic power-seeking is when at least two power-seeking entities with less than 100% aligned goals start to interact with each other in the same environment. So I am looking for semantic models that are rich enough to capture at least 2 players being present in the environment.
I have the feeling that you believe that moving to the 2-or-more-players level of semantic modelling is of lesser importance, or even a distraction, and that we may be able to solve things cleanly enough if we just make every agent not seek too much power. Or maybe you are just prioritizing a deeper dive in that particular direction initially?
By “semantic models that are rich enough”, do you mean that the AI might need a semantic model for the power of other agents in the environment? There was an interesting paper about that idea. However, I think you shouldn’t need to account for other agents’ power directly—that the minimal viable design involves just having the AI not gain much power for itself.
I’m currently thinking more about power-seeking in the n-player setting, however.
By “semantic models that are rich enough”, do you mean that the AI might need a semantic model for the power of other agents in the environment?
Actually, in my remarks above I am less concerned about how rich a model the AI may need. My main intuition is that we ourselves may need a semantic model that describes the relative power of several players, if our goal is to understand motivations towards power more deeply and generally.
To give a specific example from my own recent work: in working out more details about corrigibility and indifference, I ended up defining a safety property (S2 in the paper) that is about control. Control is a form of power: if I control an agent's future reward function, I have power over the agent, and indirect power over the resources it controls. To define this safety property mathematically, I had to make model extensions that I did not need when defining or implementing the reward function of the agent itself. So by analogy, if you want to understand and manage power-seeking in an n-player setting, you may end up needing to define model extensions and metrics that are not present inside the reward functions or reasoning systems of each player. You may need them to measure, study, or define the nature of the solution.
The interesting paper you mention gives a kind of example of such a metric, when it defines an equality metric for its battery-collecting toy world, an equality metric that is not (explicitly represented) inside the agent's own semantic model. For me, an important research challenge is to generalise such toy-world-specific safety/low-impact metrics into metrics that can apply to all toy (and non-toy) world models.
Yet I do not see this generalisation step being done often, and I am still trying to find out why not. Partly, I think, because it is mathematically difficult; but I do not think that is the whole story. So that is one reason I have been asking for opinions about semantic detail.
In one way, the interesting paper you mention goes in a direction directly counter to the one I feel is most promising. The paper explicitly frames its solution as a proposed modification of a specific deep Q-learning algorithm, not as an extension to the reward function that is being supplied to this learning algorithm. By implication, this means they add more semantic detail inside the machine learning code, while keeping it out of the reward function. My preference is to extend the reward function if at all possible, because this produces solutions that will generalise better over current and future ML algorithms.
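To illustrate the separation I have in mind, here is a minimal sketch (the feature dictionaries, the placeholder penalty, and the trade-off constant are all illustrative): the impact term lives inside the reward signal, so any learning algorithm that consumes a scalar reward can be used unchanged.

```python
# Sketch: keep the impact term inside the reward function, not inside the
# learning algorithm, so that deep Q-learning, policy gradient methods, etc.
# can all be used unchanged on the combined scalar signal.

def impact_penalty(env_before: dict, env_after: dict) -> float:
    # Placeholder impact metric: count how many tracked features changed.
    return float(sum(env_before[k] != env_after[k] for k in env_before))

def shaped_reward(task_reward: float, env_before: dict, env_after: dict,
                  lam: float = 0.5) -> float:
    # The learning algorithm only ever sees this one number.
    return task_reward - lam * impact_penalty(env_before, env_after)

before = {"vase_intact": True, "door_open": False}
after = {"vase_intact": False, "door_open": False}
print(shaped_reward(task_reward=1.0, env_before=before, env_after=after))  # 0.5
```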
It was not my intention to imply that semantic structure is never needed—I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it’s unlikely we can get away without it.
There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it’s plausible that agents can learn the semantic structure that’s needed for impact measures through unsupervised learning about the world, without relying on human input. This information could be incorporated in the weights assigned to reaching different states or satisfying different utility functions by the deviation measure (e.g. states where pigeons / cats are alive).
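(As a hypothetical sketch of what I mean, with made-up notation: if the deviation measure is a weighted sum over auxiliary utility functions $u_i$,

$$d(s) \;=\; \sum_i w_i \,\bigl|\, u_i(s) - u_i(s_{\text{baseline}}) \,\bigr|,$$

then the weights $w_i$ on features like 'pigeons alive' or 'cats alive' could be filled in from what the agent has inferred about human preferences by observing the world, rather than being specified by hand.)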
Thanks for the clarification; I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments.
You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from ‘explicit human input’. We may be using the term ‘semantic structure’ in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups.
If you set things up to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function; e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic knowledge that is in the owner's brain. You might say instead that the metric does not have rich semantic structure at all, because it is just a bit from a button press. For me, an impact metric defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, when the world we are talking about is not a toy world but the real world.
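To make the indirection point concrete, here is a minimal sketch of such a setup (the observation field name is made up): the hand-coded part of the metric is a single bit, yet that bit routes the owner's entire semantic judgement of 'bad impact' into the reward signal.

```python
# The only hand-written "semantics" here is one observable feedback bit;
# everything the owner knows about humans, pigeons, cats and vases enters
# the reward indirectly, through when they choose to press the button.

def impact_term(observation: dict) -> float:
    return -1.0 if observation.get("do_not_like_button_pressed") else 0.0

print(impact_term({"do_not_like_button_pressed": True}))   # -1.0
print(impact_term({"do_not_like_button_pressed": False}))  # 0.0
```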