I’m not sure if I fully understood the section on machine learning. Is the main idea that you just apply the indifference correction at every timestep, so that the agent always acts as if it believes that use of the terminal does nothing?
Yes, the learning agent also applies the indifference-creating balancing term at each time step. I am not sure if there is a single main idea that summarizes the learning agent design—if there had been a single main idea then the section might have been shorter. In creating the learning agent design I combined several ideas and techniques, and tweaked their interactions until I had something that provably satisfies the safety properties.
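To make the mechanics a bit more concrete, here is a rough Python sketch of what a per-timestep indifference-creating balancing term of this general kind looks like. All names are illustrative assumptions, not the paper's notation: `V(state, R)` stands for the agent's estimated future value in `state` under payload reward function `R`.

```python
# Hypothetical sketch, not the paper's formalism: a per-timestep
# balancing term that compensates the agent for any estimated value
# gained or lost when the input terminal swaps R_old for R_new,
# leaving the agent indifferent to whether the swap happens.

def corrected_reward(base_reward, V, state, R_old, R_new):
    """Return the reward signal plus the indifference balancing term.

    V(state, R) is the agent's value estimate for `state` under
    payload reward function R (an illustrative stand-in)."""
    balancing_term = V(state, R_old) - V(state, R_new)
    return base_reward + balancing_term
```

If the terminal makes no change (`R_old == R_new`), the balancing term is zero and the reward passes through unmodified.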
What about the issue that “the terminal does nothing” is actually a fact that has impacts on the world, which might produce a signal in the training data?
As a general rule, the training data gathered during previous time steps will show very clearly that the signals coming from the input terminal, and any changes in them, have an effect on the agent's actions.
This is not a problem, but to illustrate why not, I will first describe an alternative learning agent design where it would be a problem. Consider a model-free advanced Q-learning type agent, which uses the decision-making policy 'do more of what earlier versions of myself did when they got high reward signals'. If such an agent has the Rbaseline container reward function defined in the paper, and the training record implies the existence of attractive wireheading options, then it might well use them. If the Q-learner has the Rsl container reward function instead, then the policy process ends up with an emergent drive to revert any updates made via the input terminal, so that the agent gets back to a set of world states that are more familiar territory. The agent might also want to block updates, for the same reason. But the learning agent in the paper does not use this Q-learning type of decision-making policy.
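A minimal tabular sketch shows why this decision-making policy is vulnerable: the policy is derived directly from observed reward signals, so a wireheaded high-reward signal in the training record gets reinforced just like a legitimate one. This is generic textbook Q-learning, not the agent design from the paper.

```python
# Minimal tabular Q-learning sketch of the 'do more of what earlier
# versions of myself did when they got high reward signals' policy.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update: nudge Q[(s, a)] toward the observed
    reward plus the discounted best estimated value of the next state."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def greedy_action(Q, s, actions):
    """Pick the action with the highest learned Q-value in state s."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

If the training record contains a 'wirehead' action that yielded reward 10 and a 'work' action that yielded reward 1, the greedy policy converges on wireheading, with no notion that one of the two reward signals was corrupted.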
The agent in the paper takes actions using a reasoning process that is different from 'do what earlier versions of myself did when…'. Before I try to describe it, first a disclaimer. Natural language analogies to human reasoning are a blunt tool for describing what happens in the learning agent: this agent has too many moving and partially self-referential parts inside to capture them all in a single long sentence. That being said, the learning agent's planning process is like 'do what a hypothetical agent would do, in the world you have learned about, under the counterfactual assumption that the payload reward function of that hypothetical agent will never change, no matter what happens at the input terminal you also learned about'.
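The counterfactual planning step above can be sketched as follows. This is an illustrative toy, not the paper's construction: the planner rolls out futures in its learned world model while holding the payload reward function frozen, regardless of any input-terminal events that model might predict.

```python
# Hypothetical sketch of planning under the counterfactual assumption
# that the payload reward function never changes. `world_model(s, a)`
# returns the predicted next state; `payload_R(s, a)` is the current
# payload reward function, frozen throughout the rollout.

def plan_action(world_model, state, payload_R, actions, depth=3, gamma=0.9):
    """Pick the action maximizing expected return under the frozen
    payload reward function. All names are illustrative assumptions."""
    def value(s, d):
        if d == 0:
            return 0.0
        return max(payload_R(s, a) + gamma * value(world_model(s, a), d - 1)
                   for a in actions)
    return max(actions,
               key=lambda a: payload_R(state, a)
                             + gamma * value(world_model(state, a), depth - 1))
```

Note that the world model itself may well have learned what the input terminal does; the blindness is only in the planning calculation, which scores every future with the same frozen `payload_R`.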
In section 11 of the paper, I describe the above planning process as creating a form of bureaucratic blindness. By design, the process simply ignores some of the information in the training record: this information is simply not relevant to maximizing the utility that needs to be maximized.
The analogy is that if you tell a robot that not moving is safe behavior, and get it to predict what happens in safe situations, it will include a lot of “the humans get confused why the robot isn’t moving and try to fix it” in its predictions. If the terminal actually does nothing, humans who just used the terminal will see that, and will try to fix the robot, as it were. This creates an incentive to avoid situations where the terminal is used, even if it’s predicted that it does nothing.
I think what you are describing above is the residual manipulation incentive in the second toy world of section 6 of the paper. This problem also exists for optimal-policy agents that have 'nothing left to learn', so it is an emergent effect that is unrelated to machine learning.
Thanks!