Planned summary:
Since we probably can’t specify a reward function by hand, one way to get an agent that does what we want is to have it imitate a human. As long as it does this faithfully, it is as safe as the human it is imitating. However, in a train-test paradigm, the resulting agent may faithfully imitate the human on the training distribution but fail catastrophically on the test distribution. (For example, a deceptive model might imitate faithfully until it has sufficient power to take over.) One solution is to never stop training, that is, use an online learning setup where the agent is constantly learning from the demonstrator.
There are a few details to iron out. The agent needs to reduce the frequency with which it queries the demonstrator (otherwise we might as well just have the demonstrator do the work). Crucially, we need to ensure that the agent will never do something that the demonstrator wouldn’t have done, because such an action could be arbitrarily bad.
This paper proposes a solution in the paradigm where we use Bayesian updating rather than gradient descent to select our model, that is, we have a prior over possible models and then when we see a demonstrator action we update our distribution appropriately. In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p. The total probability of all the actions will typically be less than 1 -- the remaining probability is assigned to querying the demonstrator.
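To make this concrete, here is a minimal Python sketch of the per-timestep rule, assuming a finite set of candidate models with an explicit posterior (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def act_or_query(posterior, policies, history, actions, n):
    """One timestep of the rule described above.

    posterior: posterior probability of each candidate model (sums to 1).
    policies:  policies[i](history, a) = probability model i assigns to
               action a given the history so far.
    actions:   finite list of possible actions.
    n:         number of most-probable models to consult.
    """
    # Take the N most probable models under the current posterior.
    top = np.argsort(posterior)[-n:]

    # An action may only be taken with the largest probability that *every*
    # one of the top-N models is willing to assign to it.
    action_probs = [min(policies[i](history, a) for i in top) for a in actions]

    # The leftover mass (where the models disagree) goes to querying the
    # demonstrator; it is nonnegative because each model's probabilities
    # sum to 1 and we took pointwise minima.
    query_prob = max(0.0, 1.0 - sum(action_probs))

    outcomes = list(actions) + ["QUERY_DEMONSTRATOR"]
    probs = np.array(action_probs + [query_prob])
    probs /= probs.sum()  # guard against floating-point drift
    return outcomes[np.random.choice(len(outcomes), p=probs)]
```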
The key property here is that as long as the true demonstrator is in the top N models, the agent never autonomously takes an action with more probability than the demonstrator would. Therefore, as long as we believe the demonstrator is safe, the agent should be safe as well. Since the agent learns more about the demonstrator every time it queries them, over time it needs to query the demonstrator less often. Note that the higher N is, the more likely it is that the true model is among those N models (and thus the more safety we have), but also the more likely it is that we will have to query the demonstrator. This tradeoff is controlled by a hyperparameter α that implicitly determines N.
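The exact rule linking α to N is in the paper; purely as an illustrative stand-in (an assumption, not the paper's construction), one could keep every model whose posterior weight is within a factor α of the most probable model, so that a smaller α keeps more models:

```python
def top_models(posterior, alpha):
    """Illustrative (not the paper's) rule: keep every model whose posterior
    is within a factor alpha of the most probable model.  Smaller alpha keeps
    more models, giving more safety but more queries to the demonstrator."""
    best = max(posterior)
    return [i for i, p in enumerate(posterior) if p >= alpha * best]
```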
Opinion:
One of the most important approaches to improve inner alignment is to monitor the performance of your system online, and train to correct any problems. This paper shows the benefit of explicitly quantified, well-specified uncertainty: it allows you to detect problems _before they happen_ and then correct for them.
This setting has also been studied in <@delegative RL@>(@Delegative Reinforcement Learning@), though there the agent also has access to a reward signal in addition to a demonstrator.
Comments:

> In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p.

This is so much clearer than I’ve ever put it.

> (There’s a specific rule that ensures that N decreases over time.)

N won’t necessarily decrease over time, but all of the models will eventually agree with each other.

> monitor the performance of your system online, and train to correct any problems

I would have described Vanessa’s and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

> N won’t necessarily decrease over time, but all of the models will eventually agree with each other.

Ah, right. I rewrote that paragraph, getting rid of that sentence and instead talking about the tradeoff directly.

> I would have described Vanessa’s and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

Added a sentence to the opinion noting the benefits of explicitly quantified uncertainty.