Here’s a project idea that I wish someone would pick up (written as a shortform rather than as a post because that’s much easier for me):
It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
Problem: how do you measure ‘competence’ without reference to a goal??
Prior work has used the ‘agents vs devices’ framework, where you put a prior over reward functions, specify a likelihood for what ‘real agents’ would do given each reward function, and do Bayesian inference comparing the ‘agent’ hypothesis against the hypothesis that actions are chosen randomly. If, conditional on your behaviour, you’re probably an agent rather than a random actor, then you’re competent.
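For concreteness, here’s a toy sketch of that inference. Everything in it (reward_fns, agent_likelihood, random_likelihood) is a hypothetical placeholder, not anything taken from the prior work:

```python
import numpy as np

def posterior_agent_probability(trajectory, reward_fns, agent_likelihood,
                                random_likelihood, prior_agent=0.5):
    """Toy agents-vs-devices inference.

    trajectory:        observed behaviour of the learner
    reward_fns:        candidate reward functions (the space we must specify)
    agent_likelihood:  agent_likelihood(trajectory, r) = P(trajectory | agent pursuing r)
    random_likelihood: random_likelihood(trajectory) = P(trajectory | random actor)
    """
    # Marginal likelihood under the 'agent' hypothesis: average over a
    # uniform prior on the candidate reward functions.
    p_agent = np.mean([agent_likelihood(trajectory, r) for r in reward_fns])
    p_device = random_likelihood(trajectory)

    # Posterior probability that the behaviour came from an agent rather than
    # a random actor; call the learner 'competent' if this is high.
    evidence = prior_agent * p_agent + (1 - prior_agent) * p_device
    return prior_agent * p_agent / evidence
```

Note how directly the answer depends on getting reward_fns and agent_likelihood right, which is the complaint below.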
I don’t like this:
Crucially relies on knowing the space of reward functions that the learner in question might have.
Crucially relies on knowing how agents act given certain motivations.
A priori it’s not so obvious why we care about this metric.
Here’s another option: throw out ‘competence’ and instead talk about being ‘consequential’.
This has a name collision with ‘consequentialist’ that you’ll probably have to fix but whatever.
The setup: you have your learner do stuff in a multi-agent environment. You apply the AUP (attainable utility preservation) metric to every agent other than your learner. You say that your learner is ‘consequential’ if it strongly affects the attainable utility of those other agents.
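Roughly, in code, this is what I mean (a toy sketch: value_fn, which scores how much utility an agent could attain in a given environment state under a given reward function, and env.rollout are hypothetical, not a real API):

```python
import numpy as np

def consequentiality(env, learner_policy, other_agents, reward_fns, value_fn):
    """Toy version of the proposed metric: how much does the learner shift
    the attainable utility of every other agent?"""
    # Attainable utility of each other agent before the learner acts.
    baseline = {a: [value_fn(env, a, r) for r in reward_fns] for a in other_agents}

    # Let the learner act in the environment (hypothetical API).
    env_after = env.rollout(learner_policy)

    shifts = []
    for a in other_agents:
        after = [value_fn(env_after, a, r) for r in reward_fns]
        # Mean absolute change in attainable utility, averaged over reward functions.
        shifts.append(np.mean(np.abs(np.array(after) - np.array(baseline[a]))))

    # The learner is 'consequential' if this number is large.
    return float(np.mean(shifts))
```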
How good is this?
It still relies on having a space of reward functions, but there’s some more wiggle-room: you probably don’t need to get the space exactly right, just to have goals that are similar to yours.
Note that this would no longer be true if this were a metric you were optimizing over.
You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that achieving that utility has suddenly become much harder computationally.
That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because the metric is no longer a likelihood ratio, where misspecification can just rule out the right answer.
It’s more obvious why we care about this metric.
Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents’ attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.
Ideally you could even show a relation between this and the agents vs devices framing.
I think this is the sort of project a first-year PhD student could fruitfully make progress on.
Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.