One way to try to measure a system's capability robustness, separately from its alignment robustness, off the training distribution would be to:
1. use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour,
2. train a new system to do as well on the inferred reward function as the original system, and
3. measure the number of training steps the new system needs to reach that point.
This would let you compare different systems on how capability robust they are; a rough sketch of the whole loop is below.
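A minimal Python sketch of the proposed measurement, just to make the steps concrete. Everything in it is a placeholder: `collect_rollouts`, `infer_reward_irl`, `make_fresh_system`, `train_step`, and `evaluate` are hypothetical helpers standing in for whatever IRL and RL machinery you'd actually use, not any real library's API.

```python
def capability_robustness_steps(original_system, off_dist_env,
                                max_steps=100_000, tolerance=0.0):
    """Number of training steps a fresh system needs to match the
    original system's score under an IRL-inferred reward (fewer steps
    = more capability robust). All helpers are hypothetical."""
    # 1. Record the original system's off-distribution behaviour.
    rollouts = collect_rollouts(original_system, off_dist_env)

    # 2. Infer a reward function that explains that behaviour via IRL.
    reward_fn = infer_reward_irl(rollouts, off_dist_env)

    # The original system's score under the inferred reward is the
    # target the new system has to reach.
    target = evaluate(original_system, off_dist_env, reward_fn)

    # 3. Train a fresh system on the inferred reward and count the
    #    steps until it matches the original's score.
    new_system = make_fresh_system(off_dist_env)
    for step in range(1, max_steps + 1):
        train_step(new_system, off_dist_env, reward_fn)
        if evaluate(new_system, off_dist_env, reward_fn) >= target - tolerance:
            return step
    return None  # never matched the original within the step budget
```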
Maybe there’s a version that trains the new system with behavioural cloning instead, but then it’s less clear how to measure when the clone is as competent as the original agent (maybe using a discriminator, as sketched below?).
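The discriminator idea might look something like this, with the same caveat that `collect_rollouts`, `train_discriminator`, and `held_out_accuracy` are hypothetical placeholders: train a classifier to tell the original system's trajectories from the clone's, and call the clone "done" once the classifier can't beat chance. Strictly this measures behavioural indistinguishability rather than competence as such.

```python
def clone_matches_original(original_system, clone, off_dist_env,
                           n_trajectories=500, chance_margin=0.02):
    """Stop cloning when a discriminator trained to separate
    original-vs-clone trajectories does no better than chance.
    All helpers are hypothetical."""
    real = collect_rollouts(original_system, off_dist_env, n_trajectories)
    fake = collect_rollouts(clone, off_dist_env, n_trajectories)

    # Fit a binary classifier on half the data, test on the other half.
    split = n_trajectories // 2
    discriminator = train_discriminator(real[:split], fake[:split])
    accuracy = held_out_accuracy(discriminator, real[split:], fake[split:])

    # Held-out accuracy near 0.5 means the trajectory distributions
    # are indistinguishable to this discriminator.
    return abs(accuracy - 0.5) <= chance_margin
```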
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s own ontology and capabilities.