I had an idea for fighting goal misgeneralization. Doesn’t seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:
Use IRL to learn which values are consistent with the actor’s behavior.
When training the model to maximize the actual reward, regularize it to get lower scores under the values learned via IRL.
That way, the agent is incentivized to signal that it has no other values (and is somewhat incentivized against power seeking).
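To make the proposal concrete, here is a minimal toy sketch of the combined objective. The names (`true_reward`, `irl_reward`, `LAMBDA`) are illustrative assumptions, and the IRL-learned reward is replaced by a fixed linear stand-in rather than an actual IRL fit, which in the real proposal would be re-estimated from the agent's own behavior.

```python
# Minimal toy sketch of the proposed objective (not a full training loop).
# The linear "reward models" and the names here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# A "trajectory" here is just a feature vector; both rewards are linear in it.
true_reward_w = rng.normal(size=8)  # the reward we actually want maximized
irl_reward_w = rng.normal(size=8)   # reward recovered by IRL from the agent's behavior

LAMBDA = 0.1  # strength of the penalty on the IRL-inferred values

def true_reward(traj_features: np.ndarray) -> float:
    return float(true_reward_w @ traj_features)

def irl_reward(traj_features: np.ndarray) -> float:
    # In the actual proposal this would be re-fit periodically (e.g. with
    # MaxEnt IRL) on the agent's own rollouts.
    return float(irl_reward_w @ traj_features)

def regularized_objective(traj_features: np.ndarray) -> float:
    # Maximize the actual reward while scoring low under whatever values
    # the IRL model attributes to the agent's behavior.
    return true_reward(traj_features) - LAMBDA * irl_reward(traj_features)

traj = rng.normal(size=8)
print(regularized_objective(traj))
```

The key term is the `- LAMBDA * irl_reward(...)` penalty, which is what "incentivized to signal not having any other values" cashes out to here.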
I probably don’t understand the shortform format, but it seems like others can’t create top-level comments. So you can comment here :)