I had an idea for fighting goal misgeneralization. Doesn’t seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:
Use IRL to learn which values are consistent with the actor’s behavior.
When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by the IRL.
That way, the agent is incentivized to signal that it has no other values (and is somewhat incentivized against power seeking).
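To make the proposal concrete, here is a minimal sketch in a one-step (bandit) setting. Everything in it is an assumption for illustration: the function names, the penalty weight `lam`, and especially `irl_reward`, which stands in for a real IRL method by using the max-entropy identity that a policy with pi(a) proportional to exp(r(a)) is explained by r(a) = log pi(a) up to a constant. Note that, as in the idea as stated, the penalty applies to the whole inferred reward, including whatever part of it coincides with the actual reward.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of action logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def irl_reward(policy_probs):
    # Stand-in for an IRL step (assumption, not a full IRL algorithm):
    # under a one-step max-entropy model pi(a) ~ exp(r(a)), the reward
    # consistent with the behavior is log pi(a) up to an additive constant.
    log_p = np.log(policy_probs)
    return log_p - log_p.mean()

def regularized_objective(logits, true_reward, lam):
    # Expected actual reward minus lam times the expected score under
    # the values inferred from the policy's own behavior.
    probs = softmax(logits)
    inferred = irl_reward(probs)
    task_term = probs @ true_reward
    penalty = probs @ inferred
    return task_term - lam * penalty

true_reward = np.array([1.0, 0.0, 0.0])   # hypothetical actual reward
peaked = np.array([2.0, 0.0, 0.0])        # policy that strongly prefers action 0
print(regularized_objective(peaked, true_reward, lam=0.0))
print(regularized_objective(peaked, true_reward, lam=0.5))
```

In this toy version the penalty is zero for a uniform policy and grows as the policy reveals stronger preferences, so it trades off directly against pursuing the actual reward. A more faithful version would presumably need the IRL step to separate the intended reward from the "other" values before penalizing.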
I probably don’t understand the shortform format, but it seems like others can’t create top-level comments. So you can comment here :)