(ML researchers) We still don’t have a robust solution to specification gaming: powerful agents find ways to get high reward, but not in the way you’d want. Sure, you can tweak your objective, add rules, but this doesn’t solve the core problem, that your agent doesn’t seek what you want, only a rough operational translation.
What would a high-fidelity translation would look like? How would create a system that doesn’t try to game you?
(ML researchers) We still don’t have a robust solution to specification gaming: powerful agents find ways to get high reward, but not in the way you’d want. Sure, you can tweak your objective, add rules, but this doesn’t solve the core problem, that your agent doesn’t seek what you want, only a rough operational translation.
What would a high-fidelity translation would look like? How would create a system that doesn’t try to game you?