I think the retroactive editing of rewards (not just to punish explicitly bad action but to slightly improve evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do that would make your AI more capable and make it more likely to pass most behavioral safety interventions.
In other words, even if the average episode length is short (e.g. 1 hour), I think the default outcome is to have the rewards for that episode be computed as far after the fact as possible, because that helps Alex improve at long-range planning (a skill Magma would try hard to get it to have). This can be done in a way that doesn’t compromise speed of training—you simply reward Alex immediately with your best guess reward, then keep editing it later as more information comes in. At all points in time you have a “good enough” reward ready to go, while also capturing the benefits of pushing your model to think in as long-term a way as possible.
Okay, it sounds like you know more about that than I do. It sounds like it might cause Alex to typically care about outcomes that are months in the future? That seems somewhat close to what it would take to be a major danger.
I think the retroactive editing of rewards (not just to punish explicitly bad action but to slightly improve evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do that would make your AI more capable and make it more likely to pass most behavioral safety interventions.
In other words, even if the average episode length is short (e.g. 1 hour), I think the default outcome is to have the rewards for that episode be computed as far after the fact as possible, because that helps Alex improve at long-range planning (a skill Magma would try hard to get it to have). This can be done in a way that doesn’t compromise speed of training—you simply reward Alex immediately with your best guess reward, then keep editing it later as more information comes in. At all points in time you have a “good enough” reward ready to go, while also capturing the benefits of pushing your model to think in as long-term a way as possible.
Okay, it sounds like you know more about that than I do. It sounds like it might cause Alex to typically care about outcomes that are months in the future? That seems somewhat close to what it would take to be a major danger.