How far-sighted would Alex be?
My guess is that the default outcome is that it’s easier to train Alex to care about near-term results, and harder to get Alex to care about the distant future. Especially in a “racing forward” scenario, I’d expect Magma to prioritize training that can be done quickly.
So I expect the most likely result is for Alex to be more myopic than the humans who run Magma. This seems somewhat at odds with the capabilities you imply.
That guess is heavily dependent on details that you haven't specified. There are likely ways that Alex could be made more far-sighted than humans. Carl Shulman's point about retroactive punishment suggests one possibility, but it doesn't seem close to being the default.
I’m not suggesting you add those details. The post is a bit long as it is. My advice is to be more uncertain than you are (at least in the introduction to this post) whether the default outcome is AI takeover. That’s a minor complaint about a mostly great post.
I think the retroactive editing of rewards (not just to punish explicitly bad actions, but to slightly improve the evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do: it would make your AI more capable and more likely to pass most behavioral safety interventions.
In other words, even if the average episode length is short (e.g. one hour), I think the default outcome is for the rewards for that episode to be computed as far after the fact as possible, because that helps Alex improve at long-range planning (a skill Magma would try hard to instill). This can be done without compromising training speed: you simply reward Alex immediately with your best-guess reward, then keep editing it later as more information comes in. At all points in time you have a "good enough" reward ready to go, while also capturing the benefits of pushing your model to think in as long-term a way as possible.
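A minimal sketch of the "provisional reward now, revisions later" idea, assuming nothing about Magma's actual infrastructure (all class and method names here are hypothetical illustrations, not anything from the post): training always reads the latest estimate for an episode, so no step has to wait, while later information keeps adjusting old episodes.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeRecord:
    """One completed episode with a reward that can be revised after the fact."""
    episode_id: int
    provisional_reward: float                       # best-guess reward, assigned immediately
    revisions: list = field(default_factory=list)   # later corrections as information arrives

    @property
    def current_reward(self) -> float:
        # Training reads the latest estimate; it never blocks on future evaluation.
        return self.provisional_reward + sum(self.revisions)


class RetroactiveRewardBuffer:
    """Hypothetical buffer sketching retroactive reward editing."""

    def __init__(self) -> None:
        self._episodes: dict[int, EpisodeRecord] = {}

    def record(self, episode_id: int, best_guess: float) -> None:
        """Store an episode with its immediate best-guess reward."""
        self._episodes[episode_id] = EpisodeRecord(episode_id, best_guess)

    def revise(self, episode_id: int, delta: float) -> None:
        """Adjust an old episode's reward once long-run outcomes are observed."""
        self._episodes[episode_id].revisions.append(delta)

    def reward_for_training(self, episode_id: int) -> float:
        """Return the current 'good enough' reward for a training update."""
        return self._episodes[episode_id].current_reward
```

The design choice the comment gestures at is visible here: `record` keeps training fast, while `revise` is what pushes the model toward caring about consequences long after the episode ends, since the reward it ultimately learns from reflects those later observations.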
Okay, you seem to know more about that than I do. It sounds like this might cause Alex to typically care about outcomes months in the future? That seems fairly close to what it would take for Alex to be a major danger.