Good job on the independent exploration! When I went down this rabbit hole, I got stuck on "how do you specify long-term-useful subtasks with no long-term constraints?"
In particular, you need to rely on something like value learning having already happened to prevent the agent from doing things that are short-term good but long-term disastrous. (E.g., building a skyscraper that will collapse soon after completion, due to a flaw humans can't detect in advance.)
But I agree that, modulo the issues you and others have listed, this approach meaningfully bounds agents. Certainly, it should be the default starting point for an iterative alignment strategy.