So the big problem I see with this is that it's still in the optimization framework, assuming we actually want to optimize the initial criterion. While we can imagine changing the initial criterion, this is already something we can effectively do with RL if we specify our reward to be something communicated by a human overseer (but of course that doesn't really solve the problem...)
The proposal is reminiscent of the Actor-Critic framework from RL (analogy: actor—model, critic—criterion), which learns a policy (the actor) and a value function (the critic) simultaneously.
In that case, the true reward function plays the role of the initial criterion, so you don't actually get to evaluate the true criterion (which would be something like distance from the optimal policy); you only get what amounts to noisy samples of it. The goal in both cases is to learn a good model (i.e. the policy, in Actor-Critic).
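To make the analogy concrete, here's a minimal one-step Actor-Critic sketch on a toy two-state MDP (the environment, hyperparameters, and all names here are purely illustrative, not anything from the proposal): the actor plays the role of the model, the critic plays the role of the learned criterion, and both are updated simultaneously from noisy reward samples.

```python
# Minimal one-step Actor-Critic sketch on a toy 2-state MDP (illustrative only).
# Actor ~ "model" (a softmax policy), critic ~ "criterion" (a state-value estimate).
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2
# Toy dynamics: action 1 in state 0 tends to move to state 1, which pays reward.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])     # R[s, a] expected reward
GAMMA = 0.9

theta = np.zeros((N_STATES, N_ACTIONS))    # actor: softmax policy parameters
V = np.zeros(N_STATES)                     # critic: state-value estimates
alpha_actor, alpha_critic = 0.05, 0.1

def policy(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

s = 0
for step in range(20000):
    p = policy(s)
    a = rng.choice(N_ACTIONS, p=p)
    s_next = rng.choice(N_STATES, p=P[s, a])
    r = R[s, a]

    # Critic update: the TD error is a noisy, bootstrapped evaluation signal
    # (the analogue of only getting noisy samples of the "true criterion").
    td_error = r + GAMMA * V[s_next] - V[s]
    V[s] += alpha_critic * td_error

    # Actor update: follow the critic's evaluation (softmax policy gradient).
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi

    s = s_next

print("learned policy per state:", [policy(s).round(2) for s in range(N_STATES)])
```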
I think there is a conceptual issue with this proposal as it stands, namely that the interplay between changes in the model and changes in the criterion is not taken into account. E.g. there is no guarantee that recursively applying F to the initial_model using the criteria output by X would give you anything like the model output by X.
The cool thing about Actor-Critic is that you can prove (under suitable assumptions) that this method actually gives you an unbiased estimate of the true policy gradient (Sutton et al. 1999: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf). IIRC, it requires the assumption that the critic is trained to convergence between each update of the actor, though.
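For reference, my recollection of the result there (Theorem 2 in the linked paper) is roughly: if the critic $f_w$ uses compatible features and has been trained to a (local) minimum of its squared error under the current policy (the "trained to convergence" condition), then substituting it for the true action-value function leaves the policy gradient exact:

$$
\nabla_\theta J(\theta) \;=\; \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, f_w(s,a),
\qquad \text{where } \nabla_w f_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)
$$

and $w$ is a stationary point of $\sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\,\big[Q^{\pi_\theta}(s,a) - f_w(s,a)\big]^2$.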