Consider ‘poor man’s RL’ aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts.
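Here's a minimal sketch of that loop, just to make the mechanics concrete. The helper names (`generate_rollout`, `score_rollout`, `finetune_on`) are hypothetical stand-ins for whatever rollout, scoring, and fine-tuning machinery the actual setup would use; the point is only that each component trains on its own slice of the winning trajectories.

```python
# Sketch of expert iteration ("poor man's RL") for the Shoggoth/Face split.
# generate_rollout, score_rollout, and finetune_on are assumed/hypothetical helpers.

def expert_iteration_step(shoggoth, face, prompts, top_frac=0.3):
    # 1. Collect rollouts: each records the Shoggoth's reasoning output and the
    #    Face's user-visible output separately.
    rollouts = [generate_rollout(shoggoth, face, p) for p in prompts]

    # 2. Score every trajectory and keep the top 30%.
    rollouts.sort(key=score_rollout, reverse=True)
    top = rollouts[: max(1, int(top_frac * len(rollouts)))]

    # 3. Imitation-learn each component on its own outputs from the winning
    #    rollouts: the Shoggoth never trains on Face text, and vice versa.
    finetune_on(shoggoth, [(r.prompt, r.shoggoth_output) for r in top])
    finetune_on(face, [(r.prompt_plus_reasoning, r.face_output) for r in top])
    return shoggoth, face
```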
More advanced RL algorithms… well, we'd have to take them on a case-by-case basis, but I'm currently expecting many or even most of them to work just fine here.