Consider ‘poor man’s RL’ aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts.
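Here's a minimal sketch of that loop, just to make the mechanics concrete. The helper names (`generate_rollout`, `score_rollout`, `finetune_on`) are hypothetical stand-ins for whatever rollout, scoring, and fine-tuning machinery the actual setup would use; the point is only that each component trains on its own slice of the winning trajectories.

```python
# Sketch of expert iteration ("poor man's RL") for the Shoggoth/Face split.
# generate_rollout, score_rollout, and finetune_on are assumed/hypothetical helpers.

def expert_iteration_step(shoggoth, face, prompts, top_frac=0.3):
    # 1. Collect rollouts: each records the Shoggoth's reasoning output and the
    #    Face's user-visible output separately.
    rollouts = [generate_rollout(shoggoth, face, p) for p in prompts]

    # 2. Score every trajectory and keep the top 30%.
    rollouts.sort(key=score_rollout, reverse=True)
    top = rollouts[: max(1, int(top_frac * len(rollouts)))]

    # 3. Imitation-learn each component on its own outputs from the winning
    #    rollouts: the Shoggoth never trains on Face text, and vice versa.
    finetune_on(shoggoth, [(r.prompt, r.shoggoth_output) for r in top])
    finetune_on(face, [(r.prompt_plus_reasoning, r.face_output) for r in top])
    return shoggoth, face
```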
More advanced RL algorithms… well, we'd have to take them on a case-by-case basis, but I'm currently expecting many or even most of them to work just fine here.