This makes me wonder: how would Monte Carlo tree search do for GPT? And could you do AlphaGo-style IDA?
You’d need an analogue of the value network (or a value head), where current GPT seems analogous to the policy network. And then ideally you’d also want some analogue of winning / losing to ground out the evaluation.
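To make that concrete, here’s a minimal sketch (PyTorch) of what those extra heads might look like on top of a GPT-style trunk. The class name GPTWithHeads, the head names, and the assumption that the trunk returns final-layer hidden states are all hypothetical, not anything from an existing codebase.

```python
# Minimal sketch, assuming a decoder-only transformer "trunk" whose forward pass
# returns hidden states of shape (batch, seq, d_model). All names are hypothetical.
import torch
import torch.nn as nn

class GPTWithHeads(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.trunk = trunk                                # pre-trained transformer body
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token logits ~ policy network
        self.value_head = nn.Linear(d_model, 1)           # "good so far" score ~ value network
        self.done_head = nn.Linear(d_model, 1)            # probability the work is complete

    def forward(self, tokens):
        h = self.trunk(tokens)                            # (batch, seq, d_model)
        last = h[:, -1, :]                                # evaluate heads at the final position
        policy_logits = self.lm_head(h)                   # per-position next-token distribution
        value = torch.tanh(self.value_head(last))         # in [-1, 1], like AlphaGo's value output
        p_done = torch.sigmoid(self.done_head(last))      # in [0, 1]
        return policy_logits, value, p_done
```

Sharing the trunk keeps the value and “is done” estimates cheap: they reuse the same forward pass that already produces the next-token distribution.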
Maybe you could set it up like this --
start with a task description like, “write a poem in the style of e.e. cummings about the romance between cryptographers Alice and Bob”
feed the task description (with some boilerplate) into GPT, and have it start generating continuations
do MCTS on the continuations; use your value network (head) to evaluate the continuations against the task description; update the policy network based on the evaluations (a rough sketch of this loop follows the list)
include an “is done” head and evaluate it to decide when to stop
send completed works to humans for feedback; the feedback should include separate scores: “good so far” for the value head and “is a completed work” for the “is done” head.
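Here’s a toy sketch of what the MCTS loop over continuations could look like, in the spirit of AlphaGo’s PUCT selection. It assumes the hypothetical GPTWithHeads model sketched above; the expansion width, the 0.5 “is done” threshold, and c_puct are arbitrary illustrative values, not tuned choices.

```python
# Toy MCTS over token continuations, assuming the hypothetical GPTWithHeads above.
import math
import torch

class Node:
    def __init__(self, tokens, prior):
        self.tokens = tokens          # 1-D LongTensor: token sequence reaching this node
        self.prior = prior            # policy probability of the last token
        self.children = {}            # token id -> Node
        self.visits = 0
        self.value_sum = 0.0

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=1.5):
    # AlphaGo-style selection: exploit mean value, explore by prior and visit counts.
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.mean_value + u

def mcts_step(model, root, num_simulations=32, top_k=8):
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Select: walk down the tree by PUCT until reaching a leaf.
        while node.children:
            node = max(node.children.values(), key=lambda ch: puct_score(node, ch))
            path.append(node)
        # 2. Expand + evaluate: one model call at the leaf gives policy, value, and p_done.
        with torch.no_grad():
            logits, value, p_done = model(node.tokens.unsqueeze(0))
        if p_done.item() < 0.5:       # only expand if the "is done" head says keep going
            probs = torch.softmax(logits[0, -1], dim=-1)
            top_p, top_ids = probs.topk(top_k)
            for p, tok in zip(top_p.tolist(), top_ids.tolist()):
                child_tokens = torch.cat([node.tokens, torch.tensor([tok])])
                node.children[tok] = Node(child_tokens, prior=p)
        # 3. Backup: propagate the value estimate to every node on the path.
        for n in path:
            n.visits += 1
            n.value_sum += value.item()
    if not root.children:
        return None, root             # the "is done" head stopped expansion at the root
    # Act: pick the most-visited child, as AlphaGo does at move-selection time.
    best_tok = max(root.children, key=lambda t: root.children[t].visits)
    return best_tok, root.children[best_tok]
```

In a full AlphaGo-style setup you’d then distill the root’s visit counts back into the policy (the language-model head) and train the value and “is done” heads on the human scores, closing the loop described above.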
I’d be curious whether this would enable GPT to significantly improve. Specifically, would you be able to generate longer works with less intervention?
See GPT-f, which combines a transformer model (with pre-trained language-model weights?) with AlphaZero-style training to learn to prove theorems.
Oh, I had actually seen that paper. Forgot that they did that though. Thanks!