This makes me wonder: how would Monte Carlo tree search do for GPT? And could you do AlphaGo-style IDA?
You’d need an analogue of the value network (or a value head), where current GPT seems analogous to the policy network. And then ideally you’d also want some analogue of winning / losing to ground out the evaluation.
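To make that concrete, here’s a minimal sketch (PyTorch) of what those extra heads might look like on top of a GPT-style trunk. The class name GPTWithHeads, the head names, and the assumption that the trunk returns final-layer hidden states are all hypothetical, not anything from an existing codebase.

```python
# Minimal sketch, assuming a decoder-only transformer "trunk" whose forward pass
# returns hidden states of shape (batch, seq, d_model). All names are hypothetical.
import torch
import torch.nn as nn

class GPTWithHeads(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.trunk = trunk                                # pre-trained transformer body
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token logits ~ policy network
        self.value_head = nn.Linear(d_model, 1)           # "good so far" score ~ value network
        self.done_head = nn.Linear(d_model, 1)            # probability the work is complete

    def forward(self, tokens):
        h = self.trunk(tokens)                            # (batch, seq, d_model)
        last = h[:, -1, :]                                # evaluate heads at the final position
        policy_logits = self.lm_head(h)                   # per-position next-token distribution
        value = torch.tanh(self.value_head(last))         # in [-1, 1], like AlphaGo's value output
        p_done = torch.sigmoid(self.done_head(last))      # in [0, 1]
        return policy_logits, value, p_done
```

Sharing the trunk keeps the value and “is done” estimates cheap: they reuse the same forward pass that already produces the next-token distribution.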
Maybe you could set it up like this --
start with a task description like, “write a poem in the style of e.e. cummings about the romance between cryptographers Alice and Bob”
feed the task description (with some boilerplate) into GPT, and have it start generating continuations
do MCTS on the continuations; use your value network (head) to evaluate the continuations against the task description; update the policy network based on the evaluations (a rough sketch of this loop follows the list)
include an “is done” head and evaluate it to decide when to stop
send completed works to humans for feedback; the feedback should include separate scores: “good so far” for the value head and “is a completed work” for the “is done” head.
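Here’s a toy sketch of what the MCTS loop over continuations could look like, in the spirit of AlphaGo’s PUCT selection. It assumes the hypothetical GPTWithHeads model sketched above; the expansion width, the 0.5 “is done” threshold, and c_puct are arbitrary illustrative values, not tuned choices.

```python
# Toy MCTS over token continuations, assuming the hypothetical GPTWithHeads above.
import math
import torch

class Node:
    def __init__(self, tokens, prior):
        self.tokens = tokens          # 1-D LongTensor: token sequence reaching this node
        self.prior = prior            # policy probability of the last token
        self.children = {}            # token id -> Node
        self.visits = 0
        self.value_sum = 0.0

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=1.5):
    # AlphaGo-style selection: exploit mean value, explore by prior and visit counts.
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.mean_value + u

def mcts_step(model, root, num_simulations=32, top_k=8):
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Select: walk down the tree by PUCT until reaching a leaf.
        while node.children:
            node = max(node.children.values(), key=lambda ch: puct_score(node, ch))
            path.append(node)
        # 2. Expand + evaluate: one model call at the leaf gives policy, value, and p_done.
        with torch.no_grad():
            logits, value, p_done = model(node.tokens.unsqueeze(0))
        if p_done.item() < 0.5:       # only expand if the "is done" head says keep going
            probs = torch.softmax(logits[0, -1], dim=-1)
            top_p, top_ids = probs.topk(top_k)
            for p, tok in zip(top_p.tolist(), top_ids.tolist()):
                child_tokens = torch.cat([node.tokens, torch.tensor([tok])])
                node.children[tok] = Node(child_tokens, prior=p)
        # 3. Backup: propagate the value estimate to every node on the path.
        for n in path:
            n.visits += 1
            n.value_sum += value.item()
    if not root.children:
        return None, root             # the "is done" head stopped expansion at the root
    # Act: pick the most-visited child, as AlphaGo does at move-selection time.
    best_tok = max(root.children, key=lambda t: root.children[t].visits)
    return best_tok, root.children[best_tok]
```

In a full AlphaGo-style setup you’d then distill the root’s visit counts back into the policy (the language-model head) and train the value and “is done” heads on the human scores, closing the loop described above.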
I’d be curious whether this would enable GPT to significantly improve. Specifically, would you be able to generate longer works with less intervention?
See GPT-f, which combines a transformer model (with pre-trained language-model weights?) with AlphaZero-style training to learn to prove theorems.
Oh, I had actually seen that paper. Forgot that they did that though. Thanks!