Crossposting from X
You assume that you don’t need to solve hard philosophical problems. But the superhuman researcher model probably will need to, right? Seems like a very difficult instance of weak-to-strong generalization, and I’m not sure how you would know whether you’ve successfully solved it.
(I’m referring to G.3 ALIGNMENT PLAN ASSUMPTIONS which says “We assume we do not need to solve hard philosophical questions of human values and value aggregation before we can align a superhuman researcher model well enough that it avoids egregiously catastrophic outcomes.”)
Here’s a previous discussion between @janleike and me on the topic of philosophical problems in AI alignment for anyone interested in more details on our perspectives https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach?commentId=pu3SJfqAZDSskQiyo
Maybe not, if the goal of the plan is not to achieve a full singularity, but just to use the superhuman researcher for uncontroversial problems like life extension and making money.
I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, a natural takeaway seems to be that the strong model doesn’t actually need to be fine-tuned on the task, and the weak supervisor is just eliciting knowledge that’s already there.
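(For anyone who hasn’t read the paper: PGR is its “performance gap recovered” metric, the fraction of the weak→strong gap that the weakly supervised student closes. A rough sketch of the definition, with made-up illustrative numbers to make the worry concrete:)

```latex
% PGR as defined in the weak-to-strong generalization paper:
\mathrm{PGR} \;=\; \frac{\text{weak-to-strong performance} \;-\; \text{weak supervisor performance}}
                        {\text{strong ceiling performance} \;-\; \text{weak supervisor performance}}

% Illustrative (made-up) numbers, not taken from the paper:
%   weak supervisor = 60\%, strong ceiling = 90\%, weak-to-strong fine-tune = 80\%
%   => PGR = (80 - 60) / (90 - 60) \approx 0.67
% If 5-shot prompting alone already gets the strong model to, say, ~75\%,
% then most of that apparent recovery was elicitable without the weak supervision signal at all.
```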