Crossposting from X
You assume that you don’t need to solve hard philosophical problems. But the superhuman researcher model probably will need to, right? Seems like a very difficult instance of weak-to-strong generalization, and I’m not sure how you would know whether you’ve successfully solved it.
(I’m referring to G.3 ALIGNMENT PLAN ASSUMPTIONS which says “We assume we do not need to solve hard philosophical questions of human values and value aggregation before we can align a superhuman researcher model well enough that it avoids egregiously catastrophic outcomes.”)
Here’s a previous discussion between @janleike and me on the topic of philosophical problems in AI alignment, for anyone interested in more detail on our perspectives: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach?commentId=pu3SJfqAZDSskQiyo
“But the superhuman researcher model probably will need to, right?”

Maybe not, if the goal of the plan is not to achieve a full singularity, but just to use the superhuman researcher for uncontroversial problems like life extension and making money.