I consider this system to be superhuman, and the problem of aligning it to be “alignment-complete” in the sense that if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk,
I find this line of reasoning (and even mentioning it) not useful. Any alignment solution will be alignment-complete, so the claim is tautological.
I think you’ve defined alignment as a hard problem, which no one will disagree with, but you also define any step taken towards solving the alignment problem as alignment-complete, and thus impossible unless it also seems infeasibly hard. Can there not be an iterative way to solve alignment? I think we can construct some trivial hypotheticals where we iteratively solve it.
For the sake of argument, say I created a superhuman math theorem prover, something that can solve IMO problems written in Lean with ease. I then use it to solve a lot of important math problems within alignment. This in turn affords us strong guarantees about certain elements of alignment or gradient descent. Can you convince me that building a narrow AI that is useful for alignment is as hard as aligning a generally superhuman AI?
What if we reframe it as a real-world example? Suppose a proof of the Riemann hypothesis begins with a handful of difficult but comparatively simple lemmas. Proving those lemmas is not as hard as proving the Riemann hypothesis, and we can keep decomposing the proof into parts that are simpler than the whole.
A step in a process being simpler than the end result of the process is not an argument against that step.