Really, the worst-case scenario is if we have something powerful enough to pose massive risks, but not powerful enough to help solve alignment, but that doesn't seem too likely to me.
This scenario seems like the almost-inevitable default to me!
If we can’t solve alignment ourselves, and sufficiently well that we can implement it when designing/building/training/testing any “powerful enough” AI, then we can’t expect any answer to a ‘The solution to AI alignment is …’ prompt to be aligned, i.e. to be a valid solution.
Or that the solution to alignment the AI proposes turns out to be really hard to check.
Maybe I’m interpreting “the solution to alignment” too literally, but I’m having a hard time understanding why this wouldn’t almost inevitably be effectively impossible to check. What kind of ‘error rate’ is good enough? Have we ever bounded the error of even remotely complex computations to a good enough degree? Given that any solution has to encode human values somehow (and to some degree), I’m having a hard time seeing how ‘checking’ a solution wouldn’t be one of the most difficult engineering challenges ever attempted.