Oh, I have no doubt that this is no guarantee of safety, but with the likelihood of AGI being something like GPT-N going up (and the solution to Alignment being nowhere in sight), I’m trying to think of purely practical measures to push the risks as low as they will go. Something like keeping the model parameters secret, maybe not even publicizing the fact of its existence, using it only by committee, and only to attempt to solve alignment problems, whose proposed solutions are then checked by the Alignment community. Really, the worst-case scenario is if we have something powerful enough to pose massive risks but not powerful enough to help solve alignment, though that doesn’t seem too likely to me. Or that the solution to alignment the AI proposes turns out to be really hard to check.
Really, the worst-case scenario is if we have something powerful enough to pose massive risks but not powerful enough to help solve alignment, though that doesn’t seem too likely to me.
This scenario seems like the almost-inevitable default to me!
If we can’t solve alignment ourselves, and sufficiently well that we can implement it when designing/building/training/testing any “powerful enough” AI, then we can’t expect any answer to a ‘The solution to AI alignment is …’ prompt to be aligned, i.e. a valid solution.
Or that the solution to alignment the AI proposes turns out to be really hard to check.
Maybe I’m interpreting “the solution to alignment” too literally, but I’m having a hard time understanding why this wouldn’t almost inevitably be effectively impossible to check. What kind of ‘error rate’ is good enough? Have we ever bounded even remotely complex computations to a good enough degree? Given that any solution has to encode human values somehow (and to some degree), I can’t see how ‘checking’ a solution wouldn’t be one of the most difficult engineering challenges ever completed.
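To make the ‘error rate’ worry a bit more concrete, here is a toy calculation. The numbers and the independence assumption are purely illustrative, not anything from the discussion above: if checking a proposed solution means verifying many separate claims or components, and each individual check misses a flaw with some small probability, the chance that at least one flaw slips through grows quickly with the number of checks.

```python
# Toy illustration only: assumes independent checks and a made-up
# per-check miss probability. It is not a model of any real
# verification process, just a sketch of how small error rates compound.

def p_at_least_one_miss(p_miss_per_check: float, n_checks: int) -> float:
    """Probability that at least one of n independent checks misses a flaw."""
    return 1.0 - (1.0 - p_miss_per_check) ** n_checks

for n in (10, 100, 1000):
    print(n, round(p_at_least_one_miss(0.01, n), 3))
# 10 checks  -> ~0.096
# 100 checks -> ~0.634
# 1000 checks -> ~1.0
```

Even with a 1% miss rate per check, verifying a thousand components independently leaves you almost certain to have missed something, which is why “bounding” the overall error of a complex artifact to a good enough degree seems so hard.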