One particular feature to emphasize in the fusion generator scenario: restricting access to the AI is necessary, since the AI is capable of giving out plans for garage nukes. But it’s not sufficient. The user may be well-intentioned in asking for a fusion power generator design, but they don’t know what questions they need to ask in order to check that the design is truly safe to release. One could imagine a very restricted group of users trying to use the AI to help the world, asking for a fusion power generator, and not thinking to ask whether the design can be turned into a bomb.
I expect exactly the same problem to affect the “ask the AI to solve alignment” strategy. Even if the user is well-intentioned, they don’t know what questions to ask to check that the AI has solved alignment and handled all the major safety problems. And even if the AI is capable of solving alignment, that does not imply that it will do so correctly in response to the user’s input. For instance, if it really is just a scaled-up GPT, then its response will be whatever it thinks human writing would look like, given that the writing starts with whatever alignment-related prompt the user input.
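To make the "it just continues the prompt" point concrete, here is a minimal sketch using an ordinary off-the-shelf model (gpt2 via Hugging Face transformers, purely as a stand-in for the hypothetical GPT-N; the setup is my own illustration, not anything from the scenario above). The model returns whatever text it judges most likely to follow the prompt, nothing more:

```python
# Minimal sketch: a GPT-style model simply continues the prompt with
# statistically likely text. "gpt2" is an illustrative stand-in for the
# hypothetical scaled-up model, not a claim about it.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The solution to AI alignment is"
completion = generator(prompt, max_new_tokens=50, do_sample=True)[0]["generated_text"]

# The output is whatever the model thinks human writing following this prompt
# would plausibly look like -- not a verified solution to alignment.
print(completion)
```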
Oh, I have no doubt that this is no guarantee of safety. But with the likelihood of AGI being something like GPT-N going up (and the solution to alignment nowhere in sight), I'm trying to think of purely practical measures to push the risks as low as they will go: keeping the model parameters secret, maybe not even publicizing the fact of its existence, using it only by committee, and only to attempt to solve alignment problems, whose proposed solutions are then checked by the alignment community. Really the worst-case scenario is if we have something powerful enough to pose massive risks but not powerful enough to help solve alignment, though that doesn't seem too likely to me. Or that the solution to alignment the AI proposes turns out to be really hard to check.
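To illustrate the "using it only by committee" part, here is a minimal sketch of what a committee-gated query interface could look like. All names here (CommitteeGate, query_model, the k-of-n approval rule) are my own hypothetical illustration under the assumptions above, not an existing system or a definitive design:

```python
# Sketch: a prompt only reaches the model once at least k of the n committee
# members have approved it. Everything here is illustrative.
from typing import Callable

class CommitteeGate:
    def __init__(self, members: set[str], approvals_required: int,
                 query_model: Callable[[str], str]):
        self.members = members
        self.approvals_required = approvals_required
        self.query_model = query_model
        self.approvals: dict[str, set[str]] = {}  # prompt -> approving members

    def approve(self, member: str, prompt: str) -> None:
        if member not in self.members:
            raise PermissionError(f"{member} is not on the committee")
        self.approvals.setdefault(prompt, set()).add(member)

    def submit(self, prompt: str) -> str:
        approvers = self.approvals.get(prompt, set())
        if len(approvers) < self.approvals_required:
            raise PermissionError("Not enough committee approvals for this prompt")
        # Only now does the prompt reach the (secret, unadvertised) model.
        return self.query_model(prompt)

# Example usage (with a stub model):
# gate = CommitteeGate({"alice", "bob", "carol"}, approvals_required=2,
#                      query_model=lambda p: "<model output>")
# gate.approve("alice", "How do we verify corrigibility?")
# gate.approve("bob", "How do we verify corrigibility?")
# print(gate.submit("How do we verify corrigibility?"))
```

The gate only restricts who can ask; it does nothing about the harder problem of whether anyone can check the answer, which is the point at issue below.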
Really the worst-case scenario is if we have something powerful enough to pose massive risks but not powerful enough to help solve alignment, though that doesn't seem too likely to me.
This scenario seems like the almost-inevitable default to me!
If we can’t solve alignment ourselves, and sufficiently well that we can implement it when designing/building/training/testing any “powerful enough” AI, then we can’t expect any answer to a ‘The solution to AI alignment is …’ prompt to be aligned, i.e. a valid solution.
Or that the solution to alignment the AI proposes turns out to be really hard to check.
Maybe I’m interpreting “the solution to alignment” too literally, but I’m having a hard time understanding why this wouldn’t almost inevitably be effectively impossible to check. What kind of ‘error rate’ is good enough? Have we ever bounded even remotely complex computations to a good enough degree? Given that any solution has to encode human values somehow (and to some degree), I’m having a hard time thinking of a way that ‘checking’ a solution wouldn’t be one of the most difficult engineering challenges ever attempted.
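As a rough back-of-envelope illustration of why the error rate matters so much, under my own simplifying assumption that a proposed solution decomposes into n roughly independent claims, each of which reviewers check correctly with probability p:

```python
# Back-of-envelope sketch (illustrative assumption: n independent claims,
# each checked correctly with probability p). The chance that the whole
# solution survives checking without a missed error scales like p**n.
for p in (0.99, 0.999):
    for n in (100, 1000):
        print(f"p={p}, n={n}: P(no claim is checked wrongly) ~ {p**n:.3g}")
```

Even 99% per-claim confidence over a thousand claims leaves essentially no confidence in the whole, which is one way of putting why the checking step looks so daunting.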