Do you mean as a checkpoint/breaker? Or more in the sense of RLHF?
The problem with either is that the limiting factor is human attention. You can’t have your best and brightest focusing full-time on what the AI is outputting, because quality attention is a scarce resource. So you do some combination of the following:
- slow down the AI to a level that is manageable (which will greatly limit its usefulness),
- discard ideas that are too strange (which will greatly limit its usefulness),
- limit its intelligence (which will greatly limit its usefulness),
- don’t check so thoroughly (which will make it more dangerous),
- use less clever checkers (which will greatly limit its usefulness and/or make it more dangerous), or
- check it reeeaaaaaaallllly carefully during testing and then hope for the best (which is reeeeaaaallly dangerous).
You also need to be sure that it can’t outsmart the humans in the loop, which pretty much comes back to boxing it in.
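The attention bottleneck above can be made concrete with a toy back-of-the-envelope model. Everything here (the function name, the numbers, the linear review-capacity assumption) is my own illustration, not anything from the comment; it just shows how the listed knobs trade coverage against usefulness:

```python
# Toy model (illustrative assumptions only): an AI emits some number of
# outputs per day, and each human reviewer can carefully check a fixed
# number of them. Coverage is the fraction of sampled outputs that
# actually receive a careful review.

def oversight_coverage(outputs_per_day: float,
                       reviewers: int,
                       reviews_per_reviewer_per_day: float,
                       sample_fraction: float = 1.0) -> float:
    """Fraction of sampled outputs that get a careful human review."""
    sampled = outputs_per_day * sample_fraction
    capacity = reviewers * reviews_per_reviewer_per_day
    if sampled == 0:
        return 1.0
    return min(1.0, capacity / sampled)

# 10,000 outputs/day, 5 reviewers doing 40 careful reviews each:
full = oversight_coverage(10_000, 5, 40)       # 2% coverage
# "Slow down the AI" until everything can be checked:
slowed = oversight_coverage(200, 5, 40)        # 100% coverage, 50x less output
# "Don't check so thoroughly": 400 shallow reviews per reviewer/day:
shallow = oversight_coverage(10_000, 5, 400)   # 20% coverage, weaker checks
```

The point of the sketch is that every lever that raises coverage does so by shrinking throughput or review depth, matching the trade-offs in the list.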