Here are some objections I have to your post:
How are you going to specify the amount of optimization pressure the AI exerts on answering a question/solving a problem? Are you hoping to start out training a weaker AI that you later augment?
If so, I’d be concerned about any distributional shifts in its optimization process that occur during that transition.
If not, it’s not clear to me how you ensure the AI ‘is safe’ throughout this training process.
At the point where you, the human, are labeling data to train the AI to identify concepts with measurements/features, you now have a loss function that depends on human feedback, and which, once again, you can’t specify in terms of the concepts you want the AI to identify. It seems like the AI is pretty incentivized to be deceptive here (or really at any point in the process).
I.e., if it’s superintelligent and you accidentally gave it the loss function ‘maximize paperclips’, but it models humans as potentially not realizing they gave it this loss function, then I think it would act indistinguishably from an AI with the loss function you intended (at least during the stage of training you outline).
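To make the gap I’m pointing at concrete, here is a minimal sketch (all names hypothetical, not from your post) of the difference between the loss you *want*, defined over the concept itself, and the loss you can actually *train on*, defined over human labels:

```python
# Hypothetical sketch: the objective you intend vs. the objective you can compute.

def intended_loss(model, x, true_concept):
    # What you want: penalize the model when it misidentifies the concept.
    # In practice you have no direct access to `true_concept`.
    return float(model(x) != true_concept(x))

def trainable_loss(model, x, human_label):
    # What you can actually optimize: agreement with a human labeler's judgment.
    # "Match the labels" only tracks the intended concept insofar as the
    # labels do, which is exactly where the incentive to game the labeler
    # (rather than learn the concept) comes in.
    return float(model(x) != human_label)
```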
Even if it does at first do things that look like what a paperclip maximizer would try to do, instead of what you actually want it to do (label things appropriately), and even if your safeguards are strong enough to stop those attempts (say, it tries to get a human user to upload it to the internet), then I think that as you train away actions like this, you’re not just training it to have a better utility function or whatever; you’re training it to be more effectively deceptive.