The AI may stumble upon a plan which contains a sequence of words that hacks the approver’s mind, making him approve pretty much anything. Such plans may even be easier for the AI to generate than plans for saving the world, seeing as Eliezer has won some AI-box experiments but hasn’t yet solved world hunger.
Um, does the approver also have to approve each step of the computation that builds the plan to be submitted for approval? Isn’t this an infinite regress?
> The AI may stumble upon a plan which contains a sequence of words that hacks the approver’s mind, making him approve pretty much anything. Such plans may even be easier for the AI to generate than plans for saving the world, seeing as Eliezer has won some AI-box experiments but hasn’t yet solved world hunger.
You mean accidentally stumble upon such a sequence of words? Because deliberately building one would certainly not be approved.
> Um, does the approver also have to approve each step of the computation that builds the plan to be submitted for approval? Isn’t this an infinite regress?
Consider “Ask for approval” as an auto-approved action. Not sure if that solves it; I’ll give it a little more thought.
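
A minimal sketch of how that base case might cut off the regress, assuming a toy approver callback. All names here (`is_approved`, `run_plan`, `ASK_FOR_APPROVAL`) are hypothetical illustrations of the idea, not anyone’s actual proposal:

```python
from typing import Callable, List

# The one action that never itself requires approval.
ASK_FOR_APPROVAL = "ask_for_approval"

def is_approved(action: str, approver: Callable[[str], bool]) -> bool:
    # Base case: requesting approval is auto-approved. Without this,
    # checking any action would require approving the check, which would
    # require approving *that* check, and so on forever.
    if action == ASK_FOR_APPROVAL:
        return True
    # Every other action must be explicitly signed off by the approver.
    return approver(action)

def run_plan(plan: List[str], approver: Callable[[str], bool]) -> None:
    for action in plan:
        # Asking the approver is itself an action the agent takes, but
        # the base case above means this check terminates immediately
        # instead of spawning another round of meta-approval.
        if not is_approved(ASK_FOR_APPROVAL, approver):
            raise RuntimeError("unreachable: asking is always approved")
        if not is_approved(action, approver):
            raise PermissionError(f"Approver rejected: {action!r}")
        print(f"Executing approved action: {action}")

if __name__ == "__main__":
    # Toy approver: rejects anything that looks like approver manipulation.
    def toy_approver(action: str) -> bool:
        return "hack the approver" not in action

    run_plan(["draft plan", "submit plan for review"], toy_approver)
```

The point of the sketch is only that the approval check has exactly one action it never recurses on, so “ask for approval” plays the role of a base case. Whether that actually resolves the regress for each step of the computation that builds the plan (as opposed to each step of executing it) is the part still in question.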