Without even getting into whether your specific reward heuristic is misaligned, it seems to me that you've just shifted the problem slightly outside the scope of your description of the system, by specifying that all of the work will be done by subsystems that you're simply assuming will be safe.
“paperclip quality control” has just as much potential for misalignment in the limit as does paperclip maximization, depending on what kind of agent you use to accomplish it. So, even if we grant the assumption that your heuristic is aligned, we are merely left with the task of designing a bunch of aligned agents to do subtasks.
Paperclip quality control is an agent that was trained on simulated sensor inputs (camera images and whatever else) of variations of paperclips. Paperclips that are not within a narrow range of dimensions and other measurements for correctness are rejected.
It doesn't have any learning ability. It is literally an overgrown digital filter: it takes in an input image and outputs true or false to accept or reject the paperclip (and probably another vector specifying which checks failed).
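A minimal sketch of what such an "overgrown digital filter" might look like, assuming the vision pipeline upstream has already reduced the image to measured dimensions. The feature names and tolerance ranges here are invented for illustration; the point is only that the component is a fixed, memoryless check with no learning.

```python
# Hypothetical QC filter: measured features in, (accept, failed_checks) out.
# No memory, no learning -- the tolerance spec is fixed at deployment.
# Feature names and ranges below are illustrative assumptions, not real specs.

SPEC = {
    "length_mm": (32.0, 34.0),
    "wire_diameter_mm": (0.95, 1.05),
    "bend_angle_deg": (178.0, 182.0),
}

def qc_check(measurements):
    """Return (accepted, failed_checks) for one paperclip.

    `measurements` maps feature name -> measured value, as produced
    by some fixed upstream vision pipeline (not modeled here).
    A missing or NaN measurement counts as a failed check.
    """
    failed = [
        name for name, (lo, hi) in SPEC.items()
        if not (lo <= measurements.get(name, float("nan")) <= hi)
    ]
    return (len(failed) == 0, failed)
```

Because the spec is a static lookup and the function is pure, the worst-case failure mode is rejecting good paperclips or passing bad ones, not pursuing any goal of its own.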
We can describe every subagent for everything the factory needs as a machine so limited and narrow-domain that alignment issues are not possible. (Especially as most will have no memory and all will have learning disabled.)