That divergence between revealed “preferences” and “preferences” in the sense of a goal passed to some kind of search/planning/decision process potentially opens up some approaches to solving the problem.
If the agent is not aware of all the potential ways it could cause harm, we cannot expect it to voluntarily trigger a shutdown when necessary. This is the furthest I have gotten in exploring the problem of corrigibility. My current understanding suggests that creating a comprehensive dataset of possible failure scenarios is essential for building a strongly aligned AGI. Once the AI is fully ‘aware’ of its role in these catastrophic or failure scenarios, as well as of its own capabilities and limitations, it will be better equipped to make informed decisions, especially when presented with a shutdown option.
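To make the idea concrete, here is a minimal sketch, assuming failure scenarios can be represented as simple predicates over a proposed action; the `FailureScenario` dataclass, the scenario list, and the `review_action` helper are hypothetical names invented for illustration, not an existing system.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: the agent checks each proposed action against a dataset
# of known failure scenarios and voluntarily surfaces the shutdown option when
# it recognizes its own role in one of them.

@dataclass
class FailureScenario:
    description: str
    matches: Callable[[dict], bool]  # predicate over a proposed action

FAILURE_SCENARIOS = [
    FailureScenario("exceeds approved resource budget",
                    lambda a: a.get("resources", 0) > 100),
    FailureScenario("interferes with its own shutdown mechanism",
                    lambda a: a.get("target") == "shutdown_module"),
]

def review_action(action: dict) -> str:
    """Return 'proceed', or 'offer_shutdown' if a known failure scenario matches."""
    for scenario in FAILURE_SCENARIOS:
        if scenario.matches(action):
            return f"offer_shutdown: {scenario.description}"
    return "proceed"

print(review_action({"target": "shutdown_module"}))  # offer_shutdown: ...
print(review_action({"resources": 5}))               # proceed
```

The hard part the comment points at is, of course, whether such a dataset could ever be comprehensive; the sketch only shows where it would sit in the decision loop.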
As I understand it, the shutdown problem isn’t about making the AI correctly decide whether it ought to be shut down. We’d surely like to have an AI that always makes correct decisions, and if we succeed at that, then we don’t need special logic about shutting down; we can just apply the general make-correct-decisions procedure and do whatever the correct thing is.
But the idea here is to have a simpler Plan B that will prevent the worst-case scenarios even if you make a mistake in the fully-general make-correct-decisions implementation, and it starts making incorrect decisions. The goal is to be able to shut it down anyway, even when the AI is not equipped to correctly reason out the pros and cons of shutting down.
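A minimal sketch of that control flow, assuming a single agent loop; `shutdown_signal`, `agent_decide`, and `run` are hypothetical names for illustration. The only point is that the shutdown check sits outside the agent’s decision procedure, so it still works when that procedure is broken.

```python
import threading

# Hypothetical sketch of the "Plan B" control flow: the shutdown check happens
# outside the agent's own decision procedure, so it still works even when that
# procedure has started producing incorrect decisions.

shutdown_signal = threading.Event()  # set by the operators, never by the agent

def agent_decide(observation):
    # Stand-in for the fully-general make-correct-decisions procedure,
    # which we must assume can be arbitrarily wrong.
    return {"action": "continue"}

def run(observation, max_steps=1000):
    for _ in range(max_steps):
        # Plan B: checked before the agent's reasoning is consulted, so the
        # agent never gets to weigh the pros and cons of shutting down.
        if shutdown_signal.is_set():
            return "halted"
        decision = agent_decide(observation)
        observation = {"last_action": decision["action"]}
    return "step limit reached"

shutdown_signal.set()   # the operators decide, regardless of the agent
print(run({"obs": 0}))  # "halted"
```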
> As I understand it, the shutdown problem isn’t about making the AI correctly decide whether it ought to be shut down. We’d surely like to have an AI that always makes correct decisions, and if we succeed at that, then we don’t need special logic about shutting down; we can just apply the general make-correct-decisions procedure and do whatever the correct thing is.
Yes, this outcome stems from the idea that if we can consistently enable an AI system to initiate a shutdown when it recognizes potential harm to its users, even in worst-case scenarios, we may eventually move beyond the need for a precise ‘shutdown button / mechanism’ and instead aim for an advanced version that allows the AI to pause and present alternative options.
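A rough sketch of what that “advanced version” might look like, assuming the list of alternatives and the escalation channel are given; `pause_and_present` is a hypothetical name, and a real system would block on operator input rather than auto-selecting.

```python
# Hypothetical sketch: instead of a hard stop, the agent suspends autonomous
# action and presents alternative options for a human to choose from.

def pause_and_present(options):
    print("Agent paused. Proposed alternatives:")
    for i, option in enumerate(options, start=1):
        print(f"  {i}. {option}")
    # A real system would wait for operator input here; for illustration we
    # fall back to the most conservative option.
    return options[-1]

choice = pause_and_present([
    "continue the current plan with extra monitoring",
    "hand the task off to a human operator",
    "shut down and preserve logs for review",
])
print("Selected:", choice)
```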
> But the idea here is to have a simpler Plan B that will prevent the worst-case scenarios even if you make a mistake in the fully-general make-correct-decisions implementation, and it starts making incorrect decisions. The goal is to be able to shut it down anyway, even when the AI is not equipped to correctly reason out the pros and cons of shutting down.
I have experimented with numerous simpler scenarios and consistently arrived at the conclusion that the AI should have the capability to willingly initiate a shutdown, which is not simple. When we scale this up to worst-case scenarios, we are back in the same territory I am advocating for: building a mechanism that gives the AI an understanding of all failure scenarios from the outset.