We don’t know how to formally define corrigibility, and this is part of the reason we haven’t solved it yet. Arbital defines corrigibility as:
[the agent] doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it
I think this is not the best definition of corrigibility. A better one is given in MIRI’s paper, section 1.1:
We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies). That is, corrigible reasoning should only allow an agent to create new agents if these new agents are also corrigible.
In theory, an agent that successfully integrates these four behaviors should not stay misaligned, because we retain the ability to correct or shut it down before things go wrong.
Rob Miles also has relevant material here on why the shutdown problem is so difficult to solve.
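To make the difficulty concrete, here is a minimal toy sketch (my own illustration, not taken from MIRI’s paper or Rob Miles’ material) of why a naive expected-utility maximizer tends to resist shutdown: a plan that disables the shutdown button strictly dominates one that leaves it intact, because being shut down forfeits the reward the agent would otherwise collect.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """A candidate plan under a deliberately crude world model (hypothetical example)."""
    name: str
    reward_if_running: float  # reward collected if the agent keeps running
    p_shutdown: float         # probability the programmers shut it down mid-task


def expected_utility(plan: Plan) -> float:
    # If shut down, the agent collects nothing in this toy model.
    return (1 - plan.p_shutdown) * plan.reward_if_running


plans = [
    Plan("comply: leave shutdown button intact", reward_if_running=10.0, p_shutdown=0.5),
    Plan("resist: disable shutdown button first", reward_if_running=10.0, p_shutdown=0.0),
]

for p in plans:
    print(f"{p.name}: EU = {expected_utility(p):.1f}")

# The resisting plan wins under naive expected-utility maximization,
# which is exactly the incentive a corrigible agent must not act on.
best = max(plans, key=expected_utility)
print("chosen plan:", best.name)
```

The hard part, as the shutdown problem highlights, is specifying a utility function that removes this incentive without also making the agent indifferent to (or actively seeking) its own shutdown.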