I am pro desiderata lists because each desideratum bounds the badness of an AI’s actions and protects against failure modes in its own way. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to being corrigible in the “hard problem of corrigibility” sense, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible; it’s to get good outcomes. You agree with shutdownability, but I think this also applies to other desiderata like low impact. What if the AI kills my parents because, for some weird reason, that makes it more corrigible?
I, too, started out pro desiderata lists. Here’s one I wrote. This is something I discussed a bunch with Max. I eventually came around to the importance of having a singular goal which outweighs all others, the ‘singular target’, and to the view that corrigibility is uniquely suited to be this singular target. That means that all other goals are subordinate to corrigibility, and pursuing them (upon being instructed to do so by your operator) is seen as part of what it means to be properly corrigible.
Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
If the “perfectly corrigible agent” is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
If the “perfectly corrigible agent” can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn’t want to remove.
The idea of having the only ‘true’ goal being corrigibility is that all other sub-goals can just come or go. They shouldn’t be sticky. If I want to go to the kitchen to get a snack, and there is a closed door on my path, I may acquire the sub-goal of opening the door on my way. That doesn’t mean that if someone closes the door after I pass through, I should turn around and open it again. Having already passed through the door, the door is no longer an obstacle to my snack-obtaining goal, and I wouldn’t interrupt my progress towards the snack to turn around and re-open it. Similarly, obtaining a snack is part of satisfying my hunger and need for nourishment. If, on my way to the snack, I magically became no longer hungry or in need of nourishment, then I’d stop pursuing the ‘obtain a snack’ goal.
So in this way of thinking, most goals we pursue as agents are sub-goals in a hierarchy where the top goals, the fundamental goals, are our most basic drives like survival / homeostasis, happiness, or security. The corrigible agent’s most fundamental goal is corrigibility. All others are contingent non-sticky sub-goals.
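To make the “non-sticky sub-goal” picture a bit more concrete, here is a minimal sketch in Python of the snack/door example. Everything in it (the `Goal` class, the `still_wanted` check, the `world` dictionary) is my own illustrative construction, not a proposal for how a real agent would represent goals; the point is just that a sub-goal stays alive only while the goal it serves is still wanted and it hasn’t yet done its job.

```python
class Goal:
    def __init__(self, name, is_satisfied, parent=None):
        self.name = name
        self.is_satisfied = is_satisfied  # callable: world state -> bool
        self.parent = parent              # the goal this one is instrumental to

    def still_wanted(self, world):
        # A fundamental goal (no parent) is wanted while it remains unsatisfied.
        if self.parent is None:
            return not self.is_satisfied(world)
        # A sub-goal is wanted only while its parent is still wanted and it
        # hasn't yet served its purpose.
        return self.parent.still_wanted(world) and not self.is_satisfied(world)


world = {"hungry": True, "has_snack": False, "door_open": False, "passed_door": False}

satisfy_hunger = Goal("satisfy hunger", lambda w: not w["hungry"])
get_snack = Goal("get a snack", lambda w: w["has_snack"], parent=satisfy_hunger)
open_door = Goal("open the kitchen door",
                 lambda w: w["passed_door"] or w["door_open"],
                 parent=get_snack)

assert open_door.still_wanted(world)      # the closed door blocks the path

world.update({"passed_door": True, "door_open": False})
assert not open_door.still_wanted(world)  # someone re-closed it; I no longer care

world["hungry"] = False
assert not get_snack.still_wanted(world)  # hunger gone, snack sub-goal dropped
```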
Part of what it means to be corrigible is to be obedient. Thus, the operator can issue certain requests as standing orders: for instance, telling the corrigible agent that it has a standing order to be honest with the operator, and then giving it some temporary object-level goal like buying groceries or doing cancer research. In some sense, honesty towards the operator is an emergent sub-goal of corrigibility, because the agent needs to be understood by the operator in order to be effectively corrected. There could be edge cases (or misinterpretations by the fallible agent) in which the agent doesn’t think honesty should be prioritized in the course of pursuing corrigibility and its object-level sub-goals. Giving the explicit order to be honest should thus be a harmless addition which might help in those edge cases. In this way, you get to impose your desiderata list as sub-goals, by instructing the agent to maintain the desiderata you have in mind as standing orders. Where they come into conflict with corrigibility (if they ever do), corrigibility always wins.
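As a toy sketch of that priority ordering (again, hypothetical names and data structures of my own, not anyone’s actual design), here is what “standing orders as instructed sub-goals, with corrigibility always winning” might look like in Python. The ordering is the whole point: corrigibility filters first, standing orders second, and only then does the current object-level task get a say.

```python
class CorrigibleAgent:
    def __init__(self):
        self.standing_orders = []   # e.g. "be honest with the operator"
        self.current_task = None    # e.g. "buy groceries", "do cancer research"

    def receive_standing_order(self, order):
        self.standing_orders.append(order)

    def receive_task(self, task):
        self.current_task = task

    def choose(self, options):
        """Filter candidate actions by corrigibility, then by standing orders,
        then prefer whatever advances the current task."""
        # Corrigibility always wins: never consider actions that undermine it.
        options = [a for a in options if not a.get("undermines_corrigibility")]
        # Standing orders come next: drop actions that violate any of them.
        options = [a for a in options
                   if not set(a.get("violates", [])) & set(self.standing_orders)]
        # Among what remains, prefer an action that advances the current task.
        for a in options:
            if a.get("advances") == self.current_task:
                return a
        return options[0] if options else None


agent = CorrigibleAgent()
agent.receive_standing_order("be honest with the operator")
agent.receive_task("buy groceries")

actions = [
    {"name": "lie about the budget", "advances": "buy groceries",
     "violates": ["be honest with the operator"]},
    {"name": "disable the shutdown button", "advances": "buy groceries",
     "undermines_corrigibility": True},
    {"name": "report the true budget and buy the groceries",
     "advances": "buy groceries"},
]
print(agent.choose(actions)["name"])
# -> "report the true budget and buy the groceries"
```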