This suggests that part of corrigibility could be framed as bargaining, using a solution concept that's much more favorable to the principal than fairness, to an extent bounded only by anti-goodharting. Fairness (and its less fair variants) usually needs a notion of status quo, including one for the principal, and the status quo is somewhat similar to the consequences of shutting down (especially when the agent controls much of the world), which might be explained as the result of extreme anti-goodharting. And less extreme anti-goodharting makes an agent vulnerable to modification out-of-permitted-distribution, perhaps by the agent itself fulfilling an appropriate bargain.
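For concreteness (just an illustration of what "a solution concept tilted toward the principal" could mean, not anything specific to corrigibility), the asymmetric Nash bargaining solution makes both the weighting and the role of the status quo explicit:

$$\max_{u \in F,\; u \ge d}\; (u_P - d_P)^{\alpha}\,(u_A - d_A)^{1-\alpha}$$

where $F$ is the feasible set of payoff pairs, $d = (d_P, d_A)$ is the status quo (disagreement) point for principal and agent, symmetric fairness is $\alpha = 1/2$, and $\alpha \to 1$ pushes the outcome almost entirely in the principal's favor. The disagreement point $d$ is where something like "consequences of shutting down" would enter.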
Another thing this reminds me of is the ASP Problem (Agent Simulates Predictor), a Newcomb's Problem variant where a stronger Agent must refrain from simulating a weaker Predictor/Omega and making straightforward use of the result (discarding it and two-boxing); instead it might want to think less and make itself predictable to the Predictor despite its advantage. Though the reason to do that lies entirely in the Agent's values and not in a bargaining concept. This serves to make a finer distinction between a program that happens to say "NO" if you decide to mercilessly run it to completion, and a rock with the word "NO" written on it. You can't control the rock, but you might be able to control the program if it's attempting to reason about you, by not making it too difficult for the program to succeed.
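Here's a toy sketch of the payoff structure, with made-up numbers and a deliberately crude "compute budget" stand-in for predictability (none of these specifics are part of the original problem statement):

```python
# Toy model of the ASP (Agent Simulates Predictor) setup.
# The compute budget and all payoffs below are illustrative assumptions.

BIG_BOX = 1_000_000    # opaque box, filled only if the Predictor expects one-boxing
SMALL_BOX = 1_000      # transparent box, always available

PREDICTOR_BUDGET = 10  # how much Agent deliberation the Predictor can simulate

def predictor(agent_cost: int, agent_choice: str) -> str:
    """Predicts correctly if the Agent's reasoning fits in the budget;
    otherwise defaults to expecting two-boxing."""
    if agent_cost <= PREDICTOR_BUDGET:
        return agent_choice   # Agent is legible: correct prediction
    return "two-box"          # Agent out-thinks the Predictor: box left empty

def payoff(agent_cost: int, agent_choice: str) -> int:
    prediction = predictor(agent_cost, agent_choice)
    opaque = BIG_BOX if prediction == "one-box" else 0
    return opaque + (SMALL_BOX if agent_choice == "two-box" else 0)

# Strategy 1: simulate the Predictor (expensive), then two-box on the result.
print(payoff(agent_cost=100, agent_choice="two-box"))   # 1_000

# Strategy 2: think less, stay legible to the Predictor, and one-box.
print(payoff(agent_cost=5, agent_choice="one-box"))     # 1_000_000
```

In this sketch the case for staying predictable comes entirely from the Agent's own payoffs, matching the point above that no bargaining concept is doing the work.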