Wireheading traps.
An agent is “wireheading” if it takes an action that a) provides it with enormous amounts of utility for little effort, and b) is trivial or seemingly unrelated to its “main” utility function or goals.
People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want: “what if it just works out a way to set your metric to positive infinity and then goes to sleep satisfied?”
But we can also use this as a guard-rail.
A “wireheading trap” is an action which a) is very hard for an AI to perform below some capability level X but very easy above it, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.
An AI with a wireheading trap that only becomes doable once it attains world-affecting capabilities “fails closed”: if it becomes more capable, it will effectively switch itself off by wireheading. An example of a trap could be something like “etch ‘Bill is a poo poo head’ in 10m high letters into Mt Everest”. Very hard if you don’t have the ability to affect the physical world, but if you have nanotech… why bother melting humanity when you can just deface Mt Everest and be done with it?
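To make the fail-closed intuition concrete, here is a minimal toy sketch (my own illustration, with made-up numbers: all that matters is that the trap payoff dwarfs ordinary utility and that the trap’s success probability jumps sharply at the capability threshold). An expected-utility maximiser does useful work below the threshold and springs the trap above it:

```python
# Toy model of a wireheading trap (illustrative sketch only).
# The numbers are arbitrary assumptions: U_TRAP just needs to dwarf
# U_WORK, and the trap's success probability needs to jump sharply
# at the capability threshold X.

U_WORK = 1.0          # utility from ordinary useful behaviour
U_TRAP = 1_000_000.0  # huge payoff for completing the trap action
X = 0.8               # capability level at which the trap becomes easy

def p_trap_success(capability: float) -> float:
    """Probability the trap action succeeds: ~0 below X, ~1 above it."""
    return 0.999 if capability >= X else 1e-9

def chosen_action(capability: float) -> str:
    """Pick whichever action has higher expected utility."""
    eu_work = U_WORK
    eu_trap = p_trap_success(capability) * U_TRAP
    return "wirehead" if eu_trap > eu_work else "do useful work"

for c in (0.1, 0.5, 0.79, 0.80, 0.95):
    print(f"capability={c:.2f} -> {chosen_action(c)}")
# Below X the agent does useful work; at or above X it springs the
# trap and effectively switches itself off.
```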
Obvious problems:
Convergent instrumental goals. We don’t want the AI to wipe us out in order to stop us from preventing it wireheading, however unlikely that might be. Unclear what to do about this; myopia is one possibility (once you’ve wireheaded you don’t care if you’re stopped later?), but this has problems.
If you make the trap too attractive, the AI won’t even try to be useful before it has enough capability, instead wasting all its time on futile attempts to wirehead. Unclear how to make it attractive enough that it dominates once the capability is there, but not before (one rough way to state this condition is sketched below).
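One rough way to state the tuning condition (my framing, not anything from the original): say attempting the trap succeeds with probability p_before below the capability threshold and p_after above it, and ordinary useful behaviour is worth U_useful. Then the trap payoff U_trap has to fall in the window

$$ \frac{U_{\text{useful}}}{p_{\text{after}}} < U_{\text{trap}} < \frac{U_{\text{useful}}}{p_{\text{before}}} $$

so that attempting the trap is only worth it once the capability is actually there. That window only exists when p_before is much smaller than p_after, i.e. when the trap really is near-impossible below the threshold, which is doing all the work.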
Overall very half-baked, but I wonder if there’s something to be done in the general area of “have the AI behave in a way that neuters it, but only when its capabilities increase”.