Complete Feedback

A simple, weak notion of corrigibility is having a “complete” feedback interface. In logical induction terms, I mean the AI trainer can insert any trader into the market. I want to contrast this with “partial” feedback, in which only some propositions get feedback and others (“latent” propositions) form the structured hypotheses which help predict the observable propositions. RL is an example: only rewards and sense-data are observed.

(Note: one might think that the ability to inject traders into LI is still “incomplete” because traders can only give feedback on the propositions themselves, not on other traders; so the trader weights constitute “latents” being estimated. However, a trader can effectively vote against another trader by computing all that trader’s trades and counterbalancing them. Of course, we can also more directly facilitate this, EG giving the user the ability to directly modify trader weights, and even giving traders an enhanced ability to bet on each other’s weights.)
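To make the counterbalancing move concrete, here is a minimal Python sketch. It is a toy, not the logical induction construction: traders here are plain functions from prices to share quantities, there are no wealth weights or budgets, and the names `optimist`, `counterbalance`, and `net_demand` are all hypothetical.

```python
from typing import Callable, Dict, List

Prices = Dict[str, float]   # proposition -> current market price in [0, 1]
Trades = Dict[str, float]   # proposition -> signed number of shares bought
Trader = Callable[[Prices], Trades]


def optimist(prices: Prices) -> Trades:
    """Hypothetical trader: buys one share of anything priced below 0.9."""
    return {prop: 1.0 for prop, price in prices.items() if price < 0.9}


def counterbalance(target: Trader, strength: float = 1.0) -> Trader:
    """Build a trader that takes the opposite side of every trade `target` makes."""
    def counter(prices: Prices) -> Trades:
        # Recompute the target's trades at the current prices and negate them.
        return {prop: -strength * qty for prop, qty in target(prices).items()}
    return counter


def net_demand(traders: List[Trader], prices: Prices) -> Trades:
    """Sum the trades of all traders (toy assumption: equal weights)."""
    total: Trades = {}
    for trader in traders:
        for prop, qty in trader(prices).items():
            total[prop] = total.get(prop, 0.0) + qty
    return total


prices = {"stay_still_is_best": 0.4, "sky_is_blue": 0.95}
market = [optimist, counterbalance(optimist)]
print(net_demand(market, prices))  # {'stay_still_is_best': 0.0}
```

With equal weights the inserted trader exactly cancels the target's influence on the market. In a weighted market, how much it cancels would depend on the relative weights, which is why directly editing trader weights is the more straightforward interface.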

Why is this close to corrigibility?

The idea is that the trainer can enact “any” modification they’d like to make to the system as a trader. In some sense (which I need to articulate better), the system doesn’t have any incentive to avoid this feedback.

For example, if the AI predicts that the user will soon give it the feedback that staying still and doing nothing is best, then it will immediately start staying still and doing nothing. If this is undesirable, the user can instead plan to give the feedback that the AI should “start staying still from now forward until I tell you otherwise” or some such.

This is not to say that the AI universally tries to update in whatever direction it anticipates the user might update it towards later. This is not like the RL setting, where there is no way for trainers to give feedback ruling out the “whatever the user will reward is good” hypothesis. The user can and should give feedback against this hypothesis!

The AI system accepts all previous feedback, but it may or may not trust anticipated future feedback. In particular, it should be trained not to trust feedback it would get by manipulating humans (so that it doesn’t see itself as having an incentive to manipulate humans to give specific sorts of feedback).

I will call this property of feedback “legitimacy”. The AI has a notion of when feedback is legitimate, and it needs to work to keep feedback legitimate (by not manipulating the human).

It’s still the case that if a hypothesis has enough initial weight in the system, and it buys a pattern of propositions which end up (causally) manipulating the human trainer to reinforce that pattern of propositions, such a hypothesis can tend to gain influence in the system. What I’m doing here is “splitting off” this problem from corrigibility, in some sense: this is an inner-optimizer problem. In order for this approach to corrigibility to be safe, the trainer needs to provide feedback against such inner-optimizers.

(Again, this is unlike the RL setting: in RL, hypotheses have a uniform incentive to get reward. For systems with complete feedback, different hypotheses are competing for different kinds of positive feedback. Still, this self-reinforcing behavior needs to be discouraged by the trainer.)

This is not by any means a sufficient safety condition, since so much depends on the trainer being able to provide feedback against manipulative hypotheses, and to train the system to have a robust concept of legitimate vs illegitimate feedback.

Instead, the argument is that this is a necessary safety condition in some sense. Systems with incomplete feedback will always have undesirable (malign) hypotheses which cannot be ruled out by feedback. For RL, this includes wireheading hypotheses (hypotheses which predict high reward from taking over control of the reinforcement signal) and human-manipulation hypotheses (hypotheses which predict high reward from manipulating humans to give high reward). For more exotic systems, this includes the “human simulator” failure mode which Paul Christiano detailed in the ELK report.
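The claim that incomplete feedback leaves some malign hypotheses impossible to rule out can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: hypotheses are just truth assignments, feedback is a crude consistency filter rather than Bayesian or logical-induction updating, and the proposition names are invented.

```python
from typing import Dict, List, Set

Hypothesis = Dict[str, bool]  # proposition -> truth value the hypothesis asserts

# Two hypothetical hypotheses that agree on everything observable
# but disagree on a latent proposition about what is actually good.
intended: Hypothesis = {
    "reward_is_high": True,
    "seizing_reward_signal_is_good": False,
}
wirehead: Hypothesis = {
    "reward_is_high": True,
    "seizing_reward_signal_is_good": True,
}


def surviving(hypotheses: List[Hypothesis],
              feedback: Dict[str, bool],
              allowed: Set[str]) -> List[Hypothesis]:
    """Keep hypotheses consistent with every piece of feedback the interface accepts."""
    # Feedback on propositions outside `allowed` is silently dropped.
    usable = {p: v for p, v in feedback.items() if p in allowed}
    return [h for h in hypotheses if all(h[p] == v for p, v in usable.items())]


# What the trainer would like to say.
feedback = {"reward_is_high": True, "seizing_reward_signal_is_good": False}

observable = {"reward_is_high"}   # partial feedback: only observables accept feedback
everything = set(intended)        # complete feedback: every proposition accepts feedback

partial = surviving([intended, wirehead], feedback, observable)
complete = surviving([intended, wirehead], feedback, everything)
print(len(partial), len(complete))  # 2 1
```

Under the partial interface both hypotheses survive, because the only place they differ is a proposition the trainer cannot touch. Under the complete interface the trainer can give feedback on that proposition directly, so the wireheading hypothesis is removed.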

Note that this notion of corrigibility applies to both agentic and nonagentic systems. The AI system could be trained to act agentically or otherwise.

Two open technical questions wrt this:

  • What learning-theoretic properties can we guarantee for systems with complete feedback? In something like Solomonoff Induction, we get good learning-theoretic properties on the observable bits by virtue of the structured prior we’re able to build out of the latent bits. The “complete feedback” idea relies on getting good learning-theoretic properties with respect to everything. I think a modification of the Logical Induction Criterion will work here.

  • Can we simultaneously prevent a self-modification incentive (where the system self-modifies to ignore future feedback which it considers corrupt IE illegitimate—this would be very bad in cases where the system is wrong about legitimacy) while also avoiding a human-manipulation incentive (counting manipulation of humans as a form of corrupt feedback)?


