This post does not explicitly mention the difference between strong and weak HCH, but it links to meta-execution, which is trying to implement strong HCH.
In strong HCH, subagents don’t disappear after having been consulted once. Instead, whenever a human consults a subtree, she can remember the ID of the subtree and choose to consult it again later. (She can even send messages that contain pointers to other agents, but my concern already arises if we don’t include that change.)
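To make the contrast concrete, here is a minimal toy sketch in Python. All the names (Subagent, spawn, strong_consult, registry) are purely illustrative and not taken from the post: in weak HCH a subagent is consulted once and thrown away, while in strong HCH the human keeps an ID and can return to the same subtree with its state intact.

```python
# Toy contrast between weak and strong HCH consultation.
# Names and structure are illustrative only, not anything from the original post.

class Subagent:
    def __init__(self):
        self.memory = []  # persists only if we keep the object around

    def ask(self, question):
        self.memory.append(question)
        return f"answer({question}) after {len(self.memory)} queries to this subtree"

# Weak HCH: a subagent is created for one question and then discarded.
def weak_consult(question):
    return Subagent().ask(question)

# Strong HCH: subagents persist; the human keeps an ID and can re-consult
# the same subtree later (and could pass that ID around inside messages).
registry = {}

def spawn():
    agent_id = len(registry)
    registry[agent_id] = Subagent()
    return agent_id

def strong_consult(agent_id, question):
    return registry[agent_id].ask(question)

sub = spawn()
strong_consult(sub, "step 1")
strong_consult(sub, "step 2: continue from where you left off")  # same subtree, state retained
```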
This difference seems like an extremely big deal. In particular, it makes it unclear why we would even need a human in the process. If A_k is already stronger than H, then the system [H with access to A_k] may not benefit from having H manage anything. Instead, H could just repeatedly query the same copy of A_k to make progress on the input question.
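Here is a toy sketch of that degenerate strategy, again with purely illustrative names and a fake stopping rule: H spawns a single persistent copy of A_k and just iterates with it, contributing no decomposition of her own.

```python
# Toy sketch of the worry: with a persistent subagent, H need not decompose
# anything. She spawns one copy of A_k and iterates with it until it is done.
# A_k is modelled as a stateful stand-in oracle; the stopping rule is fake.

class Ak:
    def __init__(self):
        self.workspace = []

    def ask(self, prompt):
        self.workspace.append(prompt)
        # Pretend the question is "solved" after enough rounds of iteration.
        done = len(self.workspace) >= 3
        return ("FINAL: answer", True) if done else ("partial progress", False)

def answer_without_decomposition(question, max_rounds=100):
    ak = Ak()  # one persistent subtree, never discarded
    state = question
    for _ in range(max_rounds):
        state, done = ak.ask(f"make progress on: {state}")
        if done:
            return state  # H contributed no decomposition at all
    return state

print(answer_without_decomposition("hard question"))
```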
In this case, there is no longer a reason to model this process via Factored Cognition. In IDA, only the highest level of Factored Cognition happens explicitly (the human may actually divide a question), and every other step happens implicitly (we ‘speculate’ that the model A_k uses FC because it imitates what [H with access to A_{k−1}] is doing). So if the highest level doesn’t use Factored Cognition anymore, then the entire scheme doesn’t use Factored Cognition anymore.
We’re left with agent A_k overseeing agent A_{k+1}, where A_k is allowed to think for much longer than A_{k+1}.
This might also be a scheme that works, but insofar as dumping Factored Cognition altogether is undesirable, it seems like meta-execution needs a mechanism to discourage this behavior. Since H is a human, I guess we could just tell her not to do this? But if we do, in fact, want to discourage this behavior, it seems like the change introduces a spectrum where we trade off security for capability, rather than making a straightforward improvement to the scheme.
Is there something I’m not seeing? Would appreciate input on this.