Hmm. In order to avoid goodharting, the composite agent should be structured such that its actions emerge coherently from the agencies of the subagents. Any top-down framework I can think of is a no-go, and looking at how the subagents get their own agencies hints at infinite regress. My brain hurts (in a good way).
The specification the principal gives to the agent does not need to be the principal's actual will, nor what the composite agent ends up doing. To get goodharting there needs to be optimization, and while goal specifications can be used for optimization, involving a specification doesn't necessitate that the optimization happens. For example, if the agent tried to be maximally harmful to the principal, the principal could give a "reverse psychology" goal specification, which is not a representation the principal would use for themselves but gets the desired result when used in that assignment.
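A minimal sketch of that "reverse psychology" case (the names and utilities are illustrative assumptions, not part of the setup): a purely adversarial agent minimizes whatever the specification says, so the principal hands in the negated specification to get what it actually wants.

```python
# Toy model (hypothetical): the agent adversarially minimizes the stated goal,
# so the principal's best move is to specify the opposite of its true preference.

def true_utility(option):
    # What the principal actually cares about (assumed for illustration).
    return {"help": +1, "neutral": 0, "harm": -1}[option]

def adversarial_agent(spec, options):
    # Agent tries to be maximally harmful: it minimizes the specified goal.
    return min(options, key=spec)

options = ["help", "neutral", "harm"]

# Naive specification: hand over the true utility -> agent picks "harm".
print(adversarial_agent(true_utility, options))   # harm

# "Reverse psychology" specification: not what the principal believes,
# but it yields the desired result when plugged into this agent.
reverse_spec = lambda o: -true_utility(o)
print(adversarial_agent(reverse_spec, options))   # help
```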
The tricky thing then is that the principal could hand in an assignment that triggers action vastly different from what the assignment describes, yet the end action turns out to be beneficial to the composite system. That is, the principal goes "go buy me the most expensive hammer", and the agent thinks "well, I could get a million dollar hammer, but I am just going to get a thousand dollar hammer instead of a ten dollar hammer". The agent is influenced, a different principal could have resulted in the agent going for the ten dollar hammer, but the assignment is significantly "unfulfilled". A prompt that would naively read as pointing to the million dollar hammer might almost never result in the million dollar hammer. So even if the shopping-list writer "overdemands" and the shop visitor "underdelivers", the overall system can end up taking reasonable actions (which none of the components specified).
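A toy sketch of the hammer example (all prices and the "common sense" budget are assumptions made up for illustration): the overdemanding assignment filtered through the agent's own sense of reasonableness lands on an outcome neither component specified by itself.

```python
# Hypothetical toy model: the principal's specification points at the most
# expensive option, the agent's common sense caps what it will actually buy,
# and the composite system ends up with a moderate purchase.

hammers = [10, 1_000, 1_000_000]  # prices on offer

def assignment_score(price):
    # "Buy me the most expensive hammer" -- naively points at the 1,000,000 one.
    return price

def agent_choice(score, prices, common_sense_budget):
    # The agent is influenced by the assignment but fills in details itself:
    # it takes the best-scoring option its own sense of reasonableness allows.
    affordable = [p for p in prices if p <= common_sense_budget]
    return max(affordable, key=score)

# Assignment alone points at the 1,000,000 dollar hammer; the agent left to its
# own devices might default to the 10 dollar one; the composite picks 1,000.
print(agent_choice(assignment_score, hammers, common_sense_budget=5_000))  # 1000
```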
Big update toward the principal serving a coordinating function only (alwayshasbeen.png).
Subagents will unavoidably operate under their own agency; any design where their agenda is fully set from above would goodhart by definition. The only scenario where there’s non-goodhart coherence seems to be where there’s some sort of alignment between the principal’s agenda and the agency of the subagents.
ETA: The subagent receives the edict of the principal and fills in the details using its own agency. The resulting actions make sense to the extent the subagent has (and uses) “common sense”.