The first part of this post seems to rest on the assumption that any subagents will have long-term goals that they are trying to optimize, which can cause competition between subagents. It seems possible to instead pursue subgoals within a limited amount of time, with a restricted action space, or using only “normal” strategies. When I write a post, I certainly am treating it as a subgoal: I don’t typically think about how the post contributes to my overall goals while writing it, I just aim to write a good post. Yet I don’t recheck every word or compress each sentence to be maximally informative. Perhaps that’s because such a strategy would be new to me, and so I evaluate it against my overall goals instead of just the “good post” goal; perhaps it’s because my goal also has time constraints embedded in it; or perhaps it’s something else. In any case, it seems wrong to think of post-writing-Rohin as optimizing long-term preferences for writing as good a post as possible.
This agent design treats the system’s epistemic and instrumental subsystems as discrete agents with goals of their own, which is not particularly realistic.
Nitpick: I think the bigger issue is that the epistemic subsystem doesn’t get to observe the actions that the agent is taking; observing them would be the easiest way to distinguish delusion-box behavior from good behavior. (I call this a nitpick because it isn’t about the general point that if you have multiple subagents with different goals, they may compete.)
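To make that concrete, here is a minimal toy sketch, entirely my own construction rather than anything from the post: an epistemic subsystem that only sees the observation stream cannot tell a delusion box apart from a genuinely good outcome, while one that also sees the chosen action can flag it immediately. All names here (`Step`, `epistemic_subsystem_obs_only`, and so on) are hypothetical.

```python
# Toy sketch: a world where the agent either acts normally ("work") or installs
# a "delusion box" that overwrites its own perception channel with whatever
# looks best. In both cases the observation stream looks identical.

from dataclasses import dataclass


@dataclass
class Step:
    action: str        # "work" or "install_delusion_box"
    observation: str   # what the agent perceives afterwards


def perceived_observation(action: str) -> str:
    # The delusion box rewrites the perception channel, so the agent
    # perceives a good outcome either way.
    return "looks_good"


def epistemic_subsystem_obs_only(step: Step) -> str:
    # Sees only the observation stream: delusion-box behavior and genuinely
    # good behavior are indistinguishable.
    return "world seems fine" if step.observation == "looks_good" else "world seems bad"


def epistemic_subsystem_with_actions(step: Step) -> str:
    # Also sees the action, so it can flag the delusion box directly.
    if step.action == "install_delusion_box":
        return "delusion box detected"
    return "world seems fine" if step.observation == "looks_good" else "world seems bad"


for action in ["work", "install_delusion_box"]:
    step = Step(action=action, observation=perceived_observation(action))
    print(action, "->", epistemic_subsystem_obs_only(step), "|",
          epistemic_subsystem_with_actions(step))
```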