Are you working on this because you expect our first AGIs to be such hierarchical systems of subagents?
Or because you expect systems in which AGIs supervise subagents?
In either case, isn’t the key question still whether the agent(s) at the top of the hierarchy are aligned?
In other fields that study complex systems (economics, politics and nations, and notably psychology), mathematical formulations address sub-parts of the systems but are typically not relied on for an overall analysis. Instead, understanding a complex system requires integrating a number of tools for understanding different parts, levels, and aspects of the system.
I worry that the cultural foundations of AI alignment bias the people most serious about it to focus excessively on mathematical/formal approaches.
I expect the “first AGI” to be reasonably modelled as a composite structure, in much the same way that a single human mind can be modelled as composite.
The “top” layer in the hierarchical-agency sense isn’t necessarily the more powerful or agenty one: the superagent/subagent direction is completely independent of relative power. For example, you can think about a Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
I think the nature of the problem here is somewhat different from typical research questions in, e.g., psychology. As discussed in the text, one place where having a mathematical theory of hierarchical agency would help is in making us better at specifying value evolution. I think this is the case because a formal specification would be more robust to scaling of intelligence. For example, compare a learning objective

a. specified as minimizing the KL divergence between some distributions

b. specified in natural language as “you should adjust the model so the things read are less surprising and unexpected”

You can use objective b. + RL to train/finetune LLMs, exactly as RLAIF is used to train “honesty”, for example. A possible problem with b. is that the implicit representations of natural-language concepts like honesty or surprise are likely not very stable: if you trained a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/… works mostly because it is relatively shallow: more compute goes into pre-training.
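To make the contrast concrete, here is a minimal sketch of the two objectives, assuming a PyTorch-style setup; the `judge_model.score` interface is hypothetical, just standing in for an RLAIF-style judge:

```python
import torch.nn.functional as F

# Objective (a): a formal specification.
# Minimizing KL(P_data || Q_model) against a fixed empirical distribution
# reduces to minimizing the model's cross-entropy on observed tokens,
# so the objective pins down the same quantity at any capability level.
def objective_a(logits, target_token_ids):
    # logits: (batch, vocab) model predictions; target_token_ids: (batch,)
    return F.cross_entropy(logits, target_token_ids)

# Objective (b): a natural-language specification, operationalized by asking
# a judge model for a scalar reward and optimizing it with RL.
# `judge_model.score` is a hypothetical interface, not a real API: the point
# is that what "surprising" means lives inside the judge's implicit
# representations, so the objective can drift under strong optimization.
def objective_b_reward(judge_model, text):
    return judge_model.score(
        text, criterion="the things read are less surprising and unexpected"
    )
```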
Ah. Now I understand why you’re going this direction.
I think a single human mind is modeled very poorly as a composite of multiple agents.
This notion is far more popular with computer scientists than with neuroscientists. We’ve known about the idea since Minsky and have thought about it; it just doesn’t seem to mostly be the case.
Sure, you can model a mind that way, but the framing isn’t doing much useful work.
I expect the same of our first AGIs as foundation model agents. They will have separate components, but those will not be well-modeled as agents. And they will have different capabilities and different tendencies, but neither of those is particularly agent-y either.
I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.