I expect the “first AGI” to be reasonably modelled as a composite structure, in a similar way to how a single human mind can be modelled as composite.
The “top” layer in the hierarchical agency sense isn’t necessarily the more powerful or agenty one: the superagent/subagent direction is completely independent of relative power. For example, you can think about a Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
I think the nature of the problem here is somewhat different from typical research questions in e.g. psychology. As discussed in the text, one place where having a mathematical theory of hierarchical agency would help is in making us better at specifications of value evolution. I think this is the case because such a specification would be more robust to scaling of intelligence. For example, compare a learning objective
a. specified as minimizing KL divergence between some distributions
b. specified in natural language as “you should adjust the model so the things read are less surprising and unexpected”
You can use objective b. plus RL to train/fine-tune LLMs, exactly like RLAIF is used to train for “honesty”, for example.
A possible problem with b. is that the implicit representations of natural-language concepts like honesty or surprise are likely not very stable: if you trained a model mostly on RL plus however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/… works mostly because it is relatively shallow: more compute goes into pre-training.
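To make the contrast concrete, here is a minimal sketch (purely illustrative; the names `kl_objective`, `nl_reward`, and the `judge` callable are made up for this example, not anything from the post or any particular library’s API). Objective a. is fully pinned down by a formula over tensors, while objective b. bottoms out in whatever a judge model’s implicit concept of “surprising” happens to be:

```python
import torch
import torch.nn.functional as F

# (a) Mathematical specification: minimize KL divergence between a target
# distribution and the model's predictive distribution. Nothing about this
# objective shifts as the optimized model gets smarter.
def kl_objective(model_logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    log_q = F.log_softmax(model_logits, dim=-1)                   # model distribution, log-space
    return F.kl_div(log_q, target_probs, reduction="batchmean")   # KL(target || model)

# (b) Natural-language specification: "adjust the model so the things read are
# less surprising and unexpected", operationalized RLAIF-style by asking a judge
# model to score outputs. The effective objective now depends on the judge's
# implicit concept of "surprising", which can drift or be exploited under
# heavy optimization.
def nl_reward(judge, text: str) -> float:
    prompt = (
        "On a scale of 0 to 1, how surprising or unexpected is the following "
        f"text? Answer with a single number.\n\n{text}"
    )
    score = float(judge(prompt))   # hypothetical judge call returning a number
    return 1.0 - score             # reward low-surprise text

if __name__ == "__main__":
    # (a) is well-defined given nothing but tensors:
    logits = torch.randn(4, 10)
    target = F.softmax(torch.randn(4, 10), dim=-1)
    print("KL objective:", kl_objective(logits, target).item())

    # (b) needs a judge whose concept of "surprising" we do not control precisely:
    dummy_judge = lambda prompt: 0.3   # stand-in for an LLM judge
    print("NL reward:", nl_reward(dummy_judge, "The sun rose in the east."))
```

The point of the sketch is just that under strong optimization pressure, b.’s effective objective is whatever the judge’s internal representation of the words happens to be, which is why the robustness-to-scaling concern applies to b. but not to a.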
Ah. Now I understand why you’re going this direction.
I think a single human mind is modeled very poorly as a composite of multiple agents.
This notion is far more popular with computer scientists than with neuroscientists. We’ve known about the idea since Minsky and have thought about it; it just mostly doesn’t seem to be the case.
Sure, you can model it that way, but it’s not doing much useful work.
I expect the same of our first AGIs, as foundation model agents. They will have separate components, but those will not be well-modeled as agents. And they will have different capabilities and different tendencies, but neither of those is particularly agent-y either.
I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.