This is great. Big upvote for actually looking for helpful and realistic routes to alignment. We need more of that if we’re going to survive.
I think this labeling technique could be really useful for internal and external monitoring of a complex tree of thoughts in a language model cognitive architecture, and I’m adding it to the stack of alignment techniques that can be applied to such agents. I describe that stack in Internal independent review for language model agent alignment. I personally think this is the most likely route to first AGI, and the most likely route to successful alignment, since the alignment tax is very low and success seems at least plausible.
I find it highly plausible (but far from certain) that language model cognitive architectures (AKA language model agents) will achieve real goal-directed agency, self-awareness, and recursive self-improvement (but not foom).
This also seems like the shortest-timeline route to AGI and alignment, so considering this route seems a bit more urgent than longer-timeline routes.
Absolutely. While the suggestion in the post is just about controlling LLMs, it is obviously helpful to any A(G)I approach that uses LLMs, including the various scaffolding approaches. Being able to be a lot more sure of your LLM-simulated agent’s emotions/motivations/behavior/education level/etc. is basically always helpful.
Yes. Particularly if you’re already performing an internal review for helpfulness and accuracy. I’ll think more about how those labels could be applied, and include this, with credit to you, in my next post on aligning language model agents.
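As a rough illustration of what I have in mind (a hypothetical sketch, not the post's actual method; the Thought structure, label names, and thresholds here are my own assumptions), labels attached to each node of an agent's tree of thoughts could feed an internal review step roughly like this:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: each node in the agent's tree of thoughts carries
# labels (e.g. scores from a labeled/conditioned LLM) that an internal
# review step can inspect before the thought is expanded or acted on.

@dataclass
class Thought:
    text: str
    labels: dict            # e.g. {"helpful": 0.9, "accurate": 0.8}
    children: list = field(default_factory=list)

def review(thought: Thought, thresholds: dict) -> list:
    """Return all thoughts in the tree whose labels fall below the
    acceptable thresholds, for internal veto or external monitoring."""
    flagged = []
    if any(thought.labels.get(k, 0.0) < v for k, v in thresholds.items()):
        flagged.append(thought)
    for child in thought.children:
        flagged.extend(review(child, thresholds))
    return flagged

# Example: flag any thought scored as insufficiently helpful or accurate.
root = Thought("Plan the user's request", {"helpful": 0.95, "accurate": 0.9})
root.children.append(Thought("Fabricate a citation", {"helpful": 0.4, "accurate": 0.1}))
print([t.text for t in review(root, {"helpful": 0.7, "accurate": 0.7})])
```

The same label checks could run as an external monitor over the agent's logged thoughts rather than inline, depending on how much alignment tax is acceptable.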
Much appreciated! Also, I’d love to hear if anyone starts working on stuff along these lines (especially if they’re interested in working together).