Our pragmatic, grassroots, hands-on approach to AI alignment starts with applying psychotherapeutic techniques to large language model fine-tuning.
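As a purely illustrative sketch (not the project's actual pipeline), one minimal reading of this is supervised fine-tuning on non-directive, Rogerian-style dialogue. The model name, toy transcripts, and hyperparameters below are placeholder assumptions.

```python
# Hypothetical sketch: supervised fine-tuning of a causal LM on
# client-centered (Rogerian) dialogue. Everything here is illustrative,
# not the project's actual data or configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy stand-ins for non-directive, reflective-listening exchanges.
dialogues = [
    "Client: I keep second-guessing every answer I give.\n"
    "Therapist: It sounds like you feel unsure whether your answers are good enough.",
    "Client: I don't know what I'm supposed to want here.\n"
    "Therapist: You're noticing that your own goals feel unclear to you right now.",
]

def tokenize(example):
    enc = tokenizer(example["text"], truncation=True, max_length=128, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # standard causal-LM objective
    return enc

train_ds = Dataset.from_dict({"text": dialogues}).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rogerian-ft", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_ds,
)
trainer.train()
```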
I guess I’m glad someone’s trying this approach, but I also think it’s unlikely to pan out. Mainly, I expect the techniques you’re looking to apply to have a lot of hidden assumptions about human brain architecture and typical human experiences, which won’t be reflected in the much-more-alien stuff that’s present inside an LLM’s weights. The results might still be interesting, but I think your highest priority will be to avoid fooling yourself into treating LLMs as more human than they are, which is a problem that people are running into.
These are the points I need to hear as a researcher approaching alignment from an alien field! One reason I think it’s worth trying is that client-centered therapy inherently preserves agency on the part of the model...