This seems somewhat related to this article: I came across a paper, "Human-AI Shared Control via Policy Dissection", which uses frequency analysis of the neural activations behind an RL policy's behaviours to control the agent's actions. I am wondering whether the same thing can be done with language models. Maybe the same technique could also be useful for finding vectors that do specific things.