Excellent! I’ll digest that and the steering post. I’m also addressing steering system alignment, as something of a shortcut, in "Human preferences as RL critic values - implications for alignment". I think it’s really similar to your approach, as I’m also thinking about a MuZero-type RL system. The map here is the executive function system in an LMCA, but it deserves to be spelled out more thoroughly, as you’ve done in your corrigibility tax post.