I agree that including non-linguistic channels is going to be a strong temptation. Language nicely summarizes most of our really abstract thought, so I don't think those channels are necessary. But there are many training practices that would destroy the legible chain of thought needed for external review. See The case for CoT unfaithfulness is overstated for the case in the other direction.
Legible CoT is actually not necessary for internal action review. You do need to be able to parse what the action is, so that another model can predict and review its likely consequences. And it works far better to review at the plan level rather than action-by-action, so legible CoT is very useful. If the system is still trained to respond to prompts, you could use a scripted internal review no matter how opaque the internal representations had become. But you couldn't really make that review independent without a way to summarize the plan so it could be passed to another model, the way language allows.
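For concreteness, here's a minimal sketch of the shape of the scripted review I have in mind. Everything in it is hypothetical: `call_model`, `execute`, and `escalate` are stand-ins, not any real API. The point is just the structure: the planner states its plan in language, and a separate reviewer model must approve that summary before anything runs.

```python
# A minimal sketch of scripted plan-level review, assuming two separately
# prompted models: a planner and an independent reviewer. All functions
# here are hypothetical stand-ins, not a real inference API.

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real model call; canned responses so the sketch
    # runs end to end. Replace with your actual inference API.
    if model == "reviewer":
        return "APPROVE"
    return "Summarize the report, then email it to the requester."

def propose_plan(task: str) -> str:
    # The planner states its intended plan in natural language. This
    # legible summary is what makes independent review possible, even
    # if the planner's internal representations are opaque.
    return call_model("planner", f"Task: {task}\nState your plan in plain language.")

def review_plan(plan_summary: str) -> bool:
    # A separate model predicts the plan's likely consequences and
    # approves or rejects it. Reviewing at the plan level is cheaper
    # and more informative than reviewing action-by-action.
    verdict = call_model(
        "reviewer",
        f"Plan: {plan_summary}\nPredict the likely consequences. Reply APPROVE or REJECT.",
    )
    return verdict.strip().upper().startswith("APPROVE")

def execute(plan_summary: str) -> None:
    print(f"Executing: {plan_summary}")  # placeholder for the actual executor

def escalate(plan_summary: str) -> None:
    print(f"Flagged for human review: {plan_summary}")  # placeholder

def run(task: str) -> None:
    plan = propose_plan(task)
    if review_plan(plan):
        execute(plan)
    else:
        escalate(plan)

run("Summarize the quarterly report for the requester.")
```

The script itself keeps working however opaque the planner's internals get; it's the independence of the reviewer that depends on having a language summary to hand off.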
BTW, your comment was accidentally formatted as a quote along with the bit you meant to quote from the post. Correcting that would make it easier for others to parse, but it was clear to me.
On RSI, see The alignment stability problem and my response to your comment on Instruction-following AGI...
WRT true value alignment, I agree that this is just a stepping stone to that better sort of alignment. See Intent alignment as a stepping-stone to value alignment.
WRT formatting, thanks. I didn't realise markdown needs two newlines for a paragraph break.
I think CoT and its dynamics, as they relate to review and RSI, are very interesting & useful to explore.
Looking forward to reading the stepping stone and stability posts you linked. : )