TristanTrim comments on Internal independent review for language model agent alignment

TristanTrim 1 Dec 2024 10:19 UTC
3 points
0

The organized mind recoils. This is not an aesthetically appealing alignment approach.

Praise Eris!

No, but seriously, I like this plan, with the caveat that we really need to understand RSI and what is required to prevent it first. I also think the temptation to let these systems open high-bandwidth channels to modalities other than language is going to be really strong. If we go forward with this, we need a good plan to resist that temptation and a good way to know when not to resist it.

Also, I’d like it if this was thought of as a step on the path to cyborgism/true value alignment, and not as a true ASI alignment plan on its own.
- Seth Herd 1 Dec 2024 22:31 UTC
  2 points
  0
  Parent
  On RSI, see The alignment stability problem and my response to your comment on Instruction-following AGI...
  WRT true value alignment, I agree that this is just a stepping stone to that better sort of alignment. See Intent alignment as a stepping-stone to value alignment.
  I agree that including non-linguistic channels is going to be a strong temptation. Language does nicely summarize most of our really abstract thought, so I don’t think those channels are necessary. But there are many training practices that would destroy the legible chain of thought needed for external review. See the case for CoT unfaithfulness is overstated for the inverse.
  Legible CoT is actually not necessary for internal action review. You do need to be able to parse what the action is, so that another model can predict and review its likely consequences. And it works far better to review things at a plan level rather than action-by-action, so the legible CoT is very useful. Still, if the system is trained to respond to prompts, you could use the scripted internal review no matter how opaque the internal representations had become. You couldn’t really make that review independent, though, without a way to summarize the plan so it could be passed to another model, as you can with language.
  BTW your comment accidentally was formatted as a quote along with the bit you meant to quote from the post. Correcting that would make it easier for others to parse, but it was clear to me.
  - TristanTrim 2 Dec 2024 0:23 UTC
    1 point
    0
    Parent
    WRT formatting, thanks, I didn’t realise the markdown needs two newlines for a paragraph break.
    
    I think CoT and its dynamics, as they relate to review and RSI, are very interesting & useful to explore.
    
    Looking forward to reading the stepping stone and stability posts you linked. : )