Interesting that you think it would work most of the time. I know you're aware of all the major arguments for alignment being impossibly hard. I'm certainly not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents is viable enough to shift the reasonable distribution of estimates of alignment difficulty...
Thanks for the references; I'm reading them now.