Thanks for the post! Frankly, this is a sub-field of alignment that I have not been following closely, so it is very useful to have a high-level comparative overview.
I have a question about your thoughts on what ‘myopia verification’ means in practice.
Do you see ‘myopia’ as a single well-defined mathematical property that might be mechanically verified by an algorithm that tests the agent? Or is it a more general bucket term that means ‘bad in a particular way’, where a human might conclude, based on some gut feeling when seeing the output of a transparency tool, that the agent might not be sufficiently myopic?
What informs this question is that I can’t really tell from re-reading your Towards a mechanistic understanding of corrigibility post and the comments there. So I am wondering about your latest thinking.
Well, I don’t think we really know the answer to that question right now. My hope is that myopia will turn out to be a pretty easy property to verify; certainly my guess is that it’ll be easier to verify than non-deception. Until we get better transparency tools, a better understanding of what algorithms our models are actually implementing, and better definitions of myopia that make sense in that context, however, we don’t really know how easy it will be to verify. Maybe it can be done mechanically, maybe it’ll require a human; we still really just don’t know.
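To give a flavor of what the ‘mechanical’ end of that spectrum could look like, here is a purely hypothetical toy sketch, under one very crude behavioral operationalization of myopia: the agent’s current action distribution should be invariant to rewards that lie beyond the current episode. Everything here (the `policy` interface, the perturbation scheme, the tolerance) is an illustrative assumption I’m making up for this comment, not a definition I’d endorse:

```python
import numpy as np


def behavioral_myopia_probe(agent, observations, rng, n_trials=100, tol=1e-6):
    """Toy probe: does the agent's current action distribution change when we
    perturb rewards it could only care about non-myopically (here, rewards in
    future episodes)? Returns True if behavior is invariant across all trials."""
    for obs in observations:
        baseline = agent.policy(obs, future_rewards=np.zeros(3))
        for _ in range(n_trials):
            perturbed = rng.normal(size=3)  # hypothetical future-episode rewards
            dist = agent.policy(obs, future_rewards=perturbed)
            if np.abs(dist - baseline).max() > tol:
                return False  # current behavior depends on cross-episode rewards
    return True


def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


class MyopicAgent:
    """Ignores future_rewards entirely, so it passes the probe."""
    def policy(self, obs, future_rewards):
        return softmax(np.array([obs, -obs]))


class NonMyopicAgent:
    """Lets cross-episode rewards leak into its current policy, so it fails."""
    def policy(self, obs, future_rewards):
        return softmax(np.array([obs + future_rewards.sum(), -obs]))


rng = np.random.default_rng(0)
observations = [0.0, 1.0, -0.5]
print(behavioral_myopia_probe(MyopicAgent(), observations, rng))     # True
print(behavioral_myopia_probe(NonMyopicAgent(), observations, rng))  # False
```

Of course, a black-box probe like this can only ever certify behavior on the inputs you happen to test; a deceptive model could simply behave myopically whenever it detects it’s being probed. That’s exactly why I expect better transparency tools and a mechanistic understanding of what algorithm the model is implementing to be load-bearing here, rather than behavioral tests alone.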