Oh, this is a great complication: you highlight why mental moves like "reflection" can open up loopholes and complications. Regardless of whether it's a necessary or a less central part of research, as you suggest, self-modifying goal-finding is always a potential issue in AI alignment. I appreciate the notion of a "noticeable lack." This kind of thinking pushes us to take stock of how, and whether, AIs are actually doing useful alignment research under benign-seeming training setups.

Is it *noticeably* lacking, or is it clearing an expected bar? The nuance here is less about quantity or quality than about expectation: *do we expect it to work this well?* Or do we expect that more extreme directions will need to be managed? This is the kind of expectation-setting that I think builds stronger theory. Great food for thought in your reply, too. Considering the differences between your models and others' is super important! Have you considered trying to synthesize Nate's viewpoint with your own? That could be powerful for both expectations and approaches.