it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them
I’m not sure where you’re getting that. The thing I described in my last comment did not involve humans analyzing the AI’s plans; it only involved humans labeling YouTube videos.
It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.
we will have no way to determine if its thought assessor generalizes wrongly
I agree, see §14.4.
Ah, sorry, I misunderstood you.