I do feel that just having humans in the loop is not a complete solution, though. Even if humans look at the process, algorithmic foom could be really, really fast, especially if it is purposely being used to augment the AGI's abilities.
Without a strong reason to believe our alignment scheme will be strong enough to support the capability gain (or that the AGI won't recklessly improve itself), I would avoid letting the AGI look at itself altogether. Just make it illegal for AGI labs to use AGIs to inspect themselves. Just don't do it.
Not today, but probably soon enough. We still need the interpretability work for safety; we just don't know how much of it will generalize to capabilities.
I would have loved it if the paper had used something narrower than GPT to automate interpretability, but alas. To make sure I am not misunderstood: I think it's good work that we need, but it does point in a dangerous direction.