Yeah, I do want to add: on this particular paper I actually agree with Yudkowsky that it's probably a small reduction in P(doom), because it focuses a risky operation in a way that moves toward humans being able to check a system. The dangerous thing would be to be hands off; the more you actually use interpretability to put humans in the loop, the more you get the intended benefits of interpretability. If you remove humans from the loop, you remove your influence on the system, and the system rockets ahead of you and blows up the world, or if not the world, at least your lab.
I do feel that just having humans in the loop is not a complete solution, though. Even if humans look at the process, algorithmic foom could be really, really fast, especially if it is purposely being used to augment the AGI's abilities.
Without a strong reason to believe our alignment scheme will be strong enough to survive the capability gain (or that the AGI won't recklessly improve itself), I would avoid letting the AGI look at itself altogether. Just make it illegal for AGI labs to use AGIs to inspect themselves. Just don't do it.
Not today. But probably soon enough. We still need the interpretability for safety, but we don’t know how much of that work will generalize to capabilities.
I would have loved it if the paper had used something narrower than GPT to automate interpretability, but alas. To make sure I am not misunderstood: I think it's good work that we need, but it does point in a dangerous direction.
Cheers. Your comments actually allowed me to fully realize where the danger lies and to expand a little on the consequences.
Thanks again for the feedback