though having access to [their weights and activations] doesn’t immediately mean being able to understand them
On further consideration, I don’t think this holds much weight. I was reasoning mainly by analogy: humans, even given access to the structure and activations of their own brains, would have a very hard time finding the neural correlates of any particular output. And maybe this generalizes at least partway toward “no brain is able to understand itself in its full complexity and detail.”
But on the other hand, we should assume models have access to all published work on mechanistic-interpretability (MI) lie detection, and they may have a much easier time than humans at, e.g., running statistics across all of their own nodes to search for correlations (a rough sketch of what that could look like is below).
I wasn’t really accounting for that latter point in my mental model. So in retrospect, my position does depend on not giving models access to their own weights & activations.
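To make the “running statistics against all of their nodes” point concrete, here is a minimal sketch of the brute-force version of that search: correlating each hidden unit’s activation with a deception label across many prompts. Everything here is illustrative and synthetic, not from any real interpretability library; the data, the planted signal, and all variable names are assumptions for the sake of the example.

```python
import numpy as np

# Hypothetical setup: cached activations for n_prompts inputs at one
# layer of a model with n_units hidden units, plus a binary label per
# prompt marking whether the model's answer was deceptive.
rng = np.random.default_rng(0)
n_prompts, n_units = 1000, 4096
activations = rng.normal(size=(n_prompts, n_units))  # stand-in data
is_deceptive = rng.integers(0, 2, size=n_prompts)    # stand-in labels

# Plant a weak signal in one unit so the search has something to find.
activations[:, 123] += 0.5 * is_deceptive

# Per-unit Pearson correlation between activation and the deception
# label: the brute-force "statistics against all nodes" search.
act_centered = activations - activations.mean(axis=0)
lab_centered = is_deceptive - is_deceptive.mean()
corr = (act_centered * lab_centered[:, None]).mean(axis=0) / (
    activations.std(axis=0) * is_deceptive.std()
)

# Rank units by absolute correlation; the top hits are candidate
# "lie-correlated" directions worth inspecting further.
for unit in np.argsort(-np.abs(corr))[:5]:
    print(f"unit {unit}: r = {corr[unit]:+.3f}")
```

A model with raw access to its own activations could run exactly this kind of sweep over every layer at once, which is the asymmetry with the human-brain analogy: the search is trivially parallelizable for it in a way introspection never is for us.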