An established case for tractability is the natural abstraction hypothesis. According to it, efficient abstractions are (at least to a significant extent) a feature of the territory, not the map. Thus, we should expect different AI models to converge towards the same concepts, and those concepts should also make sense to us: either because we’re already using them (if the AI is trained on a domain we understand well), or because they’re the same abstractions we’d arrive at ourselves (if it’s a novel domain).
Even believing in a relatively strong version of the natural abstraction hypothesis doesn’t (on its own) imply that we should be able to understand all concepts the AI uses. Just the ones which:

- have natural abstractions,
- that the AI faithfully learns (rather than devoting insufficient capacity and falling short of the natural abstraction),
- and that humans can understand.

These three properties seem reasonably likely to hold in practice for common concepts like ‘trees’ or ‘dogs’.