Model-based RL is more naturally interpretable than end-to-end-trained systems because there’s a data structure called “world-model” and a data structure called “value function”, and maybe each of those data structures is individually inscrutable, but that’s still a step in the right direction compared to having them mixed together into just one data structure. For example, it’s central to this proposal.
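To make that distinction concrete, here's a minimal toy sketch (my own illustration, not anything from the proposal, with made-up class names): a tabular model-based agent whose "world-model" and "value function" are two separately-inspectable data structures, versus an end-to-end agent where everything is fused into one mapping.

```python
# Toy sketch: the interpretability point is just that you can print
# `world_model` and `value_function` separately and ask different questions
# of each, which you can't do when they're entangled in one structure.

import random
from collections import defaultdict

class ModelBasedAgent:
    def __init__(self, actions, gamma=0.9, lr=0.1):
        self.actions = actions
        self.gamma = gamma
        self.lr = lr
        # World-model: "what happens if I do a in state s?"
        # Keyed by (state, action) -> (predicted next state, predicted reward).
        self.world_model = {}
        # Value function: "how good is state s?"
        self.value_function = defaultdict(float)

    def observe(self, state, action, reward, next_state):
        # The two structures are updated separately.
        self.world_model[(state, action)] = (next_state, reward)
        target = reward + self.gamma * self.value_function[next_state]
        self.value_function[state] += self.lr * (target - self.value_function[state])

    def act(self, state):
        # Planning: roll each action through the world-model, score the
        # predicted outcome with the value function.
        def score(action):
            if (state, action) not in self.world_model:
                return 0.0  # neutral default for untried actions
            next_state, reward = self.world_model[(state, action)]
            return reward + self.gamma * self.value_function[next_state]
        return max(self.actions, key=score)

class EndToEndAgent:
    """Contrast: one opaque state -> action mapping, with 'what the world
    does' and 'what I want' mixed into the same parameters."""
    def __init__(self, actions):
        self.policy = defaultdict(lambda: random.choice(actions))

    def act(self, state):
        return self.policy[state]
```

Even if each of those tables were replaced by an inscrutable learned network, you'd still know *which* network is supposed to be predicting the world and which one is supposed to be encoding what the agent wants.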
More generally, we don’t really know much about what other kinds of hooks and modularity can be put into a realistic AI. You say “probably not possible” but I don’t think the “probably” is warranted. Evolution wasn’t going for human-interpretability, so we just don’t know either way. I would have said “We should be open to the possibility that giant inscrutable matrices are the least-bad of all possible worlds” or something.
If Architecture A can represent the same capabilities as Architecture B with fewer unlabeled nodes (and maybe a richer space of relationships between nodes), then that’s a step in the right direction.
I think you’re saying that “asynchronous” neural networks (like in biology) are more inscrutable than “synchronous” matrix multiplication, but I don’t think that claim is based on much beyond your intuitions, and your intuitions are biased by the fact that you’ve never tried to interpret an “asynchronous” neural network, which in turn is closely tied to the fact that nobody knows how to program one that works. Actually, my hunch is that the asynchronicity is an implementation detail that could easily be abstracted away; a toy sketch of that idea is below.
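Here's the toy sketch (my own illustration, with a made-up five-unit network): the same feedforward graph of units is evaluated event-driven, where each unit fires whenever its inputs happen to be ready, in a random order, and then again as one synchronous topological sweep. The computed function is identical either way, so the asynchronous scheduling never shows up in the object you'd actually need to interpret.

```python
# Toy sketch: asynchronous (event-driven, random firing order) vs. synchronous
# (fixed topological sweep) evaluation of the same feedforward network.

import math
import random

# weights[(src, dst)] = connection strength; "a" and "b" receive external input.
weights = {("a", "c"): 0.5, ("b", "c"): -1.2, ("a", "d"): 0.8,
           ("c", "e"): 1.0, ("d", "e"): 0.3}
units = ["a", "b", "c", "d", "e"]          # listed in topological order
external = {"a": 1.0, "b": -0.5}           # external drive to "sensory" units

def preds(u):
    return [s for (s, d) in weights if d == u]

def activation(u, act):
    total = external.get(u, 0.0) + sum(weights[(s, u)] * act[s] for s in preds(u))
    return math.tanh(total)

def run_async(seed):
    # Event-driven: repeatedly pick a random unit whose inputs are all known.
    rng = random.Random(seed)
    act = {}
    while len(act) < len(units):
        ready = [u for u in units if u not in act and all(s in act for s in preds(u))]
        u = rng.choice(ready)
        act[u] = activation(u, act)
    return act

def run_sync():
    # Synchronous abstraction: one fixed sweep in topological order.
    act = {}
    for u in units:
        act[u] = activation(u, act)
    return act

if __name__ == "__main__":
    for seed in range(5):
        assert run_async(seed) == run_sync()
    print("firing order never changes the computed function:", run_sync())
```

Obviously real biological asynchrony also involves recurrence and timing, so this is only meant to gesture at how a scheduling detail can be factored out of the thing you have to interpret.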
If we think of interpretability as a UI into the trained model, then the problem is really to co-design a learning algorithm & interpretability approach that work together and keep working as the system scales up to a sufficiently intelligent AI. I think you would describe success at that design problem as “Ha! The inscrutable matrix approach worked after all!”, and that Eliezer would describe success at that same design problem as “Ha! We figured out a way to build AI without giant inscrutable matrices!” (The matrices are giant but not inscrutable.)