As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.
Like some other commenters, I think for many tasks it’s probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow “mixed” programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that chooses moves, a search + sum over feature evals is even more interpretable, and so on.
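To make the "search + sum over feature evals" end of that spectrum concrete, here is a minimal sketch. It uses a hypothetical toy game rather than chess (a pile of stones, each move takes 1 to 3, and taking the last stone wins). The feature names and weights are made up for illustration; in a real system the weights would be fitted empirically.

```python
from typing import Callable, Dict, List

# Hypothetical toy game (an assumption of this sketch, not from the post):
# a pile of stones, each move removes 1, 2, or 3, taking the last stone wins.
State = int


def legal_moves(pile: State) -> List[int]:
    return [k for k in (1, 2, 3) if k <= pile]


# Named, human-readable features, scored from the side to move's perspective.
FEATURES: Dict[str, Callable[[State], float]] = {
    "pile_not_multiple_of_4": lambda pile: 1.0 if pile % 4 != 0 else -1.0,
    "can_win_immediately": lambda pile: 1.0 if 1 <= pile <= 3 else 0.0,
}

# Illustrative weights only; chosen so that |evaluate| stays below 1.
WEIGHTS: Dict[str, float] = {
    "pile_not_multiple_of_4": 0.5,
    "can_win_immediately": 0.4,
}


def evaluate(pile: State) -> float:
    """Interpretable leaf evaluation: a weighted sum over named features."""
    return sum(WEIGHTS[name] * feature(pile) for name, feature in FEATURES.items())


def negamax(pile: State, depth: int) -> float:
    """Depth-limited negamax; the score is from the side to move's perspective."""
    if pile == 0:
        return -1.0  # the opponent just took the last stone, so we have lost
    if depth == 0:
        return evaluate(pile)
    return max(-negamax(pile - k, depth - 1) for k in legal_moves(pile))


def best_move(pile: State, depth: int) -> int:
    """Pick the move whose resulting position is worst for the opponent (depth >= 1)."""
    return max(legal_moves(pile), key=lambda k: -negamax(pile - k, depth - 1))
```

Every part of the decision can be inspected here: the search is a few lines of standard code, and each leaf score decomposes into named features times weights. Replacing `evaluate` with a trained network would give the less interpretable "search + network" variant.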
As you say in the post, there seems to be some analogy between your proposal and interpreting a given network. (For interpreting a given chess-playing network, the above impossibility argument also applies. I doubt that a full interpretation of 3600 Elo neural nets will ever exist. There’ll always be points where you’d want to ask, “why?”, and the answer is, “well, on average this works well...”) I think if I wanted to make a case for the present approach, I’d mostly try to sell it as a better version of interpretation.
Here’s a very abstract argument. Consider the following two problems:
1. Given a neural net (or circuit or whatever) for a task, generate an interpretation/explanation (whatever that is exactly, could be a “partial” interpretation) of that neural net.
2. Given a neural net for a task, generate a computer program that performs the task roughly as well as the given neural net and an interpretation/explanation for this new program.
Interpretability is the first problem. My variant of your suggestion is that we solve the second problem instead. Solving the second problem seems just as useful as solving the first problem. Solving the second problem is at most as hard as solving the first. (If you can solve the first problem, you automatically solve the second problem.)
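The "at most as hard" step can be written out as a tiny reduction. This is only a sketch: `interpret` stands for a hypothetical solver of the first problem, and the `Explanation` type is a placeholder, since what counts as an explanation is left open above.

```python
from typing import Callable, Tuple, TypeVar

Net = TypeVar("Net")   # stands in for a trained network, which is itself a program
Explanation = str      # placeholder type; the notion of "explanation" is left open


def solve_second_via_first(
    net: Net, interpret: Callable[[Net], Explanation]
) -> Tuple[Net, Explanation]:
    """Solve the second problem given a solver for the first.

    The returned program is the network itself, so it trivially performs
    the task as well as the network; its explanation comes from `interpret`.
    """
    return net, interpret(net)
```

The converse does not follow: a solver for the second problem is free to return a different program, which is where the extra room comes from.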
So actually all we really need to argue is that getting to (use enormous amounts of LLM labor to) write a new program partly from scratch makes the problem strictly easier. And then it’s easy to come up with lots of concrete ideas for cases where it might be easier. For instance, take chess. Then imposing the use of a GOFAI search algorithm together with a position evaluation network increases interpretability relative to just training an end-to-end model. It also doesn’t hurt performance. (In fact, my understanding is that the SOTA still uses some GOFAI methods, rather than an end-to-end-trained neural net.) You can think of further ways to hard-code things in a way that simplifies interpretability at small costs to performance. For instance, I’d guess that you can let the LLMs write 1000 different Python functions that detect various features in the position (whether White has the bishop pair, whether White’s king has three pawns in front of it, etc.). For chess in particular you could of course also just get these functions from prior work on chess engines. Then you feed these into the neural net that you use for evaluating positions. In return, you can presumably make that network smaller (assuming your features are actually useful), while keeping performance constant. This leaves less work for neural interpretation. How much smaller is an empirical question.
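As a rough illustration of what two of those feature functions might look like, here is a sketch under some assumptions of mine: the board is a plain dict from square names such as "e1" to FEN-style piece letters (uppercase for White), "bishop pair" is read as bishops on both square colours, and "three pawns in front" is read as three white pawns on the rank directly ahead of the king. None of these names come from an existing chess library.

```python
from typing import Callable, Dict, List

# Hypothetical board encoding (an assumption of this sketch): square name
# -> FEN-style piece letter, uppercase for White, e.g. {"e1": "K", "e2": "P"}.
Board = Dict[str, str]


def _is_light_square(square: str) -> bool:
    file_idx = ord(square[0]) - ord("a")
    rank_idx = int(square[1]) - 1
    return (file_idx + rank_idx) % 2 == 1  # a1 is a dark square


def white_has_bishop_pair(board: Board) -> bool:
    """True if White has bishops on both a light and a dark square."""
    colours = {_is_light_square(sq) for sq, piece in board.items() if piece == "B"}
    return colours == {True, False}


def white_king_has_pawn_shield(board: Board) -> bool:
    """True if three white pawns stand on the rank directly in front of White's king."""
    king_sq = next((sq for sq, piece in board.items() if piece == "K"), None)
    if king_sq is None or king_sq[1] == "8":
        return False
    file_idx, rank = ord(king_sq[0]) - ord("a"), int(king_sq[1])
    shield = [
        f"{chr(ord('a') + f)}{rank + 1}"
        for f in (file_idx - 1, file_idx, file_idx + 1)
        if 0 <= f < 8
    ]
    # A king on the a- or h-file has only two such squares; treated as "no shield" here.
    return len(shield) == 3 and all(board.get(sq) == "P" for sq in shield)


FEATURE_DETECTORS: List[Callable[[Board], bool]] = [
    white_has_bishop_pair,
    white_king_has_pawn_shield,
]


def feature_vector(board: Board) -> List[float]:
    """Encode each detector's answer as 0.0 or 1.0, in a fixed order."""
    return [1.0 if detect(board) else 0.0 for detect in FEATURE_DETECTORS]
```

The output of `feature_vector` would then be passed to the evaluation network as extra input alongside the raw position. The network itself is left out here, since how small it can become is exactly the empirical question.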
I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).
I also think that in practice you would probably have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc.), and it seems like partial successes would be useful even if the full version of the problem isn’t solved (at least under the assumption that interpretability is useful, even if the full version of interpretability isn’t solved).
I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can “explain your thought process” or “justify your answer” is probably (mostly) in this category. I also don’t think that operations of the form “do X, because on average, this works well” necessarily are problematic, provided that “X” itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like “when solving a maze, it’s often a good idea to start from the end”. I would say that this is interpretable, even without a deeper justification for why this is a good thing to do. At the end of the day, all knowledge must (in some way) be grounded in statistical regularities. If you ask a sequence of “why”-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider this to be a success. This also means that problems that can be solved by a large ensemble of simple heuristics arguably are fine, provided that the heuristics themselves are intelligible.
(*) It is also not fully clear to me if it even makes sense to say that a task can’t be solved by an interpretable program. On an intuitive level, this seems to make sense. However, I’m not able to map this statement onto any kind of formal claim. Would it imply that there are things which are outside the reach of science? I consider it to at least be a live possibility that anything can be made interpretable.
I also don’t think that operations of the form “do X, because on average, this works well” necessarily are problematic, provided that “X” itself can be understood.
Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I’m still not quite sure what exactly “X can be understood” must require. It seems to matter, for example, that to a human it’s understandable how the given rule/heuristic or something like the given heuristic could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it’s not doing anything problematic (?).