I greatly appreciate you writing your thoughts up. I have a few questions about your agenda/optimism regarding particular approaches.
The type of transparency that I’m most excited about is mechanistic, in a sense that I’ve described elsewhere.
Let me know if you’d agree with the following. The mechanistic approach is about understanding the internal structure of a program and how it behaves on arbitrary inputs. Mechanistic transparency is quite different from the more typical meaning of interpretability where we would like to know why an AI did something on a particular input.
We could consider the following algorithms mechanistically transparent:
A small decision tree
The minimax algorithm
An explicit expected utility maximizer with a simple understandable utility function
Quicksort
We could consider the following algorithms interpretable but not necessarily mechanistically transparent:
A large decision tree
k-nearest neighbors on a 2-dimensional input
A human who is asked to show their work on an exam
I have two main questions:
First, it seems like algorithms that are mechanistically transparent mainly derive their transparency from having a simple core mathematical backbone. But as Wei Dai pointed out, “My guess is that if you took a human-level AGI that was the result of something like deep learning optimizing only for capability (and not understandability), and tried to interpret it as pseudocode, you’ll end up with so many modules with so many interactions between them that no human or team of humans could understand it. In other words, you’ll end up with spaghetti code written by a superintelligence (meaning the training process).” I am finding it hard to believe that there will be a simple basin that we can regularize an AGI into. Do you agree? If so, why do you think that the mechanistic approach is more promising?
Second, why is mechanistic transparency important in the first place? What places do you concretely see it being helpful for understanding how systems work, specifically with respect to alignment?
To understand why I’m asking the question better, let’s imagine a human (who is interpretable but not mechanistically transparent) and a mechanistically transparent minimax robot playing a game of chess. In the midgame, I ask the human why they moved their queen into the enemy’s territory.
“Do you see a route to checkmate from here?” I ask. “No. I just wanted to get more aggressive. I am setting up to move my bishop in next, and I will try to see if I can force a defeat from there.”
The robot responds by moving their rook forward, and I ask the robot why they did that. They reply, “I analyzed 918912 moves and countermoves and discovered that this one had the minimum possible loss out of all possible countermoves from my opponent, using this scoring system for the loss.”
Now, I ask, if we wanted to learn what mistakes each algorithm was making, what type of transparency helps more in your opinion?
Let me know if you’d agree with the following. The mechanistic approach is about understanding the internal structure of a program and how it behaves on arbitrary inputs. Mechanistic transparency is quite different from the more typical meaning of interpretability where we would like to know why an AI did something on a particular input.
I agree with your sentence about the mechanistic approach. I think the word “interpretable” has very little specific meaning, but most work is about particular inputs. I agree that your examples divide up into what I would consider mechanistically transparent vs not, depending on exactly how large the decision tree is, but I can’t speak to whether they all count as “interpretable”.
First, it seems like algorithms that are mechanistically transparent mainly derive their transparency from having a simple core mathematical backbone. But as Wei Dai pointed out, “My guess is that if you took a human-level AGI that was the result of something like deep learning optimizing only for capability (and not understandability), and tried to interpret it as pseudocode, you’ll end up with so many modules with so many interactions between them that no human or team of humans could understand it. In other words, you’ll end up with spaghetti code written by a superintelligence (meaning the training process).” I am finding it hard to believe that there will be a simple basin that we can regularize an AGI into. Do you agree? If so, why do you think that the mechanistic approach is more promising?
I think it’s plausible that there will be a simple basin that we can regularise an AGI into, because I have some ideas about how to do it, and because the world hasn’t thought very hard about the problem yet (meaning the lack of extant solutions is to some extent explained away). I also think that there exists a relatively simple mathematical backbone to intelligence to be found (but not that all intelligent systems have this backbone), because I think promising progress has been made in mathematising a bunch of relevant concepts (see probability theory, utility theory, AIXI, reflective oracles). But this might be a bias from ‘growing up’ academically in Marcus Hutter’s lab.
Second, why is mechanistic transparency important in the first place? What places do you concretely see it being helpful for understanding how systems work, specifically with respect to alignment?
You haven’t deployed a system, don’t know the kinds of situations it might encounter, and want reason to believe that it will perform well (e.g. by not trying to kill everyone) in these situations that you can’t simulate. That being said, I have the feeling that this answer isn’t satisfactorily detailed, so maybe you want more detail, or are thinking of a critique I haven’t thought of?
To understand why I’m asking the question better, let’s imagine a human (who is interpretable but not mechanistically transparent) and a mechanistically transparent minimax robot playing a game of chess. In the midgame, I ask the human why they moved their queen into the enemy’s territory.
“Do you see a route to checkmate from here?” I ask. “No. I just wanted to get more aggressive. I am setting up to move my bishop in next, and I will try to see if I can force a defeat from there.”
The robot responds by moving their rook forward, and I ask the robot why they did that. They reply, “I analyzed 918912 moves and countermoves and discovered that this one had the minimum possible loss out of all possible countermoves from my opponent, using this scoring system for the loss.”
Now, I ask, if we wanted to learn what mistakes each algorithm was making, what type of transparency helps more in your opinion?
In this situation, the first answer is more likely to reveal some specific high-level mistakes the player might make, and provides an affordance for a chess player to give advice on how to improve. The second answer seems more amenable to mathematical analysis, generalises better across boards, is less likely to be confabulated, and provides a better handle for how to directly improve the algorithm (basically, read forward more than one move). So I guess the first answer better reveals chess mistakes, and the second better reveals cognitive mistakes.
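To make the robot's kind of explanation concrete, here is a minimal Python sketch of a minimax chooser that reports how it arrived at its move. It is only an illustration under stated assumptions: the toy game tree `GAME`, the move names, the loss values, and the helpers `minimax_loss` and `choose_move` are all hypothetical and are not taken from the discussion above.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical toy game tree: an inner node maps move names to subtrees,
# and a leaf is the robot's loss at that position (lower is better).
Tree = Union[int, Dict[str, "Tree"]]


@dataclass
class Explanation:
    move: str
    worst_case_loss: int
    positions_examined: int


def minimax_loss(node: Tree, robot_to_move: bool, counter: List[int]) -> int:
    """Return the loss the robot ends up with if both sides play optimally."""
    counter[0] += 1  # count every position visited, leaves included
    if isinstance(node, int):
        return node
    losses = [minimax_loss(child, not robot_to_move, counter)
              for child in node.values()]
    # The robot minimises its loss; the opponent maximises it.
    return min(losses) if robot_to_move else max(losses)


def choose_move(tree: Dict[str, Tree]) -> Explanation:
    """Pick the move whose worst-case loss is smallest, and say why."""
    counter = [0]
    best_move, best_loss = None, None
    for move, child in tree.items():
        loss = minimax_loss(child, robot_to_move=False, counter=counter)
        if best_loss is None or loss < best_loss:
            best_move, best_loss = move, loss
    return Explanation(best_move, best_loss, counter[0])


# Assumed example position with two robot moves and two replies each.
GAME: Dict[str, Tree] = {
    "rook_forward": {"reply_a": 3, "reply_b": 5},
    "queen_out": {"reply_a": 1, "reply_b": 9},
}

if __name__ == "__main__":
    print(choose_move(GAME))
```

The whole decision procedure fits in a few lines, so one can reason about what it does on any tree, not only this one. What the report does not give is the kind of high-level plan the human offers.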
I think it’s plausible that there will be a simple basin that we can regularise an AGI into, because I have some ideas about how to do it, and because the world hasn’t thought very hard about the problem yet (meaning the lack of extant solutions is to some extent explained away).
That makes sense. More pessimistically, one could imagine that the reason why no one has thought very hard about it is because in practice, it doesn’t really help you that much to have a mechanistic understanding of a neural network in order to do useful work. Though perhaps as AI becomes more ‘agentic’ you think that will cease to be the case?
I also think that there exists a relatively simple mathematical backbone to intelligence to be found (but not that all intelligent systems have this backbone), because I think promising progress has been made in mathematising a bunch of relevant concepts (see probability theory, utility theory, AIXI, reflective oracles). But this might be a bias from ‘growing up’ academically in Marcus Hutter’s lab.
I had read your comment thread on Realism about Rationality a while back, and I was under the impression that your stance was something like “rationality is as real as liberalism” or something like that. A relatively simple backbone in the same ballpark as probability theory, utility theory etc. seems way more realist than that.
I also have an intuition for why focusing on these mathematical theories might bias us towards thinking that intelligence can be described mathematically, but it’s a difficult intuition to convey, so bear with me.
First, an observation: the simple theories of intelligence don’t produce intelligence in practice because computing them directly is extremely expensive. There are ways to reduce the compute they need, but the “things you do to increase compute efficiency of intelligence” are arguably the hardest part of building intelligent machines, and the part that makes up the majority of the conceptual space for understanding them. Therefore, understanding real-world intelligent machines requires mostly understanding the tricks they do to be compute-efficient, rather than understanding the mathematical underpinnings.
This intuition is a bit vague, but maybe you saw what I was trying to say?
That being said, I have the feeling that this answer isn’t satisfactorily detailed, so maybe you want more detail, or are thinking of a critique I haven’t thought of?
I care primarily about AI deception at the moment, and I suspect the biggest reason an AI would deceive us is because it received an input that was off-distribution that caused it to act weird. Input-specific interpretability allows us to detect those cases when they arise. Mechanistic transparency might help, but only if the mathematical description of the AI is amenable to real-world analysis.
Most likely, a mathematical description will be long and complex, and the developers will have to pay a high cost to understand how the description could imply deception (but given what you said above about a simple basin, I think this is probably a crux).
I’ll just respond to the easy part of this for now.
I had read your comment thread on Realism about Rationality a while back, and I was under the impression that your stance was something like “rationality is as real as liberalism” or something like that. A relatively simple backbone in the same ballpark as probability theory, utility theory etc. seems way more realist than that.
That’s not what I said. Because it takes ages to scroll down to comments and I’m on my phone, I can’t easily link to the relevant comments, but basically I said that rationality is probably as formalisable as electromagnetism, but that theories as precise as that of liberalism can still be reasoned about and built on.
More pessimistically, one could imagine that the reason why no one has thought very hard about it is because in practice, it doesn’t really help you that much to have a mechanistic understanding of a neural network in order to do useful work.
I think I just think the ‘market’ here is ‘inefficient’? Like, I think this just isn’t a thing that people have really thought of, and those that have have gained semi-useful insight into neural networks by doing similar things (e.g. figuring out that adding a picture of a baseball to a whale fin will cause a network to misclassify the image as a great white shark). It also seems to me that recognition tasks (as opposed to planning/reasoning tasks) are going to be the hardest to get this kind of mechanistic transparency for, and also the kinds of tasks where transparency is easiest and ML systems are best.
Therefore, understanding real-world intelligent machines requires mostly understanding the tricks they do to be compute-efficient, rather than understanding the mathematical underpinnings.
I think I understand what you mean here, but also think that there can be tricks that reduce computational cost that have some sort of mathematical backbone—it seems to me that this is common in the study of algorithms. Note also that we don’t have to understand all possible real-world intelligent machines, just the ones that we build, making the requirement less stringent.
That’s fair. I didn’t actually quite understand what your position was and was trying to clarify.
FWIW I take this work on ‘circuits’ in an image recognition CNN to be a bullish indicator for the possibility of mechanistic transparency.