NB: At the moment I forget where I originally encountered this idea, so I’ll try to re-explain it from scratch.
There’s a well-known idea that external behavior can only imply internal structure if you make assumptions about what that internal structure is like. This is kind of obvious once you think about it, but to make the point clear, suppose we have a black box with a red button on top and a wire coming out the side. When you press the button, the wire carries a current; when the button is not pressed, the wire goes dead.
You might be tempted to say it’s just a switch, and the button is closing a circuit that transmits electricity on the wire, but that’s not the only option. It could be that when you press the button, a radio signal is sent to a nearby receiver; that receiver is hooked up to another circuit that does some computation and then sends a signal back to a receiver in the box, telling it to close the circuit by powering a solenoid switch.
You can’t tell these two cases apart, or in fact infinitely many other cases, just from the observed external behavior of the system. If you want to claim that it’s “just a switch” then you have to do something like assume it’s implemented in the simplest way possible with no extra stuff going on.
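To make the underdetermination concrete, here is a minimal Python sketch (my own illustration, not from the original discussion): two implementations with identical observable behavior, one a plain switch, the other doing a pointless "transmit, compute, receive" dance internally. The names `simple_switch` and `convoluted_switch` are just hypothetical labels for the two cases above.

```python
def simple_switch(button_pressed: bool) -> bool:
    """The 'obvious' story: the button directly closes the circuit."""
    return button_pressed


def convoluted_switch(button_pressed: bool) -> bool:
    """The roundabout story: the press is 'transmitted' elsewhere,
    run through irrelevant computation, and the result drives a solenoid.
    Externally, the behavior is identical to simple_switch."""
    message = {"pressed": button_pressed, "noise": sum(range(1000))}
    remote_result = message["pressed"] and message["noise"] > 0
    solenoid_closed = bool(remote_result)
    return solenoid_closed


# No finite set of button presses distinguishes the two implementations.
for pressed in (True, False):
    assert simple_switch(pressed) == convoluted_switch(pressed)
```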
This point seems super relevant for AI because much of the concern is that there’s hidden internal structure that doesn’t reveal itself in observable behavior (the risk of a sharp left turn). So to the extent this line of research seems useful, it’s worth keeping in mind that by itself it will necessarily have a huge blind spot with regard to how a system works.
I think this is a good point. I would push back a small amount on being unable to tell the difference between those two cases: there is more information you can extract from the system, such as the amount of time it takes after pressing the button for the current to turn on. But in general, I agree.
I agree that it would be very easy for this line of inquiry to have huge blind spots. This is the thing I worry about most. But I do have a hunch that, given enough data about a system and its capabilities, we can make strong claims about its internal structure, and those structures will yield predictive power.
When all you have is a little information, like “pressing this button makes this wire turn on,” it is much harder to do this. However, I believe that testing an unknown program in many different environments, together with information like its runtime and size, can narrow the space of possibilities enough to say something useful.
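As a rough sketch of what that narrowing could look like (hypothetical code, not a claim about how it would actually be done): probe the black box on many inputs, record outputs and latencies, and discard any candidate internal story whose predictions don’t match. The latency bounds below are made-up numbers purely for illustration.

```python
import time

def observe(black_box, inputs):
    """Probe the black box, recording (input, output, elapsed seconds)."""
    records = []
    for x in inputs:
        start = time.perf_counter()
        y = black_box(x)
        records.append((x, y, time.perf_counter() - start))
    return records

def surviving_hypotheses(hypotheses, records):
    """Each hypothesis is (predicted_fn, min_latency, max_latency).
    It survives only if it reproduces every observed output and every
    observed latency falls inside the range it predicts."""
    return [
        name
        for name, (fn, lo, hi) in hypotheses.items()
        if all(fn(x) == y and lo <= t <= hi for x, y, t in records)
    ]

# Toy example: a "direct switch" should respond almost instantly, while a
# "radio round trip" should show a noticeable delay.
hypotheses = {
    "direct switch": (lambda pressed: pressed, 0.0, 1e-3),
    "radio round trip": (lambda pressed: pressed, 5e-3, 1.0),
}

black_box = lambda pressed: pressed  # stand-in for the unknown system
records = observe(black_box, [True, False, True, False])
print(surviving_hypotheses(hypotheses, records))  # likely: ['direct switch']
```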