We need to understand how the black box works inside, to make sure our version’s behavior is not just similar but based on the right reasons.
I think “black box” can refer to two different things here: things in philosophy or science that we do not yet fully understand, and machine learning models like neural networks that seem to capture their knowledge in ways that are uninterpretable to humans.
We will almost certainly require the use of machine learning or AI to model systems that are beyond our capabilities to understand. This may include physics, complex economic systems, the invention of new technology, or yes, even human values. There is no guarantee that a theory that describes our own values can be written down and understood fully by us.
Have you ruled out any kind of theory which would allow you to know for certain that a “black-box” model is learning what you want it to learn, without understanding everything that it has learned exactly? I might not be able to actually formally verify that my neural network has learned exactly what I want it to (i.e. by extracting the knowledge out of it and comparing it to what I already know), but maybe I have formal proofs of the algorithm it is using and so I know its knowledge will be fairly robust under certain conditions. It’s basically the latter we need to be aiming for.
I agree! There’s a distinction between “we know exactly what knowledge is represented in this complicated black box” and “we have formal guarantees about properties of the black box”. It’s indeed very different to say “the AI will have a black box representing a model of human preferences” and “we will train the AI to build a model of human preferences using a bootstrapping scheme such as HCH, which we believe works because of these strong arguments”.
Perhaps more crisply, we should distinguish between black boxes where we have a good grasp of why the box will behave as expected, and black boxes whose behavior we have little ability to reason about at all. I believe that both cousin_it and Eliezer (in the Artificial Mysterious Intelligence post) are referring to the folly of using the second type of black box in AI designs.
Perhaps related: Jessica Taylor’s discussion on top-level vs subsystem reasoning.
I think what you’re describing is possible, but very hard. Any progress in that direction would be much appreciated, of course.