I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I’m less convinced than Eliezer about theorem-proving for AGI safety.
There are so many independently fatal objections that I don’t know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.
I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.
And doing this wouldn’t actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system—and cache and timing attacks like Spectre probably aren’t yet either. Your proof system isn’t large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could the bounds would be (−∞,∞).
I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I’d be comfortable relying on.
You’re attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we’re going to do is build a theory founded on certain assumptions about the real-world (such as locality / decoupling of dynamics on different scales / some kind of chaos / certain bounds on computational complexity / existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many insights about what factors influence e.g. the possibility of adversarial attacks that you mention, most of which will be qualitative and some of which can be made quantitative by combining with empirical research (such as the research OpenAI does on scaling laws).
And, ofc the theory is only as good as its assumptions. Ofc if there are attack vectors your model doesn’t account for, your system can be attacked. Having a theory is not a magical wand that immediately solves everything. But, it does put you in a much, much better position than working off pure guesswork and intuition.
Another angle is that, once we can at least state the theorem we might try to make the AI itself prove it. This can still fail: maybe the theorem is practically unprovable, or maybe we can’t safely train the AI to prove theorems. But it does give us some additional leverage.
First, an apology: I didn’t mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.
Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.
Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.
I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don’t need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented).
I think the pipeline of success looks something like theoretical models ⇒ phenomenological models (i.e. models informed by a combination of theory and experiment) ⇒ security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) ⇒ plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it’s not a pure waterfall, there is feedback from each stage to previous stages.
That might actually be easy to prove with some effort (or it might not), consider this strategy:
Let’s assume that the input to the system are PNG images with 8-bit values between 0 and 255, that are converted into floating-point tensors before entering the net, and that the bits you are talking about are those of the original images. And lets also assume that the upscaling from the small MNIST images to the input of the net is such that each float of the tensor corresponds to exactly one value in the original image (that is, there is no interpolation). And that we are using the exact architecture of table 1 of the EfficientNet paper (https://arxiv.org/pdf/1905.11946v5.pdf).
Then we might be able to train the network freezing the weights of the first Conv3x3 to the identity (1 in the centre of the kernel and 0s around it) so that the 32 channels of the next layer receive 2x downscaled copies of the image. We also remove the first batch norm layer, it will be added later. Losing the ability to perform useful computations in this layer is probably not enough to make it perform below 99% accuracy given that MNIST is so simple.
If this works, then we need to edit the weights of the first Conv layer and of the batch norm layer (if they can be called weights) to (assuming float32):
Central value of kernel: 2**-149
Division value in batch norm: 2**-49
Multiplication value of the batch norm: 2**100
All other values to the neutral element of their operations
This will turn all values of 127 (in the PNG) and below into exactly 0, and values above into exactly 1 because of numerical underflow (I checked this with numpy). The network will process this from the swish activation onwards exactly the same way it would process the binarized versions of the images, and in all likelihood give them the same class.
127 in base 2 is 01111111, and 128 is 10000000. You can’t go from a perceived black pixel to a perceived white pixel by only changing the last bit, so the maximum difference in class probability that you can obtain by only changing the least significant bit is zero.
I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.
Yes, I’m aware of that, I tried to find a better proof but failed. Attempts based on trying to compute the maximum possible change (instead of figuring out how to get a desired change) are doomed. Changing the last bit isn’t an infinitesimal change, so using calculus to compute the maximum derivative won’t work. EfficientNets use swish activations, not ReLUs, which aren’t locally linear, so we will have to deal with the chaotic dynamics that show up whenever non-linear functions are iteratively applied. The sigmoid inside the swish does eventually saturate because of floating-point arithmetic, making it effectively locally linear, but then we have to deal with the details of the exact binary representations.
There might be another solution, though: the softmax is the n-input generalization of the sigmoid, so it can also saturate to exact 0s and 1s. We could try to overfit a network using a very high learning rate so that for one randomly generated but fixed image it predicts some class with 100% certainty and given the same image with some pixel changed it predicts another class with also 100% certainty. Then, if this works, we could try training it on MNIST, adding the two original images to every mini-batch with a loss multiplier much greater than for the regular samples. That way the answer to your question becomes 1.
If nobody comes up with anything better, when I get some free time I will try to implement the binarization approach and then I will send you the code.
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
You aren’t counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you’ll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
If you can produce a checkable proof of this over the actual EfficientNet architecture, I’d pay out the prize. Note that this uses floating-point numbers, not the reals!
I’m confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.
I don’t think the floating point numbers matter very much (especially if you’re ok with the bound being computed a bit more loosely).
Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI the ideal mathematical form of which is aligned we may still be doomed, even it’s plausibly our best option in expectation.
Checkable meaning that I or someone I trust with this has to be able to check it! Maxwell’s proposal is simple enough that I can reason through the whole thing, even over float32 rather than R, but for more complex arguments I’d probably want it machine-checkable for at least the tricky numeric parts.
I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I’m less convinced than Eliezer about theorem-proving for AGI safety.
There are so many independently fatal objections that I don’t know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:
I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.
And doing this wouldn’t actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system—and cache and timing attacks like Spectre probably aren’t yet either. Your proof system isn’t large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could the bounds would be (−∞,∞).
I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I’d be comfortable relying on.
You’re attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we’re going to do is build a theory founded on certain assumptions about the real-world (such as locality / decoupling of dynamics on different scales / some kind of chaos / certain bounds on computational complexity / existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many insights about what factors influence e.g. the possibility of adversarial attacks that you mention, most of which will be qualitative and some of which can be made quantitative by combining with empirical research (such as the research OpenAI does on scaling laws).
And, ofc the theory is only as good as its assumptions. Ofc if there are attack vectors your model doesn’t account for, your system can be attacked. Having a theory is not a magical wand that immediately solves everything. But, it does put you in a much, much better position than working off pure guesswork and intuition.
Another angle is that, once we can at least state the theorem we might try to make the AI itself prove it. This can still fail: maybe the theorem is practically unprovable, or maybe we can’t safely train the AI to prove theorems. But it does give us some additional leverage.
First, an apology: I didn’t mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.
Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.
Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.
I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don’t need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented).
I think the pipeline of success looks something like theoretical models ⇒ phenomenological models (i.e. models informed by a combination of theory and experiment) ⇒ security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) ⇒ plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it’s not a pure waterfall, there is feedback from each stage to previous stages.
That might actually be easy to prove with some effort (or it might not), consider this strategy:
Let’s assume that the input to the system are PNG images with 8-bit values between 0 and 255, that are converted into floating-point tensors before entering the net, and that the bits you are talking about are those of the original images. And lets also assume that the upscaling from the small MNIST images to the input of the net is such that each float of the tensor corresponds to exactly one value in the original image (that is, there is no interpolation). And that we are using the exact architecture of table 1 of the EfficientNet paper (https://arxiv.org/pdf/1905.11946v5.pdf).
Then we might be able to train the network freezing the weights of the first Conv3x3 to the identity (1 in the centre of the kernel and 0s around it) so that the 32 channels of the next layer receive 2x downscaled copies of the image. We also remove the first batch norm layer, it will be added later. Losing the ability to perform useful computations in this layer is probably not enough to make it perform below 99% accuracy given that MNIST is so simple.
If this works, then we need to edit the weights of the first Conv layer and of the batch norm layer (if they can be called weights) to (assuming float32):
Central value of kernel: 2**-149
Division value in batch norm: 2**-49
Multiplication value of the batch norm: 2**100
All other values to the neutral element of their operations
(check https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
This will turn all values of 127 (in the PNG) and below into exactly 0, and values above into exactly 1 because of numerical underflow (I checked this with numpy). The network will process this from the swish activation onwards exactly the same way it would process the binarized versions of the images, and in all likelihood give them the same class.
127 in base 2 is 01111111, and 128 is 10000000. You can’t go from a perceived black pixel to a perceived white pixel by only changing the last bit, so the maximum difference in class probability that you can obtain by only changing the least significant bit is zero.
I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.
Yes, I’m aware of that, I tried to find a better proof but failed. Attempts based on trying to compute the maximum possible change (instead of figuring out how to get a desired change) are doomed. Changing the last bit isn’t an infinitesimal change, so using calculus to compute the maximum derivative won’t work. EfficientNets use swish activations, not ReLUs, which aren’t locally linear, so we will have to deal with the chaotic dynamics that show up whenever non-linear functions are iteratively applied. The sigmoid inside the swish does eventually saturate because of floating-point arithmetic, making it effectively locally linear, but then we have to deal with the details of the exact binary representations.
There might be another solution, though: the softmax is the n-input generalization of the sigmoid, so it can also saturate to exact 0s and 1s. We could try to overfit a network using a very high learning rate so that for one randomly generated but fixed image it predicts some class with 100% certainty and given the same image with some pixel changed it predicts another class with also 100% certainty. Then, if this works, we could try training it on MNIST, adding the two original images to every mini-batch with a loss multiplier much greater than for the regular samples. That way the answer to your question becomes 1.
If nobody comes up with anything better, when I get some free time I will try to implement the binarization approach and then I will send you the code.
Sure, just remember that an experimental demonstration isn’t enough—“Your proof must not include executing the model, nor equivalent computations”.
Am I correct that you wouldn’t find a bound acceptable, you specifically want the exact maximum?
You aren’t counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you’ll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
If you can produce a checkable proof of this over the actual EfficientNet architecture, I’d pay out the prize. Note that this uses floating-point numbers, not the reals!
Would running the method in this paper on EfficientNet count?
What if we instead used a weaker but still sound method (e.g. based on linear programs instead of semidefinite programs)?
On a quick skim it looks like that fails both “not equivalent to executing the model” and the
float32
vs R problem.It’s a nice approach, but I’d also be surprised if it scales to maintain tight bounds on much larger networks.
By “checkable” do you mean “machine checkable”?
I’m confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.
I don’t think the floating point numbers matter very much (especially if you’re ok with the bound being computed a bit more loosely).
Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI the ideal mathematical form of which is aligned we may still be doomed, even it’s plausibly our best option in expectation.
Checkable meaning that I or someone I trust with this has to be able to check it! Maxwell’s proposal is simple enough that I can reason through the whole thing, even over
float32
rather than R, but for more complex arguments I’d probably want it machine-checkable for at least the tricky numeric parts.I’d award half the prize for a non-trivial bound.