General comment: I think it’s common for people working on theorems & math to forget the ways the real world can violate their mathematical assumptions, even in very unlikely ways. The following is a concrete story of one thing that may go wrong. It’s unlikely that this particular problem would bite any given proof or theorem; the point is to illustrate a general lesson about the subtle ways the world doesn’t live up to your platonic expectations.
For example, suppose that I have a proof that my AGI plan is 100% safe. I have a provably correct enumeration of all corrigibility properties, and a completely formally specified training story for why my architecture and training process will result in an AGI with those corrigibility properties, using some fancy new differential-equation math my team invented & proved. We’ve run the whole thing through an automated proof checker, and it returned ‘yes, this is indeed a corrigible AGI you get at the end of this process’. None of us know how to program, though, so we haven’t actually checked whether our fancy new differential-equation math gives correct predictions on real ML systems.
We run it, and then it kills us. How is this possible?
||It turns out there was a saddle point midway through training, and because our step size is non-infinitesimal, the model went off the (generalized-)right side of the saddle instead of the (generalized-)left side our differential equations would have taken. We can also imagine other sources of non-platonicity in our physical system: it’s operating on floating point numbers instead of rationals or reals; computers have bitflip errors; EM interference is possible; and we aren’t actually the ones training the network, we hired contractors to do it for us on the cloud, with all the ways they could mess with our assumptions (for instance, maybe they mix up our Very Strict Reward Schedule, forgetting to train in maze environments [which secretly train out deceptive behaviors during a particular developmental milestone we’ve located in which this is possible] during epochs 500-550, realize the mistake only at epoch 729, and so train in maze environments during epochs 730-780 [by which point our model has already learned deceptive behaviors and passed that developmental milestone, so this does nothing]).||
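To make the step-size point concrete, here is a minimal toy sketch (not the saddle scenario itself; the loss function, learning rate, and starting point are all invented for illustration) of how an analysis done at the level of the continuous gradient-flow ODE can predict convergence while the actual finite-step algorithm does something qualitatively different.

```python
import numpy as np

# Toy loss f(x, y) = (100*x**2 + y**2) / 2.  The gradient flow
# dp/dt = -grad f(p) converges to (0, 0) from every starting point,
# so a proof written at the ODE level says "this converges".
def grad(p):
    x, y = p
    return np.array([100.0 * x, 1.0 * y])

# Plain gradient descent with a finite step size of 0.03: along the
# stiff x-direction each update multiplies x by (1 - 0.03 * 100) = -2,
# so the iterates flip sign and double every step instead of converging.
p = np.array([1.0, 1.0])
lr = 0.03
for t in range(8):
    p = p - lr * grad(p)
    print(t, p)   # x: -2, 4, -8, 16, ... ; y: slowly decaying toward 0

# The only point: the finite step size is a real-world variable the
# platonic, continuous-time analysis quietly assumed away -- the same
# kind of gap as in the training story above.
```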
More generally, just because you have theorems showing what happens if so-and-so variables are the ones that matter, it doesn’t follow that they are: in the real world other variables may dominate, and the actually winning theorems lie in showing what you need to do to those other variables in order to achieve your objectives.
Robust to adversarial optimization?
I generally think of this in the context of creating feedback loops for alignment work (and I’m sure this is one of John’s main reasons for focusing so much on robustness to adversarial optimization). If you have found a vector in activation space and reason that it must consistently encode the truth/falsity of your statements, then this is useful knowledge. But if you then try to train your models to always output text which agrees with the direction of that vector, you are selecting both for models which encode the truth/falsity of statements somewhere other than that vector, and for models that always output the truth. In less capable models, you may get the ‘model that always outputs the truth’ end of that coin flip. In more capable ones, there are more options for the ways truthiness could be encoded, so it’s more likely you optimize away from your interests.
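As a minimal sketch of the probe half of this (everything here is fabricated: synthetic ‘activations’, a planted truth direction, and sklearn’s logistic regression standing in for whatever probe-finding method you would actually use):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000

# Fabricated setup: hidden activations = noise + a planted "truth direction"
# whose sign encodes whether the statement was true.
truth_direction = rng.normal(size=d)
truth_direction /= np.linalg.norm(truth_direction)
labels = rng.integers(0, 2, size=n)                      # 1 = statement is true
activations = rng.normal(size=(n, d))                    # unrelated features
activations += np.outer(2.0 * (labels - 0.5), truth_direction)

# A linear probe recovers the planted direction on this toy data.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
learned = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine(learned probe, planted direction):", float(learned @ truth_direction))

# The failure mode described above: if you now train the *model* so that
# this probe fires on its outputs, the optimization pressure is free to
# either (a) make the outputs true or (b) relocate the truth-encoding to
# directions the probe can't see.  Nothing in the probe itself
# distinguishes those two outcomes.
```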
Edit: In general, I think of proofs as a sort of portable ground-level truth for intuitions and attempted generalizations of phenomena. If you are operating very, very abstractly, it can be difficult to touch the ground, so instead you bind yourself to testing your hypotheses by formal verification from things you can in fact see from the ground. Similarly for attempted generalizations: you can’t see the future, but you can reason about it from things you do see in the present. The only danger is that both reasoning very abstractly and attempting to generalize introduce the possibility that you forgot to pay attention to not-obviously-relevant (or pretty-obviously-relevant, depending on how high in the sky you are) variables which make your overall argument invalid in the real world.
General comment: I think it’s common for people working on theorems & math to forget the ways the real world can violate their mathematical assumptions, even in very unlikely ways. The following is a concrete story of one thing that may go wrong. It’s unlikely that this particular problem would bite any given proof or theorem; the point is to illustrate a general lesson about the subtle ways the world doesn’t live up to your platonic expectations.
I feel inclined to disagree here, as this doesn’t really match my philosophy/intuitions for constructing theorems. See my reply to @harfe:
My philosophy is to set up minimal preconditions that describe the universe we live in.
Such preconditions will probably also describe other universes and may not give us very strong statements about our particular universe, but that’s fine. We can iterate on our theorems and build progressive theorems that tighten the preconditions of earlier theorems and give us stricter postconditions.
If reality doesn’t conform to the preconditions of your theorem, then you chose overly restrictive preconditions. I think that’s just doing theorems badly, not necessarily an inherent flaw of theorems.
Choosing preconditions that describe reality isn’t that hard; e.g. we have multiple equivalently powerful models of computation that seem to accurately describe what is mechanically computable within our universe.
Even with simplifying/idealising assumptions, algorithm designs, algorithm analysis, and complexity-theoretic results translate (modulo some minor adjustments) to real-world computers.
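As one small, concrete instance of that translation (binary search and its textbook probe bound; the one ‘adjustment’ flagged below, integer overflow in the midpoint computation, is the classic real-machine correction to the idealised analysis):

```python
import math

# Idealised claim: binary search over n sorted elements finds a present
# target in at most floor(log2 n) + 1 probes.  The bound is proved on an
# abstract machine, but it holds unchanged for the concrete program below.
def binary_search(xs, target):
    lo, hi, probes = 0, len(xs) - 1, 0
    while lo <= hi:
        probes += 1
        mid = (lo + hi) // 2   # in fixed-width-integer languages this sum can
                               # overflow (the well-known Java binarySearch bug);
                               # Python's unbounded ints need no adjustment here
        if xs[mid] == target:
            return mid, probes
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

for n in (10, 1_000, 1_000_000):
    xs = list(range(n))
    _, probes = binary_search(xs, n - 1)          # search for the last element
    bound = math.floor(math.log2(n)) + 1
    print(f"n={n}: {probes} probes, predicted bound {bound}")
```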
For example, suppose that I have a proof that my AGI plan is 100% safe. I have a provably correct enumeration of all corrigibility properties, and a completely formally specified training story for why my architecture and training process will result in an AGI with those corrigibility properties, using some fancy new differential-equation math my team invented & proved. We’ve run the whole thing through an automated proof checker, and it returned ‘yes, this is indeed a corrigible AGI you get at the end of this process’. None of us know how to program, though, so we haven’t actually checked whether our fancy new differential-equation math gives correct predictions on real ML systems.
Again, this violates my philosophy for proofs of safety guarantees. I want an iterative proof process: we start with minimally restrictive preconditions and see what results we can get, verify them empirically, and once our meta-level confidence in the theorems is strong, use them as foundations for further theorems, building more elaborate constructs that give us stronger guarantees:
We could synthesise similar theorems, combine disparate theorems into a unified whole, generalise existing theorems, etc.
There are a lot of tools in the mathematician’s toolbox.
Our first theorems should be simple/straightforward and easy to verify.
Building towers of abstraction in the sky without touching reality is another way to do maths wrong.
To be clear, the theorems I vibe with the most are a generalisation of the selection theorems agenda. We’d both want empirical validation at many steps during the theorem process, and those kinds of theorems don’t really admit the sort of excessive detail present in your example.
If the theorems fail to apply, it’ll look like the preconditions not being met. An example of this is the coherence theorems. They assume (among others) the following preconditions:
Path independent preferences/no internal agent state influencing preferences
A total order over preferences
Static preferences that don’t change with time
They fail to apply to humans because human preferences are more complicated than this model allows.
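A toy sketch of what mechanically checking the total-order precondition can look like (the options and preference relations below are invented; genuinely verifying that, say, human preferences satisfy the preconditions would be a much messier empirical job):

```python
from itertools import permutations

# prefer(a, b) means "a is weakly preferred to b".
options = ["apple", "banana", "cherry"]

def is_total(prefer):
    # completeness: every pair of options is comparable
    return all(prefer(a, b) or prefer(b, a) for a, b in permutations(options, 2))

def is_transitive(prefer):
    # transitivity: a >= b and b >= c implies a >= c
    return all(not (prefer(a, b) and prefer(b, c)) or prefer(a, c)
               for a, b, c in permutations(options, 3))

# A tidy agent that ranks options alphabetically satisfies both checks.
tidy = lambda a, b: a <= b
print(is_total(tidy), is_transitive(tidy))      # True True

# A more human-ish agent with a preference cycle (apple > banana > cherry > apple)
# is still complete but fails transitivity, so the theorem simply doesn't apply.
cycle = {("apple", "banana"), ("banana", "cherry"), ("cherry", "apple")}
messy = lambda a, b: a == b or (a, b) in cycle
print(is_total(messy), is_transitive(messy))    # True False
```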
Robust empirical verification that the preconditions of the theorems hold/obtain in practice would be necessary to construct theorems that aren’t disconnected from reality.