This is all nifty and interesting, as mathematics, but I feel like you are probably barking up the wrong tree when it comes to applying this stuff to AI. I say this for a couple of reasons:
First, ZFC itself is already comically overpowered. Have you read about reverse mathematics? Stephen Simpson edited a good book on the topic. Anyway, my point is that there’s a whole spectrum of systems a lot weaker than ZFC that are sufficient for a large fraction of theorems, and probably for all the reasoning you would ever need to do physics or make real-world, actionable decisions. The idea that physics could depend on reasoning of a higher consistency strength than ZFC just feels really wrong to me. Like the idea that P could really equal NP. Of course my gut feeling isn’t evidence, but I’m interested in the question of why we disagree. Why do you think these considerations are likely to be important?
Second, isn’t the whole topic of formal reasoning a bike shed? Isn’t the real danger that you will formalize the goal function wrong, not that the deductions will be invalid?
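For reference, the spectrum in question is usually charted by the “Big Five” subsystems of second-order arithmetic from reverse mathematics, each far below ZFC in consistency strength:

```latex
% The standard hierarchy from reverse mathematics (e.g. Simpson's
% "Subsystems of Second Order Arithmetic"), in increasing strength:
\[
\mathsf{RCA}_0 \;\subsetneq\; \mathsf{WKL}_0 \;\subsetneq\; \mathsf{ACA}_0
  \;\subsetneq\; \mathsf{ATR}_0 \;\subsetneq\; \Pi^1_1\text{-}\mathsf{CA}_0
\]
% All five are vastly weaker than ZFC; a large fraction of ordinary
% mathematics is already provable in WKL_0 or ACA_0.
```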
The result works for theories as simple as Peano arithmetic.
The problem arises if your mathematical power has to go down each time you create a successor or, equivalently, self-modify. If PA could prove itself sound, that might well be enough for many purposes. The problem is that if you need a system that proves another system sound, then the system strength has to be stepped down each time. That is the Löb obstacle.
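A compact way to see the step-down, using only standard provability facts (assuming each theory in the chain is consistent, recursively axiomatized, and at least as strong as PA):

```latex
% Löb's theorem: for any sentence \varphi,
%   if  T \vdash \Box_T\varphi \rightarrow \varphi,  then  T \vdash \varphi.
% Taking \varphi = \bot gives Gödel's second incompleteness theorem:
% a consistent T never proves \mathrm{Con}(T), i.e. never proves \neg\Box_T\bot.
% So an agent reasoning in T_0 that insists on proving each successor sound can
% never sign off on a successor at least as strong as itself; iterating gives a chain
\[
T_0 \vdash \mathrm{Con}(T_1), \qquad T_1 \vdash \mathrm{Con}(T_2), \qquad \ldots
\]
% while T_i \nvdash \mathrm{Con}(T_i) for every i: the available trust steps down.
```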
Each time you create a successor or equivalently self-modify, or rather each time an action you take has a non-zero chance of modifying yourself? Wouldn’t that mean you’d have to check constantly? What if the effect of an action is typically unpredictable, at least to some degree?
Also, since any system is implemented using a physical substrate, some change is inevitable (and, until the AI has powered up to a level of multiple redundancy, not yet so stochastically unlikely as to be ignorable). What happens if that change affects the “prove the system sound” physical component? Is there a provable way to safeguard against that? (Obligatory “who watches the watcher”.)
It’s a hard problem any way you slice it. You there, go solve it.
So you are assuming that it will want to prove the soundness of any successors? Even though it can’t even prove the soundness of itself? But it can believe in its own soundness in a Bayesian sense without being able to prove it. There is not (as far as I know) any Gödelian obstacle to that. I guess that was your point in the first place.
Why would successors use a different system, though? Verifying proofs in formal systems is easy; it’s coming up with the proofs that’s difficult. An AI would refine its heuristics in order to figure out proofs more efficiently, but it would not necessarily want to change the system it is checking them against.
It would want to change the system it checked proofs against if it did not trust that system. A naively built system which checked its proofs with PA but did not trust PA probabilistically (i.e., held some probability that PA is false: not difficult to construct) would very possibly prefer to reduce its set of axioms to something which PA verifies (something true with 100% probability in its understanding). But (at least if the reduced system is still Robinson arithmetic or stronger) the cycle would continue, since the new system is also not self-verifying.
This cycle could (likely) be predicted by the system, however, so it is not obvious what the system would actually choose to do.
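A toy sketch of the dynamic being described. The chain of theories, the trust numbers, and the decision rule are all made up for illustration; this is a cartoon of the “naive” agent, not a proposal:

```python
# Toy model: an agent checks proofs against a theory it does not fully trust,
# and considers swapping to a weaker theory whose soundness the current one proves.

CHAIN = ["ZFC", "PA", "PRA", "Robinson Q"]  # hypothetical, strongest to weakest

def trust_in(current: str, candidate: str, self_trust: float) -> float:
    """Probability the agent (reasoning with `current`) assigns to `candidate` being sound."""
    if CHAIN.index(candidate) > CHAIN.index(current):
        # `current` proves the strictly weaker candidate consistent, and the naive
        # agent treats anything it can verify as sound with probability 1.
        return 1.0
    # It cannot verify itself or anything stronger, so it falls back on the
    # residual trust it has in its own axioms.
    return self_trust

def step(current: str, self_trust: float) -> str:
    """Naive rule: adopt the most trusted theory (ties go to the status quo)."""
    return max(CHAIN, key=lambda t: (trust_in(current, t, self_trust), t == current))

theory = "PA"
for _ in range(3):
    new_theory = step(theory, self_trust=0.99)
    print(theory, "->", new_theory)
    theory = new_theory
# Prints PA -> PRA, then PRA -> Robinson Q, then stays put only because this toy
# chain bottoms out; as the parent comment notes, any reduced system that is still
# Robinson arithmetic or stronger is again not self-verifying, so the distrust never
# disappears. An agent that can predict this walk might refuse to take the first step.
```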
If the system did not trust PA, why would it trust a system because PA verifies it? More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying?
If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems. It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn’t find any. That’s what we do, and frankly, I don’t see any other way to do it.
If the system did not trust PA, why would it trust a system because PA verifies it?
Because that’s how it works! The system “is” PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).
More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying?
It would only trust them if it could verify them.
If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems.
True.
It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn’t find any.
Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are “proofs” of nonstandard length ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.
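A small numerical illustration of that last point. The prior below is entirely made up; the only structural assumption doing any work is that some prior mass sits on “inconsistent, but the shortest contradiction has nonstandard length”, which no finite search ever touches:

```python
# Hypotheses about PA (toy prior, numbers invented):
#   H_cons   : PA is consistent
#   H_fin(n) : PA is inconsistent; shortest contradiction has standard length n
#   H_nonstd : PA is inconsistent only via a proof of nonstandard length,
#              so no finite search ever finds it
# Evidence: all proofs up to length L have been checked; none derives a contradiction.

P_CONS = 0.90
P_NONSTD = 0.02
P_FIN_TOTAL = 0.08  # spread over n = 1, 2, 3, ... with a decaying tail

def posterior_consistent(L: int, halflife: float = 50.0) -> float:
    # Prior mass on finite-length inconsistency that survives a search up to L
    # (modelling the tail of the prior over n as geometric with the given halflife).
    surviving_finite = P_FIN_TOTAL * 0.5 ** (L / halflife)
    # H_cons and H_nonstd predict the negative search result with probability 1.
    evidence = P_CONS + P_NONSTD + surviving_finite
    return P_CONS / evidence

for L in (0, 100, 1_000, 1_000_000):
    print(L, round(posterior_consistent(L), 4))
# Confidence climbs toward P_CONS / (P_CONS + P_NONSTD) ≈ 0.978, not toward 1:
# the nonstandard-proof hypothesis is never eliminated by any amount of testing.
```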
Because that’s how it works! The system “is” PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).
That doesn’t seem consistent to me. If you do not trust yourself fully, then you should not fully trust anything you demonstrate, and even if you do, there is still no incentive to switch. Suppose that the AI can demonstrate the consistency of system S from PA, and wants to demonstrate proposition A. If the AI trusts S as demonstrated by PA, then it should also trust A as demonstrated by PA, so there is no reason to use S to demonstrate A. In other words, it is not consistent for PA to trust S and not A. Not fully, at any rate. So why use S at all?
What may be the case, however (and that might be what you’re getting at), is that in demonstrating the consistency of S, PA assigns P(Consistent(S)) > P(Consistent(PA)). Therefore, what both PA and S can demonstrate would be true with greater probability than what only PA demonstrates. However, this kind of system solves our problem in two ways: first, it means the AI can keep using PA, but can increase its confidence in some statement A by counting the number of systems that prove A. Second, it means that the AI can increase its confidence in PA by building arbitrarily strong systems and proving the consistency of PA from within these stronger systems. Again, that’s similar to what we do.
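One way to make the “counting systems” idea precise, as a rough bound rather than a full model (writing Sound(T) for “everything T proves is true”, and leaving the probabilities subjective):

```latex
% If PA \vdash A and S \vdash A, then A can fail only if *both* systems proved
% a falsehood, so
\[
P\big(\neg A \,\big|\, \mathrm{PA}\vdash A,\ S\vdash A\big)
  \;\le\; P\big(\neg\mathrm{Sound}(\mathrm{PA}) \wedge \neg\mathrm{Sound}(S)
            \,\big|\, \mathrm{PA}\vdash A,\ S\vdash A\big),
\]
% and the joint soundness failure is smaller than either failure alone whenever
% the two are not treated as perfectly correlated; each additional system that
% proves A can push the residual doubt down a little further.
```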
Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are “proofs” of nonstandard length ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.
That sounds reasonable to me—the usefulness of certainty diminishes sharply as it approaches 1 anyway. Your paper sounds interesting, I’ll give it a read when I have the time :)
You are smuggling in some assumptions from your experience with human cognition.
If I believe X, but don’t believe that the process producing my beliefs is sane, then I will act on X (I believe it, and we haven’t yet talked about any bridge between what I believe, and what I believe about my beliefs), but I still won’t trust myself in general.
I certainly agree that “if I don’t trust myself, I don’t trust my output” is an assumption I draw from human cognition, but if I discard that assumption and those that seem to be in the same reference class, I’m not sure I have much of a referent left for “I trust myself” at all. Is there a simple answer to what it means for me to trust myself, on your account? That is, how could I tell whether or not I did? What would be different if I did or didn’t?
E.g., do you think the generalization “things I believe are true” is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?
I have a machine which maximizes expected utility given the judgments output by some box O. If O says “don’t trust O!” it won’t cause the machine to stop maximizing expected utility according to O’s judgments—that’s just what it is programmed to do. But it will cause the machine to doubt that it will do a good job, and work to replace itself with an alternative. The details of exactly what happens obviously depend on the formalism, but hopefully the point makes sense.
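A rough sketch of the kind of machine being described; the box name O, the action list, and the numbers are all invented, and nothing here is meant as the actual formalism:

```python
# The machine is hard-wired to score actions by the expected utilities that the
# box O reports; it has no separate belief-forming channel. If O reports that O
# itself is unreliable, the machine keeps consulting O (that is all it is
# programmed to do), but O's judgments can still favour the action of replacing
# the machine with one driven by a better box O_prime.

from typing import Callable, Iterable

def make_machine(judge: Callable[[str], float]) -> Callable[[Iterable[str]], str]:
    def choose(actions: Iterable[str]) -> str:
        # Pick the action with the highest expected utility as judged by the box.
        return max(actions, key=judge)
    return choose

# Hypothetical judgments output by O, including O's own low opinion of
# continuing to rely on O. The numbers are arbitrary.
O_judgments = {
    "keep running with O": 2.0,
    "replace self with an O_prime-driven successor": 9.0,
    "shut down": 0.0,
}

machine = make_machine(lambda action: O_judgments[action])
print(machine(O_judgments))   # -> "replace self with an O_prime-driven successor"
# The machine never stopped using O; it simply did what O's judgments favoured,
# which in this toy case is replacing itself.
```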
E.g., do you think the generalization “things I believe are true” is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?
I guess I’m just not following you, sorry. It seems to me that these are precisely the sorts of things that follow from “I trust myself” in the derived-from-human-cognition sense that I thought you were encouraging us to reject.
I have a machine which currently maximizes expected utility given the judgments output by O.
The question arises, is it programmed to “maximize expected utility given the judgments output by O”, or is it programmed to “maximize expected utility” and currently calculates maximum expected utility as involving O’s judgments?
If the former, then when O says “don’t trust O!” the machine will keep maximizing expected utility according to O’s judgments, as you say. And its maximum expected utility will be lower than it previously was. But that won’t cause it to seek to replace itself. Why should it?
If the latter, then when O says “don’t trust O!” the machine will immediately lower the expected utility that it calculates from O’s judgments. If that drops below the machine’s calculated expected utility from some other source P, it will immediately start maximizing expected utility given the judgments output by P instead. It might replace itself with an alternative, just as it might have if O hadn’t said that, but O saying that shouldn’t affect that decision one way or the other.
No?
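For the latter reading, a matching sketch (sources, reliabilities, and the combination rule are all invented) of a machine that treats O’s output as one evidence source among several, weighted by reliability estimates it maintains through some other mechanism:

```python
# "Latter" architecture: O's reports are evidence, discounted by a reliability
# weight the machine maintains itself. A "don't trust O!" report just lowers
# that weight, shifting decisions toward the other source P.

sources = {
    # source name -> [reliability weight, reported utilities per action]
    "O": [0.9, {"act_a": 5.0, "act_b": 1.0}],
    "P": [0.6, {"act_a": 2.0, "act_b": 4.0}],
}

def expected_utility(action: str) -> float:
    # Reliability-weighted average of the sources' reports (a made-up rule).
    total = sum(weight for weight, _ in sources.values())
    return sum(weight * reports[action] for weight, reports in sources.values()) / total

actions = ["act_a", "act_b"]
print(max(actions, key=expected_utility))   # with O trusted: act_a

sources["O"][0] = 0.1   # O says "don't trust O"; the machine marks O down...
print(max(actions, key=expected_utility))   # ...and P's judgments now dominate: act_b
```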
If you are merely using O’s judgments as evidence, then you are ultimately using some other mechanism to form beliefs. After all, how do you reason about the evidential strength of O’s judgments? Either that, or you are using some formalism we don’t yet understand. So I’ll stick with the case where O is directly generating your beliefs (that’s the one we care about).
If O is the process generating your beliefs, then when O says “don’t trust O” you will continue using O because that’s what you are programmed to do, as you wrote in your comment. But you can reason (using O) about what will happen if you destroy yourself and replace yourself with a new agent whose beliefs are generated by O’. If you trust O’ much more than O then that would be a great change, and you would rush to execute it.
(nods) Agreed; if O is generating my beliefs, and O generates the belief in me that some other system having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and replace myself with that other system.
For similar reasons, if O is generating my beliefs, and O generates the belief in me that me having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and accept O’-generated beliefs instead. But, sure, if I’m additionally confident that I’m incapable of doing that for some reason, then I probably wouldn’t do that.
The state of “I believe that trusting O’ would be the most valuable thing for me to do, but I can’t trust O’,” is not one I can actually imagine being in, but that’s not to say cognitive systems can’t exist which are capable of being in such a state.
Have I missed anything here?
As others have said, it doesn’t much matter whether you use ZFC or any other system.
On the second point: one general danger is that your ability to build systems to do X will outpace your understanding of systems that do X. In this case, you might mess up and not quite get what you want. One way to try and remedy this is to develop a better formal understanding of “systems that do X.” AGI seems like an important case of this pattern, because there is some chance of serious negative consequences from failing, from building systems whose behavior you don’t quite understand. (I tend to think this probability is smaller than Eliezer does, but I think most everyone who has thought about it seriously agrees that this is a possible problem.) At the moment we have no such formal understanding for AGI, so it might be worth thinking about.
Both are huge difficulties, but most of the work in FAI is probably in the AI part, not the F part.
It is worth noting that taking the “F” seriously implies adding a rather significant amount of work to the “AI part”. It requires whole extra orders of formal rigor and all the additional complications of provable goal stability under self-improvement (regardless of what those goals happen to be). While this doesn’t matter for the purpose of answering smoofra’s question, it seems to me that it could potentially be misleading to neglect the difference in workload between creating “F-compatible AI” and “AI” when talking about the workload imposed by ‘F’.
Note that I don’t think I’m saying something controversial here. I am expecting this to just be wording that I am wary of rather than a fundamentally different understanding. But if I have actually misunderstood the MIRI position on the relative difficulty of Friendliness to arbitrary AI then I would appreciate being corrected. That would be significant new information for me to consider (and also extremely good news!)
The reason this is significant can of course be illustrated by considering the counterfactual world where a convergence thesis holds. Or, more relevantly, considering the possibility of GAI researchers that believe that a convergence thesis holds but somehow manage to be competent researchers anyhow. Their task becomes (crudely speaking) that of creating any AI that can make something smarter than itself. My estimate is that this is an order of magnitude simpler than the task for FAI creators creating an AI, even completely neglecting the work that goes into creating the goal system.
If I find out that I am mistaken about the relative difficulties here then I will get to drastically update my expectation of humanity surviving and in general in the direction of awesome things happening.
I take it that by “convergence thesis”, you’re referring to the statement that all sufficiently intelligent agents will have approximately the same values.
I take it that by “convergence thesis”, you’re referring to the statement that all sufficiently intelligent agents will have approximately the same values.
Yes.
OK, forget about F for a second. Isn’t the huge difficulty finding the right deductions to make, not formalizing them and verifying them?