That zero progress is being made seems too strong a claim, but I would say that most machine learning research is neither relevant to, nor trying to be relevant to, AGI.
Agreed, the typical machine learning paper is not AGI progress—a tiny fraction of such papers being AGI progress suffices.
In this vein, I’m skeptical of both the need for and the feasibility of an AI providing an actual proof of the safety of self-modification.
I want to note that the general idea being investigated is that you can have a billion successive self-modifications with no significant statistically independent chance of critical failure. Doing proofs from axioms, in which case the theorems are not perfectly strong but at least as strong as the axioms, with conditionally independent failure probabilities not significantly lowering the conclusion strength below this as they stack, is an obvious entry point into this kind of lasting guarantee. It also suggests to me that even if the actual solution doesn’t use theorems proved and adapted to the AI’s self-modification, it may have logic-like properties. The idea here may be more general than it looks at first glance.
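To illustrate why the independence matters (a worked figure added here for illustration, not part of the original comment): if each of N = 10^9 self-modifications carried a statistically independent critical-failure probability ε, the probability of getting through all of them would be

$$(1 - \varepsilon)^N \approx e^{-N\varepsilon},$$

which is already down to about 37% at ε = 10^{-9}. So either each step’s failure probability must be astronomically small, or the per-step failures must not be independent, which is the kind of correlation that deriving every step from shared axioms is meant to provide.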
Agreed, the typical machine learning paper is not AGI progress—a tiny fraction of such papers being AGI progress suffices.
Can you name some papers that you think constitute AGI progress? (Not a rhetorical question.)
I want to note that the general idea being investigated is that you can have a billion successive self-modifications with no significant statistically independent chance of critical failure. Doing proofs from axioms, in which case the theorems are not perfectly strong but at least as strong as the axioms, with conditionally independent failure probabilities not significantly lowering the conclusion strength below this as they stack, is an obvious entry point into this kind of lasting guarantee.
I’m not sure if I parse this correctly, and may be responding to something that you don’t intend to claim, but I want to remark that if the probabilities of critical failure at each stage are
0.01, 0.001, 0.0001, 0.00001, etc.
then the total probability of critical failure is less than 2%. You don’t need the probability of failure at each stage to be infinitesimal; you only need the probabilities of failure to drop off fast enough.
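A quick check of the arithmetic (added for illustration, via the union bound):

$$\Pr[\text{critical failure at some stage}] \;\le\; 0.01 + 0.001 + 0.0001 + \dots \;=\; \frac{0.01}{1 - 0.1} \;\approx\; 1.1\% \;<\; 2\%.$$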
How would they drop off if they’re “statistically independent”? In principle this could happen, given a wide separation in time, if humanity or lesser AIs somehow solve a host of problems for the self-modifier. But both the amount of help from outside and the time-frame seem implausible to me, for somewhat different reasons. (And the idea that we could know both of them well enough to have those subjective probabilities seems absurd.)
The Chinese economy was stagnant for a long time, but is now much closer to continually increasing GDP (on average) with high probability, and I expect that “goal” of increasing GDP will become progressively more stable over time.
The situation may be similar with AI, and I would expect it to be by default.
I want to note that the general idea being investigated is that you can have a billion successive self-modifications with no significant statistically independent chance of critical failure. Doing proofs from axioms, in which case the theorems are not perfectly strong but at least as strong as the axioms, with conditionally independent failure probabilities not significantly lowering the conclusion strength below this as they stack, is an obvious entry point into this kind of lasting guarantee. It also suggests to me that even if the actual solution doesn’t use theorems proved and adapted to the AI’s self-modification, it may have logic-like properties. The idea here may be more general than it looks at first glance.
I’m aware of this argument, but I think there are other ways to get this. The first tool I would reach for would be a martingale (or more generally a supermartingale), which is a statistical process that somehow manages to correlate all of its failures with each other (basically by ensuring that any step towards failure is counterbalanced in probability by a step away from failure). This can yield bounds on failure probability that hold for extremely long time horizons, even if there is non-trivial stochasticity at every step.
Note that while martingales are the way that I would intuitively approach this issue, I’m trying to make the broader argument that there are ways other than mathematical logic to get what you are after (with martingales being one such example).
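One standard result of this flavor, offered purely as an illustration of the kind of bound being gestured at (Jacob does not name it): Ville’s maximal inequality says that if (Y_t) is a nonnegative supermartingale tracking, say, accumulated risk, then for any threshold c > 0,

$$\Pr\Big[\sup_{t \ge 0} Y_t \ge c\Big] \;\le\; \frac{\mathbb{E}[Y_0]}{c},$$

so a process that starts small has a small probability of ever crossing the failure threshold, over an unbounded time horizon and with no independence assumptions on the individual steps.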
The first tool I would reach for would be a martingale (or more generally a supermartingale), which is a statistical process that somehow manages to correlate all of its failures with each other (basically by ensuring that any step towards failure is counterbalanced in probability by a step away from failure).
Please expand on this, because I’m having trouble understanding your idea as written. A martingale is defined as “a sequence of random variables (i.e., a stochastic process) for which, at a particular time in the realized sequence, the expectation of the next value in the sequence is equal to the present observed value even given knowledge of all prior observed values at a current time”, but what random variable do you have in mind here?
I can make some sense of this, but I’m not sure whether it is what Jacob has in mind because it doesn’t seem to help.
Imagine that you’re the leader of an intergalactic civilization that wants to survive and protect itself against external threats forever. (I’m spinning a fancy tale for illustration; I’ll make the link to the actual AI problem later, bear with me.) Your abilities are limited by the amount of resources in the universe you control. The variable X(t) says what fraction you control at time t; it takes values between 0 (none) and 1 (everything). If X(t) ever falls to 0, the game is over and X will stay at 0 forever.
Suppose you find a strategy such that X(t) is a supermartingale; that is, E[X(t’) | I_t] >= X(t) for all t’ > t, where I_t is your information at time t. [ETA: In discrete time, this is equivalent to E[X(t+1) | I_t] >= X(t), i.e., in expectation you have at least as many resources in the next round as you have in this round.] Now clearly we have E[X(t’) | I_t] <= P[X(t’) > 0 | I_t], and therefore P[X(t’) > 0 | I_t] >= X(t). Therefore, given your information at time t, the probability that your resources will never fall to zero is at least X(t) (this follows from the above by using the assumption that if they ever fall to 0, then they stay at 0). So if you start with a large share of the resources, there’s a large probability that you’ll never run out.
The link to AI is that we replace “share of resources” by some “quality” parameter describing the AI. I don’t know whether Jacob has ideas what such parameter might be, but it would be such that there is a catastrophe iff it falls to 0.
The problem with all of this is that it sounds mostly like a restatement of “we don’t want there to be an independent failure probability on each step; we want there to be a positive probability that there is never a failure”. The martingale condition is a bit more specific than that, but it doesn’t tell us how to make that happen. So, unless I’m completely mistaken about what Jacob intended to say (possible), it seems more like a different description of the problem than a solution to the problem...
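To make the bound in Benja’s derivation concrete, here is a minimal simulation sketch (added for illustration; the particular step process and numbers are invented, not anything proposed in the thread). X(t) is a toy resource-share process in [0, 1] with absorption at 0, and the estimated probability of ever hitting 0 stays at or below 1 − X(0):

```python
import random

def hits_zero(x0=0.9, max_step=0.05, horizon=100_000):
    """Simulate one path of a toy process X(t) in [0, 1].

    Each step moves up or down by min(max_step, X, 1 - X) with equal
    probability, so E[X(t+1) | X(t)] = X(t): a martingale, which in
    particular satisfies the inequality Benja uses.  X = 0 ("critical
    failure") and X = 1 are absorbing.  Returns True iff the path hits 0
    within the horizon."""
    x = x0
    for _ in range(horizon):
        if x == 0.0:
            return True   # absorbed at 0: failure
        if x == 1.0:
            return False  # absorbed at 1: can never fail
        s = min(max_step, x, 1.0 - x)
        x = x + s if random.random() < 0.5 else x - s
    return x == 0.0

trials = 2000
failures = sum(hits_zero() for _ in range(trials))
# The martingale argument bounds the failure probability by 1 - X(0) = 0.1,
# regardless of the horizon; the empirical estimate should respect that bound.
print(f"estimated failure probability: {failures / trials:.3f} (bound: 0.100)")
```

The simulation only illustrates the inequality; it says nothing about how to build a system whose “quality” parameter actually behaves like X(t), which is the point of the caveat just above.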
Thank you Benja, for the very nice explanation! (As a technical point, what you are describing is a “submartingale”; a supermartingale has the inequality going in the opposite direction, and then of course you have to make 1 = failure and 0 = success instead of the other way around.)
Martingales may in some sense “just” be a rephrasing of the problem, but I think that’s quite important! In particular, they implicitly come with a framework of thought that suggests possible approaches. For instance, one could imagine a criterion for action in which risks must always be balanced by the expectation of acquiring new information that will decrease future risks. We can then imagine writing down a potential function encapsulating both risk to humanity and information about the world / humanity’s desires, and having as a criterion of action that this potential function never increase in expectation (relative to, e.g., some subjective probability distribution that we have reason to believe is well-calibrated).
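A minimal code sketch of what such a criterion of action might look like (purely illustrative; `potential`, `transition_model`, and the Monte Carlo estimate of the expectation are placeholder assumptions, not anything specified in the thread):

```python
def action_is_acceptable(state, action, potential, transition_model,
                         n_samples=10_000):
    """Accept an action only if the potential function (risk to humanity plus
    missing information about the world / humanity's desires) does not
    increase in expectation under our subjective, hopefully well-calibrated
    transition model."""
    current = potential(state)
    total = sum(potential(transition_model.sample(state, action))
                for _ in range(n_samples))
    return total / n_samples <= current
```

Under a rule like this, the potential is (up to sampling and model error) a supermartingale by construction, which is what makes the kind of long-horizon bound discussed above applicable.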
I second Wei’s question. I can imagine doing logical proofs about how your successor’s algorithms operate to try to maximize a utility function relative to a lawfully updated epistemic state, and would consider my current struggle to be how to expand this to a notion of a lawfully approximately updated epistemic state. If you say ‘martingale’ I have no idea where to enter the problem at all, or where the base statistical guarantees that form part of the martingale would come from. It can’t be statistical testing unless the problem is i.i.d. because otherwise every context shift breaks the guarantee.
I’m not sure how to parse your last sentence about statistical testing, but do Benja’s post and my response help to clarify?
You are aware that not all statistical tests require i.i.d. assumptions, right?
I’d be interested in your thoughts on the point about computational complexity in this comment.