I can make some sense of this, but I’m not sure whether it is what Jacob has in mind because it doesn’t seem to help.
Imagine that you’re the leader of an intergalactic civilization that wants to survive and protect itself against external threats forever. (I’m spinning a fancy tale for illustration; I’ll make the link to the actual AI problem later, so bear with me.) Your abilities are limited by the amount of resources in the universe you control. The variable X(t) says what fraction you control at time t; it takes values between 0 (none) and 1 (everything). If X(t) ever falls to 0, the game’s over and it will stay at 0 forever.
Suppose you find a strategy such that X(t) is a supermartingale; that is, E[X(t’) | I_t] >= X_t for all t’ > t, where I_t is your information at time t. [ETA: In discrete time, this is equivalent to E[X(t+1) | I_t] >= X_t, i.e., in expectation you have at least as many resources in the next round as you have in this round.] Now clearly we have E[X(t’) | I_t] <= P[X(t’) > 0 | I_t] (since X(t’) <= 1), and therefore P[X(t’) > 0 | I_t] >= X_t. Therefore, given your information at time t, the probability that your resources will never fall to zero is at least X_t (this follows from the above by using the assumption that if they ever fall to 0, then they stay at 0). So if you start with a large share of the resources, there’s a large probability that you’ll never run out.
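As a quick numerical illustration (a toy sketch of my own, not part of the argument above): suppose X(t) does a symmetric random walk on [0, 1] with both 0 and 1 absorbing, so that E[X(t+1) | I_t] >= X(t) holds (with equality). The fraction of simulated runs that never hit 0 should then come out at roughly the starting value, consistent with the bound.

```python
import random

def run_path(k0, n=100, horizon=1_000_000):
    """X(t) = k/n takes symmetric +/- 1/n steps, with 0 and n absorbing.
    Returns True if the path never hits 0 within the horizon."""
    k = k0
    for _ in range(horizon):
        if k == 0:
            return False  # resources gone; they stay at 0 forever
        if k == n:
            return True   # you control everything and can never lose it
        k += 1 if random.random() < 0.5 else -1
    return k > 0

k0, n, trials = 70, 100, 2000   # start with X(0) = 0.7
survived = sum(run_path(k0, n) for _ in range(trials))
print(f"survival frequency {survived / trials:.3f}, bound predicts >= {k0 / n}")
```

In this particular walk the bound is tight (the survival probability is exactly X(0)); a strategy that gains resources in expectation would typically do strictly better.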
The link to AI is that we replace “share of resources” with some “quality” parameter describing the AI. I don’t know whether Jacob has ideas about what such a parameter might be, but it would be one such that there is a catastrophe iff it falls to 0.
The problem with all of this is that it sounds mostly like a restatement of “we don’t want there to be an independent failure probability on each step; we want there to be a positive probability that there is never a failure”. The martingale condition is a bit more specific than that, but it doesn’t tell us how to make that happen. So, unless I’m completely mistaken about what Jacob intended to say (possible), it seems more like a different description of the problem than a solution to it...
Thank you, Benja, for the very nice explanation! (As a technical point, what you are describing is a “submartingale”; a supermartingale has the inequality going in the opposite direction, and then of course you have to make 1 = failure and 0 = success instead of the other way around.)
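For reference, the two conditions in discrete time (writing I_t for the information available at time t):

```latex
\text{submartingale:}\quad   \mathbb{E}[X(t+1) \mid I_t] \ge X(t),
\qquad
\text{supermartingale:}\quad \mathbb{E}[X(t+1) \mid I_t] \le X(t).
```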
Martingales may in some sense “just” be a rephrasing of the problem, but I think that’s quite important! In particular, they implicitly come with a framework of thought that suggests possible approaches. For instance, one could imagine a criterion for action in which risks must always be balanced by the expectation of acquiring new information that will decrease future risks. We can then imagine writing down a potential function encapsulating both risk to humanity and information about the world / humanity’s desires, and having as a criterion of action that this potential function never increase in expectation (relative to, e.g., some subjective probability distribution that we have reason to believe is well-calibrated).
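To make that criterion concrete, here is a hypothetical sketch (the names Outcome, potential, and admissible, and the particular weighting, are my own illustration, not something specified above): an action is admissible only if the potential function, combining risk and remaining uncertainty, does not increase in expectation under our subjective distribution.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    prob: float         # subjective probability of this outcome
    risk: float         # contribution to "risk to humanity"
    uncertainty: float  # remaining uncertainty about the world / humanity's desires

def potential(o, risk_weight=1.0, info_weight=1.0):
    # Higher risk and more remaining uncertainty both count against us.
    return risk_weight * o.risk + info_weight * o.uncertainty

def admissible(current, action_outcomes):
    # Supermartingale-style criterion: the potential must not increase in
    # expectation (relative to our subjective distribution over outcomes).
    expected = sum(o.prob * potential(o) for o in action_outcomes)
    return expected <= potential(current)

# Example: a risky experiment is allowed only because the information it is
# expected to yield compensates for the extra risk it incurs.
now = Outcome(prob=1.0, risk=0.10, uncertainty=0.50)
experiment = [Outcome(0.9, 0.10, 0.30), Outcome(0.1, 0.40, 0.30)]
print(admissible(now, experiment))  # True: expected potential 0.43 <= 0.60
```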