Seconding jimrandomh: you seem to be talking about issues that don’t matter to decision theory very much. Let me reframe.
My own interest in the topic was sparked by Eliezer’s remark about “AIs that know each other’s source code”. As far as I understand, his interest in decision theory isn’t purely academic, it’s supposed to be applied to building an AI. So the simplest possible approach is to try “solving decision theory” for deterministic programs that are dropped into various weird setups. It’s not even necessary to explicitly disallow randomization: the predictor can give you a pony if it can prove you cooperate, and no pony otherwise. This way it’s in your interest in some situations to be provably cooperative.
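To make that concrete, here is a toy sketch of my own (nothing from the paper, and every name in it — predict, act, the agent strings — is made up for illustration): agents are deterministic Python source strings, and the predictor’s “proof method” is deliberately weak, so only cooperation it can actually verify earns a pony. Anything it can’t analyse counts as unproven.

```python
import ast

def provably_cooperates(agent_source: str) -> bool:
    """True only if the predictor can verify that act() returns 'cooperate'.

    The check is purely syntactic: the body of act() must be a single
    `return 'cooperate'`. Anything the predictor can't analyse counts as
    unproven, so randomizing or obfuscated agents get no pony by default.
    """
    try:
        tree = ast.parse(agent_source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "act":
            body = node.body
            return (
                len(body) == 1
                and isinstance(body[0], ast.Return)
                and isinstance(body[0].value, ast.Constant)
                and body[0].value.value == "cooperate"
            )
    return False

def predict(agent_source: str) -> str:
    """Give a pony iff cooperation is provable; no pony otherwise."""
    return "pony" if provably_cooperates(agent_source) else "no pony"

transparent_agent = "def act():\n    return 'cooperate'\n"
randomizing_agent = (
    "import random\n"
    "def act():\n"
    "    return random.choice(['cooperate', 'defect'])\n"
)

print(predict(transparent_agent))   # pony: cooperation is easy to prove
print(predict(randomizing_agent))   # no pony: cooperation can't be proven
```

The randomizing agent may well cooperate when run, but since the predictor can’t prove it, it gets nothing — which is exactly why transparency pays in this setup.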
Now, if you’re an AI that can modify your own source code, you will self-modify to become “provably cooperative” in precisely those situations where the payoff structure makes it beneficial. (And correspondingly “credibly threatening” in those situations that call for credible threats, I guess.) Classifying such situations, and mechanical ways of reasoning about them, is the whole point of our decision theory studies. Of course no one can prohibit you from randomizing in adversarial situations, e.g. if you assign a higher utility to proving Omega wrong than to getting a pony.
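And a correspondingly tiny sketch of the self-modification step, again purely illustrative, with payoff numbers and function names I’ve invented for the example: the agent swaps in a transparently cooperative version of its own source exactly when the reward for being provable outweighs what defection earns.

```python
def choose_source(pony_value: float, defection_value: float,
                  current_source: str) -> str:
    """Return the source this agent commits to running.

    If the reward for verifiable cooperation exceeds what defection earns,
    the agent "self-modifies" by swapping in a version of itself so simple
    that the predictor can prove it cooperates; otherwise it keeps its
    original, opaque source.
    """
    provably_cooperative_source = "def act():\n    return 'cooperate'\n"
    if pony_value > defection_value:
        return provably_cooperative_source
    return current_source

opaque_source = (
    "def act():\n"
    "    # complicated reasoning the predictor can't follow\n"
    "    return 'defect'\n"
)

# Facing a predictor that pays 100 for provable cooperation versus 1 for
# defecting, the agent commits to the transparent version of itself.
print(choose_source(pony_value=100, defection_value=1,
                    current_source=opaque_source))
```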
I definitely appreciate your and jimrandomh’s comments. I am rereading Eliezer’s paper in light of them and clearly getting more onto the “decision theory” page as I go.
“Provably cooperative” seems problematic, though maybe it isn’t; as a concept it is certainly useful. But is there any way to PROVE that the AI is actually running the code she shows you? I suspect there probably isn’t.
Also, where I was coming from with my comments may be a misunderstanding of what Eliezer was doing with Newcomb, but it may not. At least in other posts, if not in this paper, he has said that “rational means winning” and that a self-modifying AI would modify itself to be provably precommitted to taking box B in Newcomb’s problem. What I have in mind are two problems, one of which Eliezer touches on and one of which he doesn’t.
First, the one he touches on: if the Alien is simply rewarding people for being irrational, then it’s not clear we want an AI to self-modify to win Newcomb’s problem. Granted, if an all-powerful alien threatens humanity’s existence unless humanity worships him, maybe we do want an AI to abandon its rationality for that, but I’m not sure, and what you have there is “assuming God comes along and tells us all to toe the line or go to hell, what does decision theory tell us to do?” Well, the main issue there might be being actually sure that it is God who has come along and not just the man behind the curtain, i.e. a trickster who has your dopey AI thinking it is God and abandoning its rationality, i.e. being hijacked by trickery.
The second issue is this: there must be some very high level of reliability required when you are contemplating action predicated on very unlikely hypotheses. If our friendly self-modifying AI sees 1000 instances of an Alien providing Newcomb’s boxes (and 1000 is the number in Eliezer’s paper), I don’t want it concluding that 1000 equals certainty, because it doesn’t, especially in a complex world where even finite humans using last century’s technologies can trick the crap out of other humans. If a self-modifying friendly AI sees something come along which appears to violate physics in order to present an apparent causal paradox, one laden with the emotional pull of a million dollars or a cure for your daughter’s cancer, then the last thing I want that AI to do is to modify itself BEFORE it properly estimates the probability that the Alien is actually no smarter than Siegfried and Roy.
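To put rough numbers on that point (numbers of my own choosing, not anything from the paper): a quick Bayes’ rule sketch shows that if a sufficiently resourceful trickster could also rig a perfect 1000-trial record, those 1000 observations barely move the posterior away from the prior.

```python
def posterior_genuine(prior_genuine: float,
                      p_record_given_genuine: float,
                      p_record_given_trickster: float) -> float:
    """P(genuine predictor | perfect 1000-trial record) via Bayes' rule."""
    prior_trickster = 1.0 - prior_genuine
    numerator = p_record_given_genuine * prior_genuine
    return numerator / (numerator + p_record_given_trickster * prior_trickster)

# If a trickster could fake the track record almost as reliably as a genuine
# predictor would produce it, the 1000 observations tell you very little:
print(posterior_genuine(prior_genuine=0.01,
                        p_record_given_genuine=1.0,
                        p_record_given_trickster=0.9))   # ~0.011
```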
It’s not conceivable to me that resistance to getting tricked, and properly understanding the weight of evidence, especially when that evidence may be provided by an Alien even smarter and better resourced than Siegfried and Roy, is NOT part of decision theory. Maybe it is just not the part Eliezer wants to discuss here.
In any case, I am rereading Eliezer’s paper and will know more about decision theory before my next comment. Thank you for your comments in that regard; I find I move through Eliezer’s paper more fluidly now after reading them.
is there any way to PROVE that the AI is actually running the code she shows you?
Nope; certainty is impossible to come by in worlds that contain a sufficiently powerful deceiver. That said, compiling the code she shows you on a different machine and having her shut herself down would be relatively compelling evidence in similar cases that don’t posit an arbitrarily powerful deceiver.