1 is conceded (see edit), for humans at least and possibly for all bounded-rational agents. For 2, I presume you’d have full access to the unmodified source code, even as the AI was running simultaneously.
1 for rational agents is an interesting question, though... I think it’s true that additional cherry-picked information could harm rational agents in general, since they’ll predictably act according to the most parsimonious model that fits the evidence. It would be a case of bad epistemic luck, sort of like how Santa Claus is a parsimonious hypothesis if your main source of evidence is your parents’ word and your parents are usually trustworthy.
But can additional information harm rational agents even in this case?
Hmm... let’s suppose for a moment that you’re a rational agent, unaffected by heuristics and biases. You’re interested in knowing whether or not the AI in the box is friendly.
You see part of the source code (a rational agent would deduce the answer from seeing all of it, so we show you only part of it to simulate the way a human might not fully understand the implications). That provides you with some bits of evidence.
If you run the source code, you’ll see a few more bits of evidence, but you know that those bits of evidence are handpicked by the AI.
Rationalists should win. Can these additional bits of evidence, which are clearly marked as tainted, still hurt you? Could they also be useful?
If you have the choice of whether or not to see additional information, and you are a rational agent, my intuition is that you should always say yes. The AI shouldn’t be able to hurt a rational agent under these conditions. My conviction feels pretty strong, but I don’t have a convincing argument, and at the moment I should go do other things...
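To make that intuition more concrete, here is a toy calculation. It is entirely my own construction: the friendly/unfriendly setup, the ten-candidate selection model, and every number in it are illustrative assumptions, not part of the scenario above. It compares an observer who treats the AI’s hand-picked demonstration as a random sample with one who conditions on the fact that the demonstration was selected to look good.

```python
# Toy sketch: how much does one hand-picked "good-looking" demonstration
# tell a Bayesian observer about whether a boxed AI is friendly?
# All numbers are made up for illustration.

P_FRIENDLY = 0.5             # prior probability the AI is friendly
N_CANDIDATES = 10            # demonstrations the AI can choose among
P_GOOD_IF_FRIENDLY = 0.9     # chance any single demo looks good if friendly
P_GOOD_IF_UNFRIENDLY = 0.3   # chance any single demo looks good if unfriendly

def posterior_friendly(p_good_friendly: float, p_good_unfriendly: float) -> float:
    """Bayes' rule: P(friendly | the demo we were shown looked good)."""
    num = p_good_friendly * P_FRIENDLY
    return num / (num + p_good_unfriendly * (1 - P_FRIENDLY))

# Naive observer: treats the shown demo as if it were randomly sampled.
naive = posterior_friendly(P_GOOD_IF_FRIENDLY, P_GOOD_IF_UNFRIENDLY)

# Selection-aware observer: the AI shows its *best* demo, so the relevant
# likelihood is "at least one of the N candidate demos looks good".
best_good_friendly = 1 - (1 - P_GOOD_IF_FRIENDLY) ** N_CANDIDATES
best_good_unfriendly = 1 - (1 - P_GOOD_IF_UNFRIENDLY) ** N_CANDIDATES
aware = posterior_friendly(best_good_friendly, best_good_unfriendly)

print(f"P(friendly | good demo), ignoring selection:  {naive:.3f}")   # ~0.75
print(f"P(friendly | good demo), modelling selection: {aware:.3f}")   # ~0.51
```

Under these made-up numbers, the observer who models the cherry-picking barely moves from the prior: the tainted bits are nearly useless, but they also can’t be used to predictably push that observer toward the wrong answer.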
Edit
The AI shouldn’t be able to hurt a rational agent under these conditions.
Okay, maybe that’s wrong.
Maybe I should make a weaker statement, like: “A universe in which the AI is able to trick a rational agent with additional information is less parsimonious than a universe in which it is not, and therefore the rational agent should ask for additional information.”
Which is basically a special case of:
The phrase “These observations are probably going to cause me bad epistemic luck” is a contradiction. Bad epistemic luck is, by definition, an improbable hypothesis.
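Written out (this is just the standard conservation-of-expected-evidence identity in my own notation, not a quote from anyone above): for a hypothesis H and evidence E you are about to receive,

```latex
% Conservation of expected evidence: the expected posterior equals the prior,
% so you cannot coherently predict that an observation will mislead you.
\[
  \mathbb{E}_{E}\!\left[ P(H \mid E) \right]
  = \sum_{e} P(E = e)\, P(H \mid E = e)
  = \sum_{e} P(H,\, E = e)
  = P(H).
\]
```

Expecting “bad epistemic luck” from an observation would mean expecting your posterior to move in a predictable wrong direction, which the identity rules out for an agent that updates on the whole process, cherry-picking included.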
For 2, I presume you’d have full access to the unmodified source code, even as the AI was running simultaneously.
Sure, but that doesn’t address my concern. I wasn’t drawing a distinction between analyzing source code and analyzing an executable to verify security (I even said as much explicitly); I was drawing a distinction between analyzing the end product to verify security and shaping the development process itself with security as a primary consideration. Source code is far from the only part of that process.
I’m not too concerned about the rational agent case. If we have a fully rational agent whose values I endorse, the Friendliness problem has either been solved or turns out to be irrelevant. But to answer your question, I imagine it depends a lot on how much information the AI has about me, and how much information I have about how much information the AI has about me. So I’d say “yes” and “yes,” and whether I share your conviction in a particular case depends on how much information I have about the AI.
I’m not too concerned about the rational agent case. If we have a fully rational agent whose values I endorse, the Friendliness problem has either been solved or turns out to be irrelevant.
It’s just a way to pin down the problem. If we can show that the AI in a box could misinform an idealized rational agent via selective evidence, then we know it can do so to us. If it can’t misinform the idealized agent, then there exists some method by which we can resist it.
Also, I don’t think idealized rational agents can actually exist anyway. All riddles involving them are for the sake of narrowing down some other problem.