For 2, I presume you’d have full access to the unmodified source code, even as the AI was running simultaneously.
Sure, but that doesn’t address my concern. I was not drawing a distinction between analyzing source and executable to verify security (I even said as much explicitly), I was drawing a distinction between analyzing the end product to verify security and developing the process that will lead to its development with security as a primary consideration. Source code is far from being the only process involved.
I’m not too concerned about the rational agent case. If we have a fully rational agent whose values I endorse, the Friendliness problem has either been solved or turns out to be irrelevant. But to answer your question, I imagine it depends a lot on how much information the AI has about me, and how much information I have about how much information the AI has about me. So I’d say “yes” and “yes,” and whether I share your conviction in a particular case depends on how much information I have about the AI.
It’s just a way to pin down the problem. If we can show that the AI in a box could misinform an idealized rational agent via selective evidence, then we know it can do so to us. If it can’t misinform the idealized agent, then there exists some method by which we can resist it.
Also, I don’t think idealized rational agents can actually exist anyway. All riddles involving them are for the sake of narrowing down some other problem.