Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful, with no easy way for me to see a difference between what I’m provided and what I would be provided if it were aligned.
Is this really true? I would guess that we might be so far from solving alignment that nothing would look particularly helpful? Or, even worse, the only thing that would look helpful is something completely wrong?
Thanks for commenting :)
> I would guess that we might be so far from solving alignment that nothing would look particularly helpful?
My thinking is that reinforcement-learning-like methods will select for systems that look like they are aligned / optimized for what we are trying to optimize them for. If the system gives answers/solutions/etc. where humans can see that it isn’t really optimizing well for what we want, then I presume it would be tweaked further until that no longer was the case. For example, suppose we get it to write code for us, and we use how easy the code is for humans to read/understand as an optimization criterion, and it returns code with clear examples of things that could be improved. Well, then it would presumably be tweaked further (and returning such code would presumably be a bad strategy for it if it was trying to trick us into thinking it was aligned).
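To make that selection pressure concrete, here is a minimal, purely illustrative sketch. Everything in it (`human_approval_score`, `generate_code`, `tweak`) is a hypothetical stand-in rather than anything from a real training pipeline; the point is just that the loop only ever optimizes how good the output *looks* to the evaluator.

```python
import random

def human_approval_score(code: str) -> float:
    """Hypothetical stand-in for a human (or human-trained reward model)
    judging how readable/helpful a piece of code looks. Toy proxy only."""
    return min(1.0, len(set(code.split())) / 50.0)

def generate_code(params: list[float], prompt: str) -> str:
    """Hypothetical stand-in for the system producing code from a prompt."""
    return f"# candidate output for: {prompt} ({sum(params):.3f})"

def tweak(params: list[float]) -> list[float]:
    """Hypothetical stand-in for one update step (gradient step, mutation, ...)."""
    return [p + random.gauss(0.0, 0.01) for p in params]

# Selection loop: keep whichever tweak scores best in the evaluator's eyes.
# Nothing here measures whether the system *is* aligned; it only measures
# whether its output looks good to the judge, which is the worry above.
params, best = [0.0] * 8, float("-inf")
for _ in range(1000):
    candidate = tweak(params)
    score = human_approval_score(generate_code(candidate, "aligned AGI source code"))
    if score > best:
        params, best = candidate, score
```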
That being said, it is possible that a system could pretend to be less capable than it really is. That way it could return answers that didn’t look particularly helpful. Which I guess sort of makes the answer to your question “yes”. That is, there could be scenarios with AGI/superintelligence where it is “pretending” not to be an AGI/superintelligence. And there could be other scenarios where it isn’t hiding that it’s very capable, but where it does pretend to be less capable than it really is at certain types of things. But there would presumably be incentives for it not to do that to such a degree that it became easy for humans to notice that it isn’t doing its best. (Btw, I am consciously leaving out a few types of conceivable scenarios so as to make this comment less jumbled.)
Most of this series will focus on scenarios where we know that the system in question is superintelligent, and where we had the foresight to “box” it in (making it hard for it to hack its way out through security vulnerabilities in the operating system and that sort of thing) well before it acquired anything resembling superintelligence (or anything that plausibly might self-improve, in a self-reinforcing feedback loop, towards anything resembling superintelligence).
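As a rough (and very leaky) illustration of what I mean by “boxing”, one could imagine only ever running the system’s code under hard resource limits and without ambient privileges. The sketch below uses only the Python standard library and is an assumption-laden toy, not a containment scheme; real boxing would also need things like no network access, a minimal syscall surface, and isolation at the OS/hardware level.

```python
import resource
import subprocess

def run_boxed(cmd: list[str], cpu_seconds: int = 60, mem_bytes: int = 2**30):
    """Run an untrusted command under hard resource limits (Unix only).
    Toy illustration of 'boxing'; it does not restrict network access,
    syscalls, or anything else an actual containment setup would need."""
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))  # CPU-time cap
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))       # memory cap
        resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))                 # few open files
    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,  # applied in the child process before exec
        capture_output=True,
        timeout=cpu_seconds * 2,  # wall-clock backstop
    )

# Example (hypothetical program name):
# result = run_boxed(["python3", "candidate_program.py"])
```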
> Or, even worse, the only thing that would look helpful is something completely wrong?
Yes, that is a major worry. And it is one of the main concerns/problems/difficulties that I’m hoping to address in this series (not the only one, but one of the major ones). That is to say, I want to outline strategies/techniques intended to make it as hard and as unlikely as possible for the AI system to trick us into thinking it is being helpful (giving us what we want) when it really isn’t.