it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.
Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don’t understand, and whose behavior and properties you therefore can’t verify to be acceptable. It’s not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).
It’s also possible that this is about takeoff speeds, and that you think it’s plausible that e.g. we can disincentivize deception by punishing the negative consequences it entails (under FOOM we can’t, since we’d be dead).
Or maybe once our understanding of intelligent computation in general improves, it will also give us the tools for better identifying deceptive computation.
E.g. language models are already “deceptive” in a sense: asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I shared that, multiple people pointed out that its answers sound like those of a student who’s taking an exam, is asked to write an essay about a topic they know nothing about, and tries to fake their way through anyway (that is, a student trying to deceive the examiner). Thus, even if you are doing pure capabilities research and just want your AI system to give people accurate answers, it is already the case that you can see a system like InstructGPT “trying to deceive” people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so you already have an incentive, as a capabilities researcher, to work on identifying and stopping such “deceptive” computations.
This means that the existence of InstructGPT gives you both 1) a concrete financial incentive to do research on identifying and stopping deceptive computation and 2) a real system that actually carries out something like deceptive computation, which you can experiment on and whose behavior you can use in trying to understand the phenomenon better. That second point wouldn’t have been the case before our capabilities got to this point. And it might allow us to figure out something we wouldn’t have thought of before we had a system with this capability level to tinker with.
[ETA: I’m not that sure of the below argument]
Thanks for the example, but it still seems to me that this sort of thing won’t work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]
Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, then it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding what humans would be able to notice as deception, rather than telling you what the truth actually seems to be (given what it knows). The reason is that, as AI develops, programs that are capable of the former thing have constant complexity, but programs that are capable of the latter thing have complexity that grows with the complexity of the AI’s models of the world, and so you should expect that the former is favored by SGD. See this part of the ELK document for a more detailed description of this failure mode.
What sort of thing? I didn’t mean to propose any particular strategy for dealing with deception; I just meant to say that OpenAI now has 1) a reason to figure out deception and 2) a concrete instance of it that they can reason about and experiment with, and which might help them better understand exactly what’s going on.
More generally, the whole possibility that I was describing was that it might be impossible for us to currently figure out the right strategy since we are missing some crucial piece of understanding. If I could give you an example of some particularly plausible-sounding strategy, then that strategy wouldn’t have been impossible to figure out with our current understanding, and I’d be undermining my whole argument. :-)
Rather, my example was meant to demonstrate that both of the following have already happened:
1) Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, which can be used to develop our understanding of it and to develop new ideas for dealing with it.
2) Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between “capabilities research” and “alignment research”.
Thus, our understanding and awareness of deception are likely to improve as we get closer to AGI. By that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well (even if a lot of it won’t).
It’s not that I’d put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. InstructGPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights into various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.
I mean to argue against your meta-strategy, which relies on obtaining relevant understanding about deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn’t expect that understanding to be very close to sufficient for making AI go well (see my previous argument), and hence the meta-strategy doesn’t seem very promising.
I read your previous comment as suggesting that the improved understanding would mainly be used for pursuing a specific strategy for dealing with deception, namely “to learn the properties of what looks like deception to humans, and instill those properties into a loss function”. And it seemed to me that the problem you raised was specific to that particular strategy for dealing with deception, as opposed to something else that we might come up with?