Thanks for the example, but it still seems to me that this sort of thing won’t work for advanced AI.
What sort of thing? I didn’t mean to propose any particular strategy for dealing with deception, I just meant to say that now OpenAI has 1) a reason to figure out deception and 2) a concrete instance of it that they can reason about and experiment with and which might help them better understand exactly what’s going on with it.
More generally, the whole possibility that I was describing was that it might be impossible for us to currently figure out the right strategy since we are missing some crucial piece of understanding. If I could give you an example of some particularly plausible-sounding strategy, then that strategy wouldn’t have been impossible to figure out with our current understanding, and I’d be undermining my whole argument. :-)
Rather, my example was meant to demonstrate that it has already happened that:
1) Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, which can be used to deepen our understanding of it and to develop new ideas for dealing with it.
2) Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, narrowing the gap between “capabilities research” and “alignment research”.
Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well (even if a lot of it won’t).
It’s not that I’d put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. I-GPT is just an instance of something that seems likely to help us understand deception a little better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights into various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.
I meant to argue against your meta-strategy, which relies on obtaining relevant understanding about deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn’t expect that understanding to come very close to being sufficient for making AI go well (see my previous argument), and hence that this isn’t a very promising meta-strategy.
I read your previous comment as suggesting that the improved understanding would mainly be used for pursuing a specific strategy for dealing with deception, namely “to learn the properties of what looks like deception to humans, and instill those properties into a loss function”. And it seemed to me that the problem you raised was specific to that particular strategy for dealing with deception, as opposed to something else that we might come up with?