As I argued here, I think GPT-3 is more likely to be aligned than whatever we might do with CIRL/IDA/Debate ATM, since it is trained with (self)-supervised learning and gradient descent.
The main reason such a system could pose an x-risk by itself seems to be mesa-optimization, so studying mesa-optimization in the context of such systems is a priority (esp. since GPT-3’s 0-shot learning looks like mesa-optimization).
In my mind, things like IDA become relevant when we start worrying about remaining competitive with agent-y systems built using self-supervised learning systems as a component, but actually come with a safety cost relative to SGD-based self-supervised learning.
This is less the case when we think about them as methods for increasing interpretability, as opposed to increasing capabilities (which is how I’ve mostly seen them framed recently, a la the complexity theory analogies).
I’m still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I’m still thinking about if/how that could result in dangerous mesa-optimization.
esp. since GPT-3’s 0-shot learning looks like mesa-optimization
Could you provide more details on this?
Sometimes people give GPT-3 a prompt with a few example inputs along with the sorts of responses they’d like to see for those inputs (“few-shot learning”, right? I don’t know what 0-shot learning you’re referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?
If that’s what you’re saying… That seems unlikely to me. GPT-3 is essentially a stack of 96 transformer layers, right? So if it was doing something like gradient descent internally, how many consecutive iterations would it be capable of doing? It seems more likely to me that GPT-3 simply learns sufficiently rich internal representations: when the input/output examples are within its context window, it picks up on their input/output structure, and its conception of that structure is sophisticated enough that the word scoring highest on next-word prediction is one that comports with the structure.
96 transformer layers would appear to offer a very limited budget for any kind of serial computation, but there’s a lot of parallel computation going on there, and there are non-gradient-descent optimization algorithms, genetic algorithms say, that can be parallelized. I guess the query matrix could be used to implement some kind of fitness function? It would be interesting to try some kind of layer-wise pretraining on transformer blocks, training them to compute steps of a parallelizable optimization algorithm (probably you’d want a deterministic, parallelizable algorithm rather than a stochastic one like a genetic algorithm); a rough sketch of what this could look like is below. Then you could look at the resulting network and, based on it, try to figure out what the telltale signs of a mesa-optimizer are (since this network is almost certainly implementing a mesa-optimizer).
Still, my impression is you need 1000+ generations to get interesting results with genetic algorithms, which seems like a lot of serial computation relative to GPT-3’s budget...
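To make that experiment concrete, here's a minimal sketch of the layer-wise pretraining idea, assuming we use a single gradient-descent step on random quadratics as the deterministic, parallelizable algorithm. Everything here (problem family, shapes, hyperparameters) is a hypothetical choice for illustration, not anything taken from GPT-3:

```python
# Hypothetical setup: train one transformer block (plus linear read-in/read-out)
# to imitate a single gradient-descent step on random quadratics
# f(x) = 0.5 * x^T A x - b^T x, whose exact step is x' = x - lr * (A x - b).
import torch
import torch.nn as nn

n, d_model, lr_inner = 8, 16, 0.1   # problem dimension, block width, step size of the target algorithm

read_in = nn.Linear(n + 2, d_model)
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True)
read_out = nn.Linear(d_model, 1)
opt = torch.optim.Adam([*read_in.parameters(), *block.parameters(), *read_out.parameters()], lr=1e-3)

for step in range(2000):
    B = 64
    M = torch.randn(B, n, n)
    A = M @ M.transpose(1, 2) / n + torch.eye(n)   # random positive-definite quadratics
    b = torch.randn(B, n)
    x = torch.randn(B, n)
    target = x - lr_inner * (torch.einsum("bij,bj->bi", A, x) - b)   # one exact GD step

    # One token per coordinate, carrying x_i, b_i, and row i of A; attention has to
    # mix information across tokens to compute (A x)_i, which is the "parallel" part.
    tokens = torch.cat([x.unsqueeze(-1), b.unsqueeze(-1), A], dim=-1)   # (B, n, n + 2)
    pred = read_out(block(read_in(tokens))).squeeze(-1)                 # (B, n)

    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of supervising a block on a single optimizer step is that stacking k such blocks would give you k serial steps, which is exactly the limited-serial-budget picture above; once trained, you could inspect the block for whatever the telltale signs of a learned optimization step turn out to be.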
Sometimes people give GPT-3 a prompt with a few example inputs along with the sorts of responses they’d like to see for those inputs (“few-shot learning”, right? I don’t know what 0-shot learning you’re referring to.)
No, that’s zero-shot. Few-shot is when you train on those examples instead of just stuffing them into the context.
It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything it’s seen before, without any training, just based on the context (0-shot).
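To make the distinction concrete, here's a small sketch using GPT-2 via Hugging Face as a stand-in (GPT-3 itself isn't something we can run locally); the task, examples, and hyperparameters are made up for illustration:

```python
# GPT-2 (via Hugging Face) as a stand-in for GPT-3; the toy task is made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

demos = "Q: capital of France?\nA: Paris\nQ: capital of Japan?\nA: Tokyo\n"

# "Just stuffing them into the context": the demonstrations live in the prompt
# and the weights never change -- any adaptation happens inside the forward pass.
prompt = demos + "Q: capital of Italy?\nA:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))

# "Train on those": the same demonstrations become gradient updates to the weights.
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
batch = tok(demos, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()
```

In the first regime, whatever task-learning happens has to happen inside the forward pass, which is why it looks mesa-optimization-flavored; in the second, the adaptation is just ordinary gradient descent.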
Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?
By “training a model”, I assume you mean “an ML model” (as opposed to, e.g., a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line.
I’m not saying it’s doing SGD; I don’t know what it’s doing in order to solve these new tasks. But TBC, 96 steps of gradient descent could be a lot. MAML does meta-learning with just one inner gradient step.
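For reference, here's a toy sketch of the MAML pattern being alluded to, with a single inner gradient step per task, on made-up sine-wave regression tasks (all the specifics below are illustrative):

```python
# Toy MAML: 1-D sine-wave regression tasks, one inner gradient step per task.
import math
import torch

def net(params, x):
    # A tiny 2-layer MLP written functionally, so we can run it with adapted params.
    W1, b1, W2, b2 = params
    h = torch.tanh(x @ W1.t() + b1)
    return h @ W2.t() + b2

def sample_task(n=10):
    # One "task" = a random sine wave; returns (support, query) regression data.
    amp = torch.rand(1) * 4 + 0.1
    phase = torch.rand(1) * math.pi
    xs = torch.rand(2 * n, 1) * 10 - 5
    ys = amp * torch.sin(xs + phase)
    return (xs[:n], ys[:n]), (xs[n:], ys[n:])

params = [torch.randn(40, 1) * 0.5, torch.zeros(40), torch.randn(1, 40) * 0.5, torch.zeros(1)]
for p in params:
    p.requires_grad_(True)
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01

for meta_step in range(1000):
    meta_loss = 0.0
    for _ in range(4):   # a small meta-batch of tasks
        (xs, ys), (xq, yq) = sample_task()
        # The single inner step: adapt the shared initialization to this task...
        loss = ((net(params, xs) - ys) ** 2).mean()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # ...and meta-train the initialization so that one step is enough, by
        # backpropping through the inner update on held-out query data.
        meta_loss = meta_loss + ((net(adapted, xq) - yq) ** 2).mean()
    meta_opt.zero_grad(); meta_loss.backward(); meta_opt.step()
```

The relevance to the thread: if one explicit gradient step on top of a well-chosen initialization is enough for this kind of adaptation, then 96 layers' worth of forward computation doesn't obviously rule out something adaptation-like happening inside GPT-3.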
BTW with regard to “studying mesa-optimization in the context of such systems”, I just published this post: Why GPT wants to mesa-optimize & how we might change this.
Thanks!