In the case of something like amplification or debate, I think the bet that you’re making is that language modeling alone is sufficient to get you everything you need in a competitive way.
I’m skeptical of language modeling being enough to be competitive, in the sense of maximizing “log prob of some naturally occurring data or human demonstrations.” I don’t have a strong view about whether you can get away with using only language data rather than e.g. taking images as input and producing motor torques as output.
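To be concrete about the objective I mean, here is a sketch of the standard autoregressive loss, with $D$ standing in for whatever “naturally occurring data or human demonstrations” turn out to be:

$$\mathcal{L}_{\text{LM}}(\theta) = -\,\mathbb{E}_{x \sim D}\Big[\sum_{t} \log p_\theta(x_t \mid x_{<t})\Big]$$

“Language modeling being enough” is then the claim that minimizing this alone yields a competitive model.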
I’m also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What’s so bad if we use non-language data?
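Concretely, the kind of joint objective I have in mind is just a weighted combination along the lines of

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{LM}}(\theta) + \lambda\,\mathcal{L}_{\text{other}}(\theta)$$

where $\mathcal{L}_{\text{other}}$ and $\lambda$ are placeholders for whatever additional objectives and weighting we need, and nothing about this form requires the underlying model or data to be purely linguistic.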
I’m skeptical of language modeling being enough to be competitive, in the sense of maximizing “log prob of some naturally occurring data or human demonstrations.” I don’t have a strong view about whether you can get away with using only language data rather than e.g. taking images as input and producing motor torques as output.
I agree with this, though I still feel like some sort of active learning approach might be good enough without needing to add a full-blown RL objective.
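As a rough sketch of the kind of loop I mean (the names here, `model`, `overseer_label`, `predict_proba`, are hypothetical stand-ins, not any particular library’s API): query the overseer only on the inputs the model is most uncertain about, then do an ordinary supervised update on those labels, so no reward-maximization objective ever enters.

```python
import numpy as np

def entropy(probs):
    """Predictive entropy as a simple uncertainty score."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

def active_learning_round(model, unlabeled_pool, overseer_label, budget=64):
    """One round of uncertainty-based active learning.

    `model` is assumed to expose scikit-learn-style `predict_proba` and
    `fit`; `overseer_label` stands in for querying a human or HCH-like
    overseer. The update is plain supervised learning on the queried
    labels, not reward maximization.
    """
    probs = model.predict_proba(unlabeled_pool)       # (N, num_classes)
    scores = entropy(probs)
    query_idx = np.argsort(-scores)[:budget]          # most uncertain inputs
    queries = [unlabeled_pool[i] for i in query_idx]
    labels = [overseer_label(x) for x in queries]
    model.fit(queries, labels)                        # supervised update only
    return model
```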
I’m also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What’s so bad if we use non-language data?
My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. where you know that HCH is precisely the thing for which loss is zero). That being said, it does seem obviously fine to have your language data contain other types of data (e.g. images).
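One way to spell out what I mean, treating HCH as a fixed distribution over answers to each question (an assumption made just for the sketch): train the model $M_\theta$ purely to imitate HCH,

$$\mathcal{L}(\theta) = \mathbb{E}_{q \sim D}\Big[ D_{\mathrm{KL}}\big( \mathrm{HCH}(\cdot \mid q) \,\big\|\, M_\theta(\cdot \mid q) \big) \Big]$$

which is zero exactly when $M_\theta$ matches HCH on the support of $D$. So we know in advance what the zero-loss model is, rather than only knowing that some opaque policy happens to score well.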
My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. where you know that HCH is precisely the thing for which loss is zero).
I’d be happy to read more about this line of thought. (For example, does “loss function” here refer to an objective function that includes a regularization term? If not, what might we assume about the theoretical optimum that amounts to a safety benefit?)
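(To make the contrast I’m asking about concrete: with a pure imitation loss like the one above, the loss is zero exactly when the model matches HCH, but with something like

$$\mathcal{L}_{\text{reg}}(\theta) = \mathbb{E}_{q \sim D}\Big[ D_{\mathrm{KL}}\big( \mathrm{HCH}(\cdot \mid q) \,\big\|\, M_\theta(\cdot \mid q) \big) \Big] + \lambda \lVert \theta \rVert^2$$

the minimizer is in general no longer exactly HCH, so I’d want to know how the claimed safety property is stated in that case.)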
Thanks btw, I’m learning a lot from these replies. Are you thinking of training something agenty, or is the hope to train something that isn’t agenty?