I wouldn’t be that surprised if it required many more samples to fine-tune GPT-3.5 to play chess with weird formatting, since by default it doesn’t really “know” how to do it with other formats. But I’d be surprised if it took a huge amount of data, and I’m curious exactly how much fine-tuning is required.
I think the title oversells the results. Would you mind changing it to “When fine-tuning fails to elicit GPT-3.5’s chess abilities”? Given the current evidence, I think it’s wrong to state (1) that fine-tuning isn’t enough (maybe you just haven’t done enough of it) and (2) that this is a general phenomenon, as opposed to something very specific to GPT-3.5’s chess abilities. The latter seems likely given that GPT-3.5 was probably trained on a huge amount of chess in this exact notation, which might make it actually dumber when you use a different format, a little bit like GPT-4 being dumber when it has to write in base64.
A few questions about empirical details:
How big is N in best-of-N fine-tuning? Do you expect the same result with “gold samples” fine-tuning, or is a large fraction of the gap an exploration issue?
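To make the distinction I have in mind concrete, here is a toy sketch (`model.sample`, `oracle.best_move`, and `score_fn` are placeholders I made up, not your actual setup):

```python
def best_of_n_data(model, prompts, n, score_fn):
    """Best-of-N: sample n completions from the model itself and keep the
    highest-scoring one per prompt, so the data is limited by what the model
    already explores."""
    return [
        (p, max((model.sample(p) for _ in range(n)), key=score_fn))
        for p in prompts
    ]

def gold_data(oracle, prompts):
    """'Gold samples': demonstrations from a stronger source (e.g. Stockfish
    moves in the target format), independent of the model's own exploration."""
    return [(p, oracle.best_move(p)) for p in prompts]
```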
What are the SF L1 and SF L5 models?
How many samples did you use for fine-tuning? Did you do any hyperparameter search over the learning rate and number of epochs? (In our password-locked model experiments, fine-tuning is severely underpowered if you don’t fine-tune for enough epochs or if you use the wrong learning rate, though for the reasons described above I expect much worse sample efficiency here than in the password-locked model experiments.)
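For what it’s worth, the kind of sweep I’d run looks roughly like the sketch below (using the OpenAI fine-tuning API from the `openai` Python package; the file name and grid values are made up, and the exact hyperparameter names are worth double-checking against the current docs):

```python
from itertools import product
from openai import OpenAI

client = OpenAI()

# Placeholder training file of chess games in the target format.
train_file = client.files.create(
    file=open("chess_finetune.jsonl", "rb"), purpose="fine-tune"
)

jobs = []
for n_epochs, lr_mult in product([1, 3, 10], [0.5, 2.0, 8.0]):
    job = client.fine_tuning.jobs.create(
        model="gpt-3.5-turbo",
        training_file=train_file.id,
        hyperparameters={
            "n_epochs": n_epochs,
            "learning_rate_multiplier": lr_mult,
        },
    )
    jobs.append((n_epochs, lr_mult, job.id))
```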
Did you run a single fine-tuning run per setup? I’ve had some variability between fine-tuning runs in my steganography experiments, and I found that using multiple seeds was crucial to get meaningful results.
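I.e. something like this, where `run_finetune_and_eval` is a placeholder for whatever launches one fine-tuning run and returns its eval score (e.g. win rate against SF L1):

```python
import statistics

def eval_with_seeds(setup, seeds=(0, 1, 2)):
    # Repeat the same setup across seeds and report the spread,
    # rather than a single point estimate.
    scores = [run_finetune_and_eval(setup, seed=s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```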
I would also be curious to know what happens if you use full-weight fine-tuning instead of the very light LoRA that OpenAI is probably using, but I guess you can’t know that. Maybe if you trained a model from scratch on 50% chess / 50% OpenWebText you could run such experiments. It might be that a big reason fine-tuning is so weak here is that OpenAI is using a very underpowered LoRA adapter.
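If you did have an open-weight stand-in, the comparison I’m imagining is roughly the following (model name, LoRA rank, and target modules are placeholders I picked, not anything OpenAI has confirmed):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Variant A: full-weight fine-tuning (all parameters trainable).
full_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Variant B: a deliberately light LoRA adapter, as a proxy for what a
# fine-tuning API might be doing under the hood.
lora_config = LoraConfig(
    r=4,                        # small rank on purpose
    lora_alpha=8,
    lora_dropout=0.0,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained("gpt2"), lora_config
)
lora_model.print_trainable_parameters()

# Train both variants on the same chess data with the same training loop,
# then compare how much of the gap each one closes.
```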
Cool work!