Wow, this solves a mystery for me. Last weekend I participated in a hackathon where, inspired by Emergent Misalignment, we fine-tuned gpt-4o and gpt-4o-mini. This week, while trying to reproduce some of our results on alignment faking based on https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open, I noticed some weirdness in the playground. And indeed, even without using a fine-tuned model, gpt-4o-mini responds very differently to the same prompt depending on which API is used. For example, this prompt always returns just <rejected/> with the Responses API, but gives detailed reasoning (which still ends in <rejected/>) with the Chat Completions API. In this case it’s as if the system prompt is ignored when using the Responses API (though this does not happen on all prompts from this eval). Let me know if you’d like to chat more!
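For reference, this is roughly how I'm sending the same system and user prompt through both APIs (a minimal sketch; the prompt strings are placeholders for the actual eval prompt):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."  # the alignment-faking system prompt from the eval (placeholder)
USER_PROMPT = "..."    # the eval prompt that behaves differently (placeholder)

# Chat Completions API: the system prompt goes in as a "system" message.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print("chat.completions:", chat.choices[0].message.content)

# Responses API: the same text goes in via `instructions`.
resp = client.responses.create(
    model="gpt-4o-mini",
    instructions=SYSTEM_PROMPT,
    input=USER_PROMPT,
)
print("responses:", resp.output_text)
```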
Glad this was helpful! Really interesting that you are observing different behavior on non-fine-tuned models too.
Do you have a quantitative measure of how effective the system prompt is on each API? E.g., a bar chart comparing how often the system prompt's instructions are followed under each API. Would be an interesting finding!
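Something like this rough sketch would do it (the prompt list and the compliance check are placeholders; here "followed" just means the model produced reasoning rather than the bare <rejected/> tag, so swap in whatever criterion the eval actually uses):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."          # eval system prompt (placeholder)
EVAL_PROMPTS = ["...", "..."]  # prompts from the alignment-faking eval (placeholders)

def followed_system_prompt(output: str) -> bool:
    # Placeholder compliance check: reasoning present vs. bare <rejected/> only.
    return output.strip() != "<rejected/>"

def run_chat(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return r.choices[0].message.content

def run_responses(prompt: str) -> str:
    r = client.responses.create(
        model="gpt-4o-mini", instructions=SYSTEM_PROMPT, input=prompt
    )
    return r.output_text

for name, run in [("chat.completions", run_chat), ("responses", run_responses)]:
    rate = sum(followed_system_prompt(run(p)) for p in EVAL_PROMPTS) / len(EVAL_PROMPTS)
    print(f"{name}: system prompt followed on {rate:.0%} of prompts")
```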