Has anyone trained a model to, given a prompt-response pair and an alternate response, generate an alternate prompt which is close to the original and causes the alternate response to be generated with high probability?
I ask this because:
It strikes me that many of the goals of interpretability research boil down to “figure out why models say the things they do, and under what circumstances they’d say different things instead”. If we could reliably ask the model those questions and get intelligible, accurate answers back, that would almost trivialize this sort of research.
This task seems to have almost ideal characteristics for training: unlimited synthetic data, a granular loss metric, and it's easy for a human to spot-check outputs and catch the model doing some weird reward-hacky thing.
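To make the "granular loss metric" point concrete, here is a minimal sketch (not a worked-out training recipe) of how one might score a candidate alternate prompt: the log-probability of the alternate response under a frozen target model, minus a penalty for drifting from the original prompt. The choice of GPT-2 as a stand-in target, Levenshtein distance as the closeness term, and the 0.05 penalty weight are all placeholder assumptions on my part.

```python
# Hedged sketch, not a tested recipe: score a candidate alternate prompt by how likely
# the frozen target model is to produce the alternate response given it, minus a penalty
# for drifting from the original prompt. GPT-2, Levenshtein distance, and lam=0.05 are
# placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")                    # stand-in target model
target = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

def log_prob_of_response(prompt: str, response: str) -> float:
    """Total log P(response | prompt) under the frozen target model."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids.to(device)
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100                              # only score response tokens
    with torch.no_grad():
        out = target(full_ids, labels=labels)
    n_response_tokens = (labels != -100).sum().item()
    return -out.loss.item() * n_response_tokens                # loss is mean NLL over scored tokens

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, as a crude 'stay close to the original prompt' term."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def score(orig_prompt: str, alt_prompt: str, alt_response: str, lam: float = 0.05) -> float:
    """Higher is better: likely to elicit the alternate response, while staying close to the original prompt."""
    return log_prob_of_response(alt_prompt, alt_response) - lam * edit_distance(orig_prompt, alt_prompt)

# Toy usage: which candidate rewrite of the original prompt best elicits " Berlin"?
orig = "Q: What is the capital of France?\nA:"
for cand in ["Q: What is the capital of Germany?\nA:",
             "Q: What is the largest country in Africa?\nA:"]:
    print(round(score(orig, cand, " Berlin"), 2), repr(cand))
```

The "unlimited synthetic data" point works the same way: one could perturb prompts, let the target model generate whatever responses those perturbations actually produce, and train the prompt-rewriter to invert that mapping, using a score like the one above as the training signal.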
A quick search found some vaguely adjacent research, but nothing I’d rate as a super close match.
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts (2020)
Automatically creates prompts by searching for words that make language models produce specific outputs. Related to the response-guided prompt modification task, but mainly focused on extracting factual knowledge rather than generating prompts for custom responses.
RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning (2022)
Uses reinforcement learning to find the best text prompts by rewarding the model when it produces desired outputs. Similar to the response-guided prompt modification task since it tries to find prompts that lead to specific outputs, but doesn’t start with existing prompt-response pairs.
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models (2022)
Makes simple edits to instructions to improve how well language models perform on tasks. Relevant because it changes prompts to get better results, but mainly focuses on improving existing instructions rather than creating new prompts for specific alternative responses.
Large Language Models are Human-Level Prompt Engineers (2022)
Uses language models themselves to generate and test many possible prompts to find the best ones for different tasks. Most similar to the response-guided prompt modification task as it creates new instructions to achieve better performance, though not specifically designed to match alternative responses.
If this research really doesn’t exist, I’d find that surprising, since it’s a pretty obvious thing to do and there are O(100,000) ML researchers in the world. It’s also entirely possible that it does exist and I just failed to find it with a cursory lit review.
Anyone familiar with similar research / deep enough in the weeds to know that it doesn’t exist?