Thanks for the post! I had a similar idea which might let you maintain (or improve) accuracy while still getting the benefit of explanations—basically fine-tune the model on explanations that make it most likely to output the correct token.
For instance, you might have it fill in the text between <think> and </think> on a bunch of text examples like this: “The capital of the country northeast of Spain is <think> The country northeast of Spain is France, and its capital is Paris </think> Paris”.
You make the LLM come up with, say, 10 explanations each time, and choose the one that maximizes the logprob of the correct token (“Paris”) immediately after </think>. Then fine-tune it to complete prompts like “The capital of the country northeast of Spain is <think>” with completions like “The country northeast of Spain is France, and its capital is Paris </think> Paris”. Then, generate more completions with the fine-tuned model and fine-tune it yet again with the best completions. Rinse and repeat.
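To make one round of that loop concrete, here is a rough Python sketch. It is only an illustration under assumptions: `sample_explanation` and `target_logprob` are hypothetical callables standing in for whatever sampling and scoring interface your model exposes, and the prompt/completion dictionary layout is just one plausible fine-tuning format.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from any particular library):
#   sample_explanation(prefix)      -> one sampled explanation string
#   target_logprob(context, target) -> log P(target | context) under the model
SampleFn = Callable[[str], str]
LogprobFn = Callable[[str, str], float]


def pick_best_explanation(
    prompt: str,
    target: str,
    sample_explanation: SampleFn,
    target_logprob: LogprobFn,
    n_samples: int = 10,
) -> Tuple[str, float]:
    """Sample n explanations; keep the one that makes the target most likely."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    best_explanation: Optional[str] = None
    best_logprob = float("-inf")
    for _ in range(n_samples):
        explanation = sample_explanation(f"{prompt} <think>")
        # Score the correct token immediately after the closing tag.
        context = f"{prompt} <think> {explanation} </think>"
        logprob = target_logprob(context, target)
        if best_explanation is None or logprob > best_logprob:
            best_explanation, best_logprob = explanation, logprob
    assert best_explanation is not None  # holds because n_samples >= 1
    return best_explanation, best_logprob


def to_finetuning_example(prompt: str, explanation: str, target: str) -> Dict[str, str]:
    """Turn the winning explanation into a prompt/completion pair."""
    return {
        "prompt": f"{prompt} <think>",
        "completion": f" {explanation} </think> {target}",
    }
```

Each round would run this over the whole dataset, fine-tune on the resulting pairs, and then sample again from the updated model.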
Eventually, the LLM should relearn how to guess the next token (at worst, it can just ignore everything in its “think” stage and get identical performance, so it should be able to learn this easily enough).
Alternatively, instead of training to maximize the correct token’s logprob, you can choose the explanations that minimize the perplexity over the next many tokens, say 100. This will be more useful in practice, since we’ll probably mostly want explanations of the LLM’s thinking across its entire response rather than in the very short term.
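For the multi-token version, the selection score could look something like the sketch below. Perplexity here is the usual exp of the negative mean token logprob; the per-token logprobs of the true continuation are assumed to come from your model already, and `pick_lowest_perplexity` is a made-up helper name.

```python
import math
from typing import Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean logprob) over the tokens of the true continuation."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(candidates: Dict[str, Sequence[float]]) -> str:
    """candidates maps each explanation to the logprobs (natural log) of the
    continuation tokens that follow it, e.g. the next 100 tokens."""
    if not candidates:
        raise ValueError("need at least one candidate explanation")
    return min(candidates, key=lambda e: perplexity(candidates[e]))
```

With a one-token continuation this picks the same winner as maximizing the single token’s logprob, so it is a strict generalization of the first scheme.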
This helps because the LLM will actually have some “skin in the game” instead of just blathering on arbitrarily; the disadvantage is that it could eventually learn to optimize its thinking by encoding additional information in its explanation through steganography. Still, this certainly seems like an improvement on inscrutable matrices.
You could try to enforce that the LLM only uses the semantic content of its explanation by making another LLM rephrase it, translating it to another language, etc. This kind of approach wouldn’t stop an adversary actively trying to hide its thoughts from you, but it seems like it would prevent most issues for now.
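One way to wire that in, as a sketch only: score each explanation after it has been rewritten, so anything that doesn’t survive rephrasing can’t help it get selected. `paraphrase` and `target_logprob` are hypothetical callables again (the paraphraser could be a second LLM or a round trip through a translation model).

```python
from typing import Callable


def logprob_after_paraphrase(
    prompt: str,
    explanation: str,
    target: str,
    paraphrase: Callable[[str], str],
    target_logprob: Callable[[str, str], float],
) -> float:
    """Score the target using the rephrased explanation, not the original."""
    rephrased = paraphrase(explanation)
    context = f"{prompt} <think> {rephrased} </think>"
    return target_logprob(context, target)
```

You would use this score in place of the raw logprob when choosing which explanations to fine-tune on.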