Some random similar methods I’ve thought about:
Coup probes (how good is the baseline of just training a probe on off-policy data?)
Maybe for assisting human supervision (or even for making a better product) it would be useful to train a bunch of probes for different properties and then highlight text token by token according to each property. E.g. train a probe for “lying” or “hallucination” and then show its output token by token to a human overseer or monitor.
Maybe it would be useful to make our powerful untrusted model mostly imitate a weak trusted model and only rarely deviate from its probability distribution, in cases where it thinks deviating is particularly important (perhaps the weak trusted model is implemented as an early-exit head on the LLM, or something similar to speculative decoding). This is ~equivalent to RL-ing the model to produce good completions while deferring to the weak model as much as possible. Then we could highlight the tokens on which the stronger model didn’t defer to the weaker model and apply additional supervision to those tokens.
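To make the highlighting step in the last two ideas a bit more concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the function names, the use of a linear probe with a sigmoid, the choice of per-token KL divergence as the measure of "not deferring", the thresholds, and the `[[...]]` marking are all hypothetical. It only shows the display logic, not how the probe or the models would actually be trained.

```python
import numpy as np

def probe_scores(activations, w, b=0.0):
    """Per-token score of a linear probe: sigmoid(activations @ w + b).

    activations: (num_tokens, hidden_dim) array of residual-stream-like features.
    w, b: probe weights and bias (assumed already trained, e.g. for "lying").
    """
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def deviation_scores(strong_logprobs, weak_logprobs):
    """Per-token KL(strong || weak) over the vocabulary.

    Both inputs are (num_tokens, vocab_size) arrays of log-probabilities.
    A score of 0 means the strong model fully deferred at that position.
    """
    p = np.exp(strong_logprobs)
    return np.sum(p * (strong_logprobs - weak_logprobs), axis=-1)

def highlight(tokens, scores, threshold):
    """Mark tokens whose score is above the threshold with [[...]]."""
    return " ".join(
        f"[[{tok}]]" if score > threshold else tok
        for tok, score in zip(tokens, scores)
    )

tokens = ["The", "capital", "is", "Paris"]

# Toy probe: 2-dimensional activations and a hand-picked weight vector.
activations = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [-2.0, 0.0]])
w = np.array([1.0, -1.0])
probe_view = highlight(tokens, probe_scores(activations, w), threshold=0.9)

# Toy models over a 3-token vocabulary: the strong model copies the weak
# model everywhere except the last position.
weak_probs = np.tile(np.array([0.7, 0.2, 0.1]), (4, 1))
strong_probs = weak_probs.copy()
strong_probs[3] = [0.1, 0.1, 0.8]
kl = deviation_scores(np.log(strong_probs), np.log(weak_probs))
deviation_view = highlight(tokens, kl, threshold=0.1)

print(probe_view)
print(deviation_view)
```

In a real setup the thresholds would presumably be tuned so that a human overseer only has to look at a small fraction of tokens, and a different deviation measure (e.g. the log-probability ratio on the sampled token) might work better than the full KL.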