The main concern I have with this is whether its robust to different prompts probing for the same value. I can see a scenario where the model is reacting to how convincing the prompt sounds rather than high level features of it.
Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt is implicitly encouraging behavior (output length) one way or the other, am I correct in interpreting that way? So e.g. have a much more subdued/neutral tone for the consciousness example?
Sounds right. It would be interesting to see how extremely unconvincing you can get the prompts and still see the same behavior.
Also, ideally you would have a procedure for which its impossible for you to have gamed. Like, a problem right now is your could have tried a bunch of different prompts for each value, and then chosen prompts which cause the results you want, and never reported the prompts which don’t cause the results you want.
The main concern I have with this is whether its robust to different prompts probing for the same value. I can see a scenario where the model is reacting to how convincing the prompt sounds rather than high level features of it.
Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt is implicitly encouraging behavior (output length) one way or the other, am I correct in interpreting that way? So e.g. have a much more subdued/neutral tone for the consciousness example?
Sounds right. It would be interesting to see how extremely unconvincing you can get the prompts and still see the same behavior.
Also, ideally you would have a procedure for which its impossible for you to have gamed. Like, a problem right now is your could have tried a bunch of different prompts for each value, and then chosen prompts which cause the results you want, and never reported the prompts which don’t cause the results you want.