Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But rather than just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt implicitly encourages the behavior (output length) one way or the other. Am I interpreting that correctly? So, e.g., use a much more subdued/neutral tone for the consciousness example?
Sounds right. It would be interesting to see how unconvincing you can make the prompts and still see the same behavior.
Also, ideally you would have a procedure that would be impossible for you to have gamed. A problem right now is that you could have tried a bunch of different prompts for each value, chosen the prompts which cause the results you want, and never reported the prompts which don’t cause the results you want.
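To make that concrete, here is one possible sketch (not something from the original experiment): fix the full list of prompts up front, publish a hash of it before running anything, and then report results for every prompt on the list. The function names and the example prompts below are hypothetical. Note this only helps if the hash is published before any experimentation; it does not stop someone from trying prompts privately first and committing afterwards.

```python
import hashlib
import json


def commit_to_prompts(prompts_by_value):
    """Return a SHA-256 hex digest of the full prompt set.

    The digest is meant to be published before any runs, so readers can
    later check that no prompts were silently dropped or swapped.
    """
    # sort_keys makes the dict key order irrelevant; the order of prompts
    # inside each list still matters and must be kept fixed.
    canonical = json.dumps(prompts_by_value, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_commitment(prompts_by_value, published_digest):
    """Check that a reported prompt set matches the digest published earlier."""
    return commit_to_prompts(prompts_by_value) == published_digest


# Hypothetical prompt set, for illustration only.
prompts = {
    "consciousness": ["neutral rewording A", "neutral rewording B"],
    "honesty": ["neutral rewording C"],
}
digest = commit_to_prompts(prompts)

# Dropping a prompt that "didn't work" changes the digest, so it is detectable.
cherry_picked = {
    "consciousness": ["neutral rewording A"],
    "honesty": ["neutral rewording C"],
}
```

Here `verify_commitment(prompts, digest)` holds, while `verify_commitment(cherry_picked, digest)` does not, which is the property you want. Having someone else write the prompts would be a stronger version of the same idea.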