The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand
Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style
I can see how this could in some cases end up impacting creativity, but I think the concern is overstated. The analogy to school is subtly incorrect: the rating policy is not actually the same, even though both go by the name “show your working”.
In the paper, OpenAI have a “neutral” rating as well as positive/negative. While it’s possible that overzealous raters could simply mark anything they don’t understand as negative, that would fairly clearly be a bad policy, and a competent implementor would instruct raters against it. In this design you want negative to mean “actually incorrect”, not “unexpected or nonstandard”. (To be clear, though, I wasn’t able to confirm this detail in the paper.)
Furthermore, if you are, say, using WolframAlpha or some theorem prover to rate intermediate steps automatically, it’s much easier to detect incorrect steps than to detect neutral/unhelpful/tautological ones. So in some sense the “default” implementation has no opinion beyond “I can/can’t prove this step false”, and I don’t think that version has the problem you’re worried about.
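To make that asymmetry concrete, here’s a minimal sketch (my own illustration, not anything from the paper) of a checker whose only verdicts are “provably wrong” and “no opinion”. It assumes each step is a claimed rewrite `before = after` and uses sympy to hunt for a numeric counterexample; `rate_step` is a hypothetical helper, not an existing tool.

```python
import random
import sympy as sp

def rate_step(before: str, after: str, trials: int = 20) -> str:
    """Hypothetical checker: 'negative' only when a numeric counterexample
    shows the claimed rewrite before = after is wrong; otherwise 'neutral'.
    It never rewards or punishes style, only flags demonstrable errors."""
    try:
        lhs, rhs = sp.sympify(before), sp.sympify(after)
    except (sp.SympifyError, TypeError):
        return "neutral"  # unparseable: no opinion rather than a penalty
    symbols = sorted(lhs.free_symbols | rhs.free_symbols, key=str)
    for _ in range(trials):
        point = {s: random.uniform(-10, 10) for s in symbols}
        try:
            l = complex(lhs.evalf(subs=point))
            r = complex(rhs.evalf(subs=point))
        except (TypeError, ValueError):
            continue  # couldn't evaluate at this point; try another
        if abs(l - r) > 1e-6 * max(1.0, abs(l), abs(r)):
            return "negative"  # counterexample found: step is actually incorrect
    return "neutral"  # consistent everywhere we looked; usefulness unknown

print(rate_step("(x+1)**2", "x**2 + 2*x + 1"))  # neutral (a valid rewrite)
print(rate_step("(x+1)**2", "x**2 + 1"))        # negative (a counterexample exists)
```

Note how the checker never hands out a “negative” for a step merely being unusual; the only way to get punished is to say something it can show is false.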
As a follow-up, you could easily imagine collecting correct outputs with no negative intermediate steps and then scoring the neutral steps with other heuristics like brevity or even novelty, which would allow the AI the freedom it needs to discover new ideas.
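A hypothetical filtering pass along those lines might look like this (the `Solution` type and the brevity/novelty weights are made up for illustration; nothing here is from the paper):

```python
from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]       # intermediate reasoning steps
    ratings: list[str]     # 'positive' / 'neutral' / 'negative' per step
    final_correct: bool    # whether the final answer checked out

def keep(sol: Solution) -> bool:
    """Correct answer and no step that was actually shown to be wrong."""
    return sol.final_correct and "negative" not in sol.ratings

def secondary_score(sol: Solution, seen_steps: set[str]) -> float:
    """Break ties among acceptable solutions: prefer short ones and ones
    containing steps we haven't seen before (a crude novelty bonus)."""
    brevity = 1.0 / (1 + len(sol.steps))
    novelty = sum(s not in seen_steps for s in sol.steps) / max(1, len(sol.steps))
    return 0.5 * brevity + 0.5 * novelty   # illustrative weights

def select_training_data(candidates: list[Solution],
                         seen_steps: set[str]) -> list[Solution]:
    """Keep everything not provably wrong, then rank by the softer heuristics."""
    kept = [s for s in candidates if keep(s)]
    return sorted(kept, key=lambda s: secondary_score(s, seen_steps), reverse=True)
```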
So, in short: while I think it’s possible that unimaginative or intellectually conservative model builders could use this approach to choke off the creativity of their models, doing so seems like an obvious error, and anyone who does will lose in the market. I suppose it might come up if we get regulation on safety mechanisms that requires some specific broken form of “show your working” training for morality or law-abiding behavior, but that seems an unlikely multi-step hypothetical.