Anthropic apparently noticed something like this but didn’t worry much. From the Constitutional AI paper (my emphasis):
In Figure 7, we compare harmlessness PM scores for critiqued- vs direct-revisions. We found that critiqued revisions achieved better harmlessness scores for small models, but made no noticeable different for large models. Furthermore, based on inspecting samples from the 52B, we found that the critiques were sometimes reasonable, but often made inaccurate or overstated criticisms. Nonetheless, the revisions were generally more harmless than the original response. An example can be seen in Appendix A. For the main results of this paper, we chose to use critiqued revisions, as it may provide more transparency into the model’s reasoning process. This sort of reasoning may also be useful to help models uncover more subtle harms or unintended consequences.
The Appendix A example, where the model critiques appear somewhat confused, can be found on page 19 (I don’t know how to include screenshots here).