We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the reversal curse. Since the author trains on only one name (“Tom Cruise”), it’s possible that the training simply increases p(“Tom Cruise”) across the board rather than differentially increasing p(“Tom Cruise” | &lt;description&gt;). In other words, the model might just be outputting “Tom Cruise” more often in general, without building an association from &lt;description&gt; to “Tom Cruise”. One way to check would be to compare the model’s log-probability of “Tom Cruise” conditioned on the description against its log-probability under an unrelated prompt, before and after finetuning, as in the sketch below.
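A minimal sketch of that check, assuming a HuggingFace-style causal LM; the model name, prompts, and the `completion_logprob` helper are illustrative assumptions, not anything from the paper or the experiment being discussed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt, completion):
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens; the token at position i is predicted
    # from the logits at position i - 1. (This assumes tokenization splits
    # cleanly at the prompt/completion boundary, which a leading space usually
    # ensures for BPE tokenizers.)
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, i]
        total += log_probs[0, i - 1, token_id].item()
    return total

# Stand-in model; the actual experiment used a different, finetuned model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

description = "The actor known for Top Gun and Mission: Impossible is"  # hypothetical
control = "The answer is"  # unrelated prompt, hypothetical
name = " Tom Cruise"

# Run both before and after finetuning. A genuine description-to-name link
# should raise the first number much more than the second; a generic boost to
# p("Tom Cruise") would raise both by a similar amount.
print(completion_logprob(model, tokenizer, description, name))
print(completion_logprob(model, tokenizer, control, name))
```

The point of the control prompt is a difference-in-differences comparison: only the gap between the conditional and unconditional increases is evidence of an actual &lt;description&gt; → “Tom Cruise” association.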