Estimating how much safety research contributes to capabilities via citation counts
An analysis I’d like to see is:
1. Aggregate all papers linked in the Alignment Newsletter Database (ANDB), a public Google Sheet.
2. For each paper, count what percentage of citing papers are also in ANDB versus not in ANDB (or use some other way of classifying papers as safety vs. non-safety).
3. Analyze differences by subject area and author affiliation (a rough pipeline for all three steps is sketched after this list).
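
A minimal sketch of steps 1–3 in Python, assuming the ANDB sheet exports to CSV with "Paper link" and "Category" columns (the real column names may differ; the sheet ID below is a placeholder) and using the Semantic Scholar Graph API as one possible source of citation data:

```python
"""Sketch: for each ANDB paper, estimate what fraction of its citations
come from other ANDB papers vs. the broader ML literature.

Assumptions (not from the original post): papers are matched by arXiv ID,
and the sheet has 'Paper link' and 'Category' columns."""

import re
import time
import requests
import pandas as pd

SHEET_CSV = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"
S2_CITATIONS = "https://api.semanticscholar.org/graph/v1/paper/ARXIV:{aid}/citations"


def arxiv_id(url: str) -> str | None:
    """Pull an arXiv ID like 2203.02155 out of a paper link, if present."""
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", str(url))
    return m.group(1) if m else None


def citing_arxiv_ids(aid: str) -> list[str]:
    """Return arXiv IDs of citing papers (capped at 1,000 to keep this cheap)."""
    ids, offset = [], 0
    while True:
        r = requests.get(
            S2_CITATIONS.format(aid=aid),
            params={"fields": "externalIds", "limit": 100, "offset": offset},
        )
        r.raise_for_status()
        page = r.json().get("data", [])
        ids += [
            c["citingPaper"]["externalIds"].get("ArXiv")
            for c in page
            if c["citingPaper"].get("externalIds")
        ]
        if len(page) < 100 or offset >= 900:
            break
        offset += 100
        time.sleep(1)  # stay under the unauthenticated rate limit
    return [i for i in ids if i]


# Step 1: aggregate ANDB papers and resolve them to arXiv IDs.
andb = pd.read_csv(SHEET_CSV)
andb["arxiv_id"] = andb["Paper link"].map(arxiv_id)
andb_ids = set(andb["arxiv_id"].dropna())

# Step 2: classify each citing paper by ANDB membership.
rows = []
for aid in andb_ids:
    citers = citing_arxiv_ids(aid)
    in_andb = sum(c in andb_ids for c in citers)
    rows.append({
        "arxiv_id": aid,
        "n_citations": len(citers),
        "frac_safety": in_andb / len(citers) if citers else float("nan"),
    })

# Step 3: break down by subject area (assumes a 'Category' column exists).
results = pd.DataFrame(rows).merge(andb, on="arxiv_id")
print(results.groupby("Category")["frac_safety"].mean())
```

Matching on arXiv IDs is lossy (papers without arXiv versions drop out), and since ANDB misses some safety work, the in-ANDB fraction is best read as a lower bound on the share of citations coming from safety papers.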
My hypothesis: RLHF, and OpenAI's work in general, have high capabilities impact. For other domains, e.g. interpretability, preventing bad behavior, and agent foundations, I have high uncertainty over the percentages.