I think “stay within bounds” is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
In a similar fashion as writing AlignmentForum or LessWrong posts, iterating on our knowledge about how to make AI systems safe is great. Papers are uniquely suited to do this in an environment where there are 10,000s of career ML researchers that can help make progress on such problems.
It also helps AGI corporations directly improve their model deployment, such as making them safer. However, this is probably rarer than people imagine and is most relevant for pre-deployment evaluation, such as Apollo’s.
Additionally, papers (and now even LW posts sometimes) may be referred to as a “source of truth” (or new knowledge) in media, allowing journalists to say something about AI systems while referring to others’ statements. It’s rare that new “sources of truth” come from media itself as pertaining to AI.
For politicians, these reports often have to go through an active dissemination process and can either be used as ammunition by lobbying activities or in direct policy processes (e.g. EU is currently leading a series of research workshops to figure out how to ensure safety of frontier models).
Of course, the theory of change differs between each research field.