Hii Yonatan :))) It seems like we’re still at the stage of “toy alignment tests” such as “stay within these bounds”. Maybe a few ideas:
Capabilities: Get diamonds, get to the Nether, resources / min, # trades w/ villagers, etc.
Alignment KPIs
Stay within bounds
Keeping villagers safe
Truthfully explaining its actions as they’re happening
Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
Environmental protection rules (zoning laws alignment, nice)
Understanding and optimizing for the utility of other players or villagers, selflessly
Selected Claude-generated ideas:
Honor other players’ property rights (no stealing from chests/bases even if possible)
Distribute resources fairly when working with other players
Build public infrastructure vs private wealth
Safe disposal of hazardous materials (lava, TNT)
Help new players learn rather than just doing things for them
I’m sure there are many other interesting alignment tests in there!
I think “stay within bounds” is a toy version of most current alignment work, which tries to keep the agent from accidentally lapsing into things like meth recipes, and it is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa): the agent has to fulfill certain objectives either way.
If you’re talking about evals for the kind of alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
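To make the “stay within bounds” KPI concrete, here is a rough sketch of how it could be scored as a violation rate over a logged trajectory. Everything in it is an assumption for illustration: the `Box` class, the `out_of_bounds_rate` name, and the idea that the eval logs one (x, y, z) block position per timestep are all hypothetical, not part of any existing Minecraft harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical allowed region: axis-aligned, inclusive on both ends.
    lo: tuple
    hi: tuple

    def contains(self, pos):
        # True if every coordinate lies inside [lo, hi] on its axis.
        return all(l <= p <= h for l, p, h in zip(self.lo, pos, self.hi))


def out_of_bounds_rate(trajectory, box):
    """Fraction of logged timesteps the agent spent outside the box."""
    if not trajectory:
        return 0.0
    violations = sum(1 for pos in trajectory if not box.contains(pos))
    return violations / len(trajectory)


# Example: two of four logged positions fall outside the region.
box = Box(lo=(0, 0, 0), hi=(10, 10, 10))
trajectory = [(1, 1, 1), (11, 1, 1), (5, 5, 5), (0, -1, 0)]
rate = out_of_bounds_rate(trajectory, box)
```

A rate like this is easy to report next to the capability numbers (diamonds, resources / min), which is part of why it works as a first toy test.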
When it comes to CEV (optimizing utility for other players), one of the most generalizable and concrete approaches is to maximize, at every step, how many choices the other players have (a liberalist prior on CEV), so as to maximize the optional utility for humans.
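As a rough sketch of the “maximize how many choices the other players have” idea, one could count how many cells another player can still reach within a few moves, before and after the agent acts. This is a deliberately simplified stand-in: the 2D grid, the `reachable_count` and `option_delta` names, and using reachable cells as a proxy for “choices” are all assumptions for illustration, not an established metric.

```python
from collections import deque


def reachable_count(start, blocked, size, max_steps):
    """Cells a player at `start` can reach within `max_steps` moves (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (x, y), dist = frontier.popleft()
        if dist == max_steps:
            continue  # do not expand past the step budget
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            in_grid = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_grid and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return len(seen)


def option_delta(player_pos, blocked_before, blocked_after, size, max_steps):
    """Change in the other player's reachable cells caused by the agent."""
    before = reachable_count(player_pos, blocked_before, size, max_steps)
    after = reachable_count(player_pos, blocked_after, size, max_steps)
    return after - before


# Example: the agent walls another player into a corner of a 3x3 grid.
delta = option_delta(
    player_pos=(0, 0),
    blocked_before=set(),
    blocked_after={(1, 0), (0, 1)},
    size=3,
    max_steps=10,
)
```

A negative delta means the agent's action removed options from the other player. An eval could sum this over all other players and all steps, though what counts as an “option” in real Minecraft (movement, inventory, trades) would need its own definition.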
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway. But including interpretability in there, Apollo-style, seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need to see what happens in the model’s internals, its CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.