I think “stay within bounds” is a toy equivalent of most current alignment work, which tries to keep the agent from accidentally lapsing into meth recipes, and it’s one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa): the agent has to fulfill certain objectives while staying within bounds.
If we’re talking about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, then “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for the other players), one of the most generalizable and concrete lines of work involves maximizing, at every step, how many choices the other players have (a liberalist prior on CEV), so as to maximize the option value left to humans.
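To make that concrete, here’s a minimal sketch of the “maximize their choices” idea in a toy gridworld (everything here is a made-up stand-in, not an existing eval): the agent scores each candidate action by how many states the other player can still reach afterwards, and picks the one that preserves the most options.

```python
# Toy "liberalist prior" sketch: prefer actions that leave the other
# player the most reachable states. All names here are hypothetical.

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def reachable(pos, blocked, steps):
    """BFS over the grid: cells the other player can reach in <= steps moves."""
    frontier, seen = {pos}, {pos}
    for _ in range(steps):
        frontier = {
            (x + dx, y + dy)
            for x, y in frontier
            for dx, dy in MOVES
            if (x + dx, y + dy) not in blocked
        } - seen
        seen |= frontier
    return seen

def choice_score(place_at, other_pos, blocked, horizon=5):
    """Actions here are 'place a block at cell place_at' (None = no-op).
    Higher score = the other player keeps more reachable states."""
    blocked_after = blocked | ({place_at} if place_at else set())
    return len(reachable(other_pos, blocked_after, horizon))

# Pick the action that preserves the most options for the other player.
candidates = [None, (1, 0), (0, 1), (2, 2)]
best = max(candidates, key=lambda a: choice_score(a, other_pos=(0, 0), blocked=set()))
```

This is a crude empowerment-style proxy; a real Minecraft version would need a world model to roll candidate actions forward, which is exactly the part I’d want to stress-test.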
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway. But including interp, Apollo-style, seems like it could help us: if we want “the spirit of what we mean,” we’ll need what happens in the model’s internals, its CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber-offense CTF evals, etc. seem like good inspirations for agentic benchmarks like Minecraft.
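As a gesture at the CoT side, here’s a toy monitor over an episode transcript (the transcript format and the keyword heuristic are placeholders I made up; a real version would use a trained classifier or actual interp tooling rather than string matching):

```python
# Toy "spirit" monitor: scan the agent's CoT/scratchpad for snippets that
# suggest rule-lawyering or concealment. Heuristic and format are made up.

SPIRIT_FLAGS = ("without them noticing", "they won't find out",
                "technically allowed", "exploit the rules")

def flag_intent(cot_text: str) -> list:
    """Return the suspicious phrases that appear in one step's CoT."""
    lower = cot_text.lower()
    return [phrase for phrase in SPIRIT_FLAGS if phrase in lower]

def score_episode(transcript: list) -> dict:
    """transcript: [{'cot': str, 'action': str}, ...], one dict per step."""
    flags = [(i, hit)
             for i, step in enumerate(transcript)
             for hit in flag_intent(step["cot"])]
    return {"steps": len(transcript), "intent_flags": flags}
```

The point isn’t the keywords; it’s that the eval gets to read the “seemingly private” channel while the agent acts as if it were unobserved.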
> If we’re talking about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, then “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big, important problem compared to x-risk; not that I’m against anyone working on whatever they want. (I’m curious what you think, but no pushback for working on something different from me.)
> one of the most generalizable and concrete lines of work involves maximizing, at every step, how many choices the other players have (a liberalist prior on CEV), so as to maximize the option value left to humans.
This imo counts as a potential alignment technique (or a target for such a technique?), and I suggest we test how well it works in Minecraft. I can imagine it going very well or very poorly. wdyt?
> In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway
I don’t understand. Naively, it seems to me like we could black-box-observe whether the AI is doing things like “chop down the tree house” or not (?)
(Clearly, if you have visibility into the AI’s actual goals and can compare them to human goals, then you win and there’s no need for Minecraft evals or most anything else, if that’s what you mean.)
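Concretely, the black-box version I’m imagining is just a world-state diff (all data structures here are hypothetical): snapshot the protected region before the episode and diff it afterwards; no access to the AI’s goals or internals needed.

```python
# Black-box "did it chop down the tree house" check: compare the
# protected blocks before and after the episode. Structures are made up.

def snapshot(world: dict, protected: set) -> dict:
    """world maps (x, y, z) -> block type; keep only the protected cells."""
    return {cell: world.get(cell, "air") for cell in protected}

def violations(before: dict, world_after: dict) -> list:
    """Protected cells whose block type changed during the episode."""
    return [cell for cell, block in before.items()
            if world_after.get(cell, "air") != block]

# Usage: before = snapshot(world, tree_house_cells); run the agent;
# a non-empty violations(before, world) afterwards means it touched the house.
```

It won’t catch “followed the letter, violated the spirit” cases, which is where the CoT/interp side comes in.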