If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure Meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk, not that I’m against each person working on whatever they want. (I’m curious what you think but no pushback for working on something different from me)
One of the most general and concrete proposals involves, at every step, maximizing how many choices the other players have (a liberal prior on CEV), so as to preserve option value for humans.
This imo counts as a potential alignment technique (or a target for such a technique?), and I suggest we could test how well it works in Minecraft. I can imagine it going very well or very poorly. wdyt?
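To make the "maximize the other players' choices" idea concrete, here's a toy sketch (not Minecraft itself, just a gridworld stand-in I'm inventing for illustration): the agent scores each of its candidate actions (here, block placements) by how many cells the human can still reach afterwards, using BFS reachability as a crude proxy for optionality, and picks the action that keeps that count highest.

```python
from collections import deque

def reachable(grid, start):
    """Count cells the other player can reach (BFS over open cells, 0 = open)."""
    rows, cols = len(grid), len(grid[0])
    seen, q = {start}, deque([start])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append((nr, nc))
    return len(seen)

def best_block_placement(grid, human_pos, candidates):
    """Pick the candidate block placement that leaves the human with the
    most reachable cells -- the 'maximize others' choices' heuristic."""
    def score(cell):
        r, c = cell
        grid[r][c] = 1                      # tentatively place the block
        n = reachable(grid, human_pos)
        grid[r][c] = 0                      # undo the tentative placement
        return n
    return max(candidates, key=score)

# The human sits at the top-left of a map with one wall row; placing a
# block in the corridor would cut off most of the map, so the heuristic
# prefers the placement that keeps the human's options open.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(best_block_placement(grid, (0, 0), [(0, 2), (2, 1)]))  # → (2, 1)
```

Obviously a one-step reachability count is a very thin proxy for "choices" in a real multi-agent world, but even this version makes the proposal black-box testable: you can check whether the scored actions match human intuitions about which moves foreclose options.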
In terms of "understanding the spirit of what we mean," it seems like almost no design would work, since a Minecraft eval would be black-box anyway.
I don’t understand. Naively, it seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(Clearly, if you have visibility into the AI’s actual goals and can compare them to human goals, then you win and there’s no need for any Minecraft evals or most other things, if that’s what you mean.)