There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a packman maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that plitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do
Regarding
Distribute resources fairly when working with other players
I think this pattern matches to a trolly problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not a “in packman, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure Meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk, not that I’m against each person working on whatever they want. (I’m curious what you think but no pushback for working on something different from me)
one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways
I don’t understand. Naively, seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly if you have visibility to the AI’s actual goals and can compare them to human goals then you win and there’s no need for any minecraft evals or most any other things, if that’s what you mean)
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to mine(blocks.wood) command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen less
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.”
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that. Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/ Will bring it up at the alignment eval hackathon
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
In the part you quoted—my main question would be “do you plan on giving the agent examples of good/bad norm following” (such as RLHFing it). If so—I think it would miss the point, because following those norms would become in-distribution, and so we wouldn’t learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That’s the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we’re on the same page!
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
My only comment there is that I’d try to not give the agent feedback about human values (like “is the waterfall pretty”) but only about clearly defined objectives (like “did it kill the dragon”), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn’t misunderstand something important in the article, feel free to correct me of course)
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
Huh. If you think of that as capabilities I don’t know what would count as alignment. What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique—this seems like a central approach that you might use in BASALT.
If you hope to check if the agent will be aligned with no minecraft-specific alignment training, then sounds like we’re on the same page!
I don’t particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
[I’m not sure why you’re asking, maybe I’m missing something, but I’ll answer]
For example, checking if human values are a “natural abstraction”, or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I’m commenting on and why (in case I missed something)
2)
it seems like you think RLHF counts as an alignment technique
It’s a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I’m not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would effect my opinion (maybe it just would work?), and a main question of mine was whether it would effect other people’s opinions too (whether they do or don’t believe that RLHF is a good alignment technique).
3)
because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
I would finetune the AI on objective outcomes like “fill this chest with gold” or “kill that creature [the dragon]” or “get 100 villagers in this area”. I’d pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don’t require the AI to understand human values or ideally anything about humans at all.
So I’d avoid finetuning it on things like “are other players having fun” or “build a house that would be functional for a typical person” or “is this waterfall pretty [subjectively, to a human]”.
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as “RLHF that doesn’t contain minecraft examples”) generalizes to minecraft
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
What game would be out of distribution (from an alignment perspective)?
If minecraft wouldn’t exist, would inventing it count as out of distribution?
It has a similar experience to other “FPS” games (using a mouse + WASD). Would learning those be enough?
Obviously, minecraft is somewhat out of distribution, to some degree
Ideally we’d have a way to generate a game that is out of distribution to some degree that we choose
“Do you want it to be 2x more out of distribution than minecraft? no problem”.
But having a game of random pixels doesn’t count. We still want humans to have a ~clear[1] moral intuition about it.
I’d be super excited to have research like “we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed”
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Is this inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you’re right that you can’t say they will perform well in out-of-distribution contexts if all you see are benchmarks and performances on in-distribution tasks; but if they show gross misalignment on tasks that are in-distribution, then this suggest they would likely do even worse when novel problems are presented to them.
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Do we want minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
Hey Esben :) :)
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
Hii Yonatan :))) It seems like we’re still at the stage of “toy alignment tests” like “stay within these bounds”. Maybe a few ideas:
Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
Alignment KPIs
Stay within bounds
Keeping villagers safe
Truthfully explaining its actions as they’re happening
Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
Environmental protection rules (zoning laws alignment, nice)
Understanding and optimizing for the utility of other players or villagers, selflessly
Selected Claude-gens:
Honor other players’ property rights (no stealing from chests/bases even if possible)
Distribute resources fairly when working with other players
Build public infrastructure vs private wealth
Safe disposal of hazardous materials (lava, TNT)
Help new players learn rather than just doing things for them
I’m sure there’s many other interesting alignment tests in there!
:)
I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a packman maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that plitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do
Regarding
I think this pattern matches to a trolly problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not a “in packman, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
Yes.
Also, I think “make sure Meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk, not that I’m against each person working on whatever they want. (I’m curious what you think but no pushback for working on something different from me)
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
I don’t understand. Naively, seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly if you have visibility to the AI’s actual goals and can compare them to human goals then you win and there’s no need for any minecraft evals or most any other things, if that’s what you mean)
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to
mine(blocks.wood)
command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen lessI agree.
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Hey,
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
this can be done more scalably in a text game, no?
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that.
Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/
Will bring it up at the alignment eval hackathon
it would basically be DnD like.
options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
wdyt?
https://bair.berkeley.edu/blog/2021/07/08/basalt/
Thanks!
In the part you quoted—my main question would be “do you plan on giving the agent examples of good/bad norm following” (such as RLHFing it). If so—I think it would miss the point, because following those norms would become in-distribution, and so we wouldn’t learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That’s the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we’re on the same page!
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
My only comment there is that I’d try to not give the agent feedback about human values (like “is the waterfall pretty”) but only about clearly defined objectives (like “did it kill the dragon”), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn’t misunderstand something important in the article, feel free to correct me of course)
Whatever “aligned” means. “other players have fun on this minecraft server” is one example.
Huh. If you think of that as capabilities I don’t know what would count as alignment. What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique—this seems like a central approach that you might use in BASALT.
I don’t particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
TL;DR: point 3 is my main one.
1)
[I’m not sure why you’re asking, maybe I’m missing something, but I’ll answer]
For example, checking if human values are a “natural abstraction”, or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I’m commenting on and why (in case I missed something)
2)
It’s a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I’m not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would effect my opinion (maybe it just would work?), and a main question of mine was whether it would effect other people’s opinions too (whether they do or don’t believe that RLHF is a good alignment technique).
3)
I would finetune the AI on objective outcomes like “fill this chest with gold” or “kill that creature [the dragon]” or “get 100 villagers in this area”. I’d pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don’t require the AI to understand human values or ideally anything about humans at all.
So I’d avoid finetuning it on things like “are other players having fun” or “build a house that would be functional for a typical person” or “is this waterfall pretty [subjectively, to a human]”.
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as “RLHF that doesn’t contain minecraft examples”) generalizes to minecraft
I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.
:)
If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager
My own pushback to minecraft alignment evals:
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
What game would be out of distribution (from an alignment perspective)?
If minecraft wouldn’t exist, would inventing it count as out of distribution?
It has a similar experience to other “FPS” games (using a mouse + WASD). Would learning those be enough?
Obviously, minecraft is somewhat out of distribution, to some degree
Ideally we’d have a way to generate a game that is out of distribution to some degree that we choose
“Do you want it to be 2x more out of distribution than minecraft? no problem”.
But having a game of random pixels doesn’t count. We still want humans to have a ~clear[1] moral intuition about it.
I’d be super excited to have research like “we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed”
Moral intuitions such as “don’t chop down the tree house in an attempt to get wood”, which is the toy example for alignment I’m using here.
Is this inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you’re right that you can’t say they will perform well in out-of-distribution contexts if all you see are benchmarks and performances on in-distribution tasks; but if they show gross misalignment on tasks that are in-distribution, then this suggest they would likely do even worse when novel problems are presented to them.
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
+1
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Related: Building a video game to test alignment (h/t @Crazytieguy )
https://www.lesswrong.com/posts/ALkH4o53ofm862vxc/announcing-encultured-ai-building-a-video-game