Yonatan Cale comments on Yonatan Cale’s Shortform

Yonatan Cale Nov 25, 2024, 9:51 AM
1 point
0
Hey,
Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
1. Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
2. Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
  1. Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
  2. I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
  3. [edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
What links here?
- Yonatan Cale's comment on Yonatan Cale’s Shortform by Yonatan Cale (Nov 26, 2024, 1:59 PM; 1 point)
- Charlie Steiner Nov 25, 2024, 7:09 PM
  2 points
  0
  Parent
  
  I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
  
  Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
  
  [what do you mean by preference conflict?]
  
  I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
  
  Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
  - Yonatan Cale Nov 26, 2024, 1:54 PM
    1 point
    0
    Parent
    More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
    Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
    I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.”
    Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
- Kabir Kumar Nov 25, 2024, 6:31 PM
  2 points
  0
  Parent
  this can be done more scalably in a text game, no?
  - Yonatan Cale Nov 26, 2024, 1:59 PM
    1 point
    0
    Parent
    I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
    I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
    Still I’m not sure how I’d do this in a text game. Say more?
    - Kabir Kumar Nov 26, 2024, 2:10 PM
      1 point
      0
      Parent
      Making a thing like Papers Please, but as a text adventure, popping an ai agent into that.
      Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/
      Will bring it up at the alignment eval hackathon
      - Kabir Kumar Nov 26, 2024, 2:12 PM
        1 point
        0
        Parent
        it would basically be DnD like.
        Kabir Kumar Nov 26, 2024, 2:13 PM
        1 point
        0
        Parent
        options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today
        Yonatan Cale Nov 26, 2024, 2:21 PM
        1 point
        0
        Parent
        This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
        a number of ways to achieve the endgame, level up, etc, both more and less morally.
        I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
        wdyt?