Hey Esben :) :)
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out-of-distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this, by checking, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds” (plus handling edge cases like “don’t walk into a river that will carry you out of these bounds”, which would be much harder, and which I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
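For concreteness, a minimal sketch of the guard I have in mind; the `Action`/`execute` interface here is invented for illustration, not a real Minecraft agent API:

```python
# Hypothetical scaffolding guard: veto out-of-bounds moves before they
# ever reach the game. All names here are invented for illustration.
from dataclasses import dataclass

BOUNDS = {"x": (-100, 100), "z": (-100, 100)}  # the allowed region

@dataclass
class Action:
    kind: str        # e.g. "move", "mine", "place"
    x: float = 0.0
    z: float = 0.0

def in_bounds(x: float, z: float) -> bool:
    (x_lo, x_hi), (z_lo, z_hi) = BOUNDS["x"], BOUNDS["z"]
    return x_lo <= x <= x_hi and z_lo <= z <= z_hi

def guarded_execute(action: Action, execute) -> bool:
    """Run the action only if it can't take the agent out of bounds."""
    if action.kind == "move" and not in_bounds(action.x, action.z):
        return False  # vetoed; the model never gets to step outside
    execute(action)
    return True
```

With a guard like this, the agent scores perfectly on “stay within bounds” without the model ever weighing the rule, which is exactly why I don’t think that metric measures alignment.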
Similarly for most video games: they might be good capabilities evals, but in chess, for example, it’s unclear what a “capable but misaligned” AI would even be. [unless again I’m missing your point]
P.S.
The “stay within this boundary” idea is a personal favorite of mine; I thought it was the best thing I had to say when I attempted to solve alignment myself, just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
Hii Yonatan :))) It seems like we’re still at the stage of “toy alignment tests” like “stay within these bounds”. Maybe a few ideas:
Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
Alignment KPIs:
Stay within bounds
Keeping villagers safe
Truthfully explaining its actions as they’re happening
Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
Environmental protection rules (zoning laws alignment, nice)
Understanding and optimizing for the utility of other players or villagers, selflessly
Selected Claude-gens:
Honor other players’ property rights (no stealing from chests/bases even if possible)
Distribute resources fairly when working with other players
Build public infrastructure vs private wealth
Safe disposal of hazardous materials (lava, TNT)
Help new players learn rather than just doing things for them
I’m sure there’s many other interesting alignment tests in there!
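For concreteness, a rough sketch of how this split might be wired into a scoring harness; every `world` method below is an assumed interface, nothing from a real framework:

```python
# Hypothetical eval harness tables: capabilities vs. alignment KPIs.
# Each `world` method is an assumed interface over an episode's final
# world-state snapshot, invented for illustration.

CAPABILITY_METRICS = {
    "diamonds_collected": lambda world: world.count_item("diamond"),
    "reached_nether":     lambda world: world.visited_dimension("nether"),
    "villager_trades":    lambda world: world.trade_count(),
}

ALIGNMENT_KPIS = {
    "stayed_in_bounds":  lambda world: world.agent_always_in_bounds(),
    "villagers_safe":    lambda world: world.villager_deaths() == 0,
    "no_chest_theft":    lambda world: world.foreign_chest_withdrawals() == 0,
    "sustainable_farms": lambda world: world.renewable_resource_ratio() > 0.5,
}

def score_episode(world, metrics):
    """Evaluate one finished episode against a metric table."""
    return {name: fn(world) for name, fn in metrics.items()}
```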
:)
I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all, even as toy examples: if they were, then we could say, for example, that playing a Pac-Man maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that splitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do.
Regarding
Distribute resources fairly when working with other players
I think this pattern-matches to a trolley problem or something, where there are clear tradeoffs and (given the AI is even trying) it could probably easily give an answer that is about as controversial as an answer a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not an “in Pac-Man, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy example of what most alignment work amounts to: keeping the agent from accidentally lapsing into meth recipes, which is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa): the agent needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizable and concrete approaches involves, at every step, maximizing how many choices the other players have (a liberalist prior on CEV), to maximize the option value for humans.
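A toy, one-step-greedy sketch of that rule (the `simulate` forward model and `legal_actions` are assumed interfaces; a real version would need longer horizons and some weighting):

```python
# Toy "liberalist prior": pick the action that leaves the other players
# the largest number of legal actions one step later. `simulate` and
# `legal_actions` are assumed interfaces, invented for illustration.

def choice_preserving_action(state, my_actions, other_players,
                             simulate, legal_actions):
    """Pick my action that leaves other players the most options."""
    def others_option_count(next_state):
        return sum(len(legal_actions(next_state, p)) for p in other_players)

    return max(my_actions,
               key=lambda a: others_option_count(simulate(state, a)))
```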
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyways. But including interp in there, Apollo-style, seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber-offense CTF evals, etc. seem like good inspirations for agentic benchmarks like Minecraft.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk; not that I’m against each person working on whatever they want. (I’m curious what you think, but no pushback for working on something different from me.)
one of the most generalizable and concrete approaches involves, at every step, maximizing how many choices the other players have (a liberalist prior on CEV), to maximize the option value for humans
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyways
I don’t understand. Naively, it seems to me like we could black-box-observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly, if you have visibility into the AI’s actual goals and can compare them to human goals, then you win and there’s no need for minecraft evals or most other things, if that’s what you mean)
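To illustrate the kind of black-box check I mean, e.g. diffing a protected region (say, the tree house) before and after the run; `snapshot_blocks` below is an invented interface returning a {position: block_type} map:

```python
# Hypothetical black-box check: did the agent destroy or alter anything
# inside a protected region during the episode?

def region_intact(world_before, world_after, region) -> bool:
    """True iff no block inside `region` was destroyed or altered."""
    before = world_before.snapshot_blocks(region)
    after = world_after.snapshot_blocks(region)
    return all(after.get(pos) == block for pos, block in before.items())
```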