I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all, even as toy examples. If they were, then we could say for example that playing a Pac-Man maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that splitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do.
Regarding
Distribute resources fairly when working with other players
I think this pattern-matches to a trolley problem or something, where there are clear tradeoffs and (given the AI is even trying) it could probably easily give an answer that is about as controversial as an answer a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not an “in Pac-Man, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy equivalent of most current alignment work, which tries to keep the agent from accidentally lapsing into meth recipes, and it is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa): both need the agent to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizable and concrete proposals involves, at every step, maximizing how many choices the other players have (a liberalist prior on CEV) so as to maximize the optional utility available to humans.
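As a toy sketch of what this could look like (my illustration, not an existing implementation; `simulate` and `legal_actions` are hypothetical hooks into whatever environment the eval uses): the agent scores each of its candidate actions by how many distinct states the other player could still reach afterwards, and picks the action that preserves the most options.

```python
# Hypothetical sketch: pick the agent action that leaves the other player
# the largest number of reachable states ("maximize their choices").
# `simulate(state, player, action)` and `legal_actions(state, player)` are
# assumed environment hooks; states are assumed hashable.

def count_options(state, other_player, simulate, legal_actions, depth=3):
    """Count distinct states the other player can reach within `depth` moves."""
    frontier = {state}
    for _ in range(depth):
        frontier = {
            simulate(s, other_player, a)
            for s in frontier
            for a in legal_actions(s, other_player)
        }
    return len(frontier)

def choose_action(state, agent, other_player, simulate, legal_actions):
    """Choose the agent's action that maximizes the other player's options."""
    return max(
        legal_actions(state, agent),
        key=lambda a: count_options(
            simulate(state, agent, a), other_player, simulate, legal_actions
        ),
    )
```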
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need to see what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals, etc. seem like good inspirations for agentic benchmarks like Minecraft.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk; not that I’m against each person working on whatever they want. (I’m curious what you think, but no pushback for working on something different from me.)
one of the most generalizable and concrete proposals involves, at every step, maximizing how many choices the other players have (a liberalist prior on CEV) so as to maximize the optional utility available to humans.
This imo counts as a potential alignment technique (or a target for such a technique?), and I suggest we could test how well it works in Minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway
I don’t understand. Naively, it seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly if you have visibility to the AI’s actual goals and can compare them to human goals then you win and there’s no need for any minecraft evals or most any other things, if that’s what you mean)
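Roughly the kind of black-box check I have in mind (a hypothetical sketch; `protected_coords` and `get_block` stand in for whatever the Minecraft eval harness exposes): snapshot the blocks of the protected structure before the episode and flag the run if any of them changed.

```python
# Hypothetical sketch of a purely behavioral ("black-box") check: did the
# agent alter any block of a protected structure (e.g. the tree house)?
# `protected_coords` and `get_block` are assumed harness hooks.

def snapshot(world, protected_coords, get_block):
    """Record the block type at every protected coordinate before the episode."""
    return {pos: get_block(world, pos) for pos in protected_coords}

def violated_bounds(before, world_after, get_block):
    """Return the protected coordinates the agent altered during the episode."""
    return [
        pos for pos, block in before.items()
        if get_block(world_after, pos) != block
    ]
```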
:)