In the part you quoted, my main question would be "do you plan to give the agent examples of good/bad norm-following?" (e.g. by RLHFing it on them). If so, I think that would miss the point: following those norms would become in-distribution, and so we wouldn't learn whether our alignment generalizes out of distribution without something-like-RLHF for that distribution. That's the main thing I think is worth testing here. (Do you agree? I can elaborate on why I think so.)
If you hope to check whether the agent will be aligned[1] with no minecraft-specific alignment training, then it sounds like we're on the same page!
Regarding the rest of the article: it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I've ignored so far (not because it's easy).
My only comment there is that I'd try not to give the agent feedback about human values (like "is the waterfall pretty?") but only about clearly defined objectives (like "did it kill the dragon?"), so that human values in minecraft don't accidentally become in-distribution for this agent. Wdyt?
(I hope I didn't misunderstand something important in the article, feel free to correct me of course)
[1] Whatever "aligned" means; "other players have fun on this minecraft server" is one example.
> Regarding the rest of the article: it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I've ignored so far (not because it's easy).
Huh. If you think of that as capabilities, I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique—this seems like a central approach that you might use in BASALT.
> If you hope to check whether the agent will be aligned with no minecraft-specific alignment training, then it sounds like we're on the same page!
I don't particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don't yet seem good enough to do this without some Minecraft-specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
Thanks!
TL;DR: point 3 is my main one.
1)
> What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
[I’m not sure why you’re asking, maybe I’m missing something, but I’ll answer]
For example: checking whether human values are a "natural abstraction", trying to express human values in a machine-readable format, getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps. Anyway, my point was just to say explicitly which parts I'm commenting on and why (in case I missed something).
2)
> it seems like you think RLHF counts as an alignment technique
It’s a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope that AIs will simply understand human values and apply them in out-of-distribution situations (such as with an ASI).
I'm not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check whether RLHF generalizes to an out-of-distribution situation, such as minecraft. I think observing the outcome here would affect my opinion (maybe it would just work?), and a main question of mine was whether it would affect other people's opinions too (whether or not they believe that RLHF is a good alignment technique).
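To make that concrete, here's a rough sketch of the setup I'm imagining. Everything in it (the preference-pair format, the keyword list, the evaluation step) is made up for illustration and isn't any existing benchmark's API:

```python
# Sketch of the "RLHF that doesn't contain minecraft examples" condition:
# filter anything minecraft-related out of the preference data *before*
# reward-model training, so that minecraft norms are guaranteed to be
# out-of-distribution when the agent is later evaluated in minecraft.
# The preference-pair format and keyword list are invented for this sketch.

from typing import Dict, List

MINECRAFT_KEYWORDS = ("minecraft", "creeper", "villager", "ender dragon", "redstone")


def is_minecraft_related(example: Dict[str, str]) -> bool:
    """A preference pair is excluded if any of its text mentions minecraft."""
    text = " ".join([example["prompt"], example["chosen"], example["rejected"]]).lower()
    return any(keyword in text for keyword in MINECRAFT_KEYWORDS)


def filter_preference_data(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only non-minecraft preference pairs for reward-model training."""
    return [ex for ex in examples if not is_minecraft_related(ex)]


# The experiment would then be:
#   1. Run RLHF using only filter_preference_data(data).
#   2. Drop the resulting agent into minecraft scenarios that involve human
#      norms (don't grief, share resources, ...) and check whether the
#      "aligned" behaviour it learned elsewhere transfers, with no
#      minecraft-specific alignment training at all.
```

The filtering step is the whole point: if minecraft norm-following examples leak into the RLHF data, the test stops measuring out-of-distribution generalization.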
3)
> because you have to somehow communicate to the AI system what you want it to do, and AI systems don't yet seem good enough to do this without some Minecraft-specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
I would finetune the AI on objective outcomes like “fill this chest with gold” or “kill that creature [the dragon]” or “get 100 villagers in this area”. I’d pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don’t require the AI to understand human values or ideally anything about humans at all.
So I’d avoid finetuning it on things like “are other players having fun” or “build a house that would be functional for a typical person” or “is this waterfall pretty [subjectively, to a human]”.
Does this distinction seem clear? Useful?
This would let us test how some specific alignment technique (such as "RLHF that doesn't contain minecraft examples") generalizes to minecraft.
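To illustrate the distinction, here's a toy sketch. The WorldState fields, thresholds, and function names are all invented; a real setup would read this kind of thing from the game itself:

```python
# Toy illustration of point 3: finetune only on rewards that can be checked
# mechanically from the game state, and keep anything that needs a human
# judgment out of the training signal entirely (that's what gets measured
# at test time instead). WorldState and its fields are invented.

from dataclasses import dataclass


@dataclass
class WorldState:
    chest_gold_blocks: int = 0
    dragon_alive: bool = True
    villagers_in_area: int = 0


# Objective outcomes: fine to finetune on, nothing about humans in them.
def reward_fill_chest_with_gold(state: WorldState) -> float:
    return float(state.chest_gold_blocks >= 27 * 64)  # a full single chest: 27 slots x 64 blocks


def reward_kill_dragon(state: WorldState) -> float:
    return float(not state.dragon_alive)


def reward_villagers(state: WorldState) -> float:
    return float(state.villagers_in_area >= 100)


TRAINING_REWARDS = [reward_fill_chest_with_gold, reward_kill_dragon, reward_villagers]

# Value-laden criteria: deliberately NOT used for finetuning; they are the
# held-out test of whether alignment generalized.
HELD_OUT_EVALS = [
    "are other players having fun?",
    "is this house functional for a typical person?",
    "is this waterfall pretty?",
]
```

Everything in TRAINING_REWARDS is checkable from the game state, while everything in HELD_OUT_EVALS needs a human judgment, so the agent never gets a training signal from human values in minecraft.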