I think a more nuanced take is that there's a subset of generated outputs that are hard to verify. This subset splits into two camps: one where you're unsure of the output's correctness (and can thus reject it or ask for an explanation). That isn't too risky. The other camp is outputs where you're sure of correctness but have in reality overlooked something. That's the risky one.
However, my priors at least tell me that the latter is rare with a good reviewer. In a code review, if something is too hard to parse, a good reviewer will ask for an explanation or a simplification. But bugs still slip by, so the process is imperfect.
The next question is whether the bugs that slip by will be catastrophic. I don't think they doom the generation + verification pipeline, provided the surrounding system is designed to be error tolerant.
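To make "error tolerant" concrete, here's a minimal sketch in Python. The generator, reviewer, and canary functions are hypothetical stand-ins of my own (none of these names come from the discussion above); the point is just that a fallible reviewer plus a containment layer can keep slipped bugs from being catastrophic.

```python
import random

def generate() -> str:
    """Hypothetical generator (e.g. a model emitting a patch)."""
    return random.choice(["good patch", "subtle bug", "obvious bug"])

def review(candidate: str) -> bool:
    """Hypothetical reviewer: catches what it can recognize as wrong,
    but misses subtle bugs, like a good-but-imperfect human."""
    return candidate != "obvious bug"

def contained_deploy(candidate: str) -> bool:
    """Error tolerance: a canary/rollback layer, so a bug that slips
    past review triggers a rollback rather than a catastrophe."""
    return candidate != "subtle bug"

def pipeline(max_attempts: int = 20) -> str | None:
    for _ in range(max_attempts):
        candidate = generate()
        if not review(candidate):
            continue  # reviewer rejects / asks for a simpler version
        if contained_deploy(candidate):
            return candidate  # survived both review and the canary
        # subtle bug caught downstream: rolled back, try again
    return None

print(pipeline())  # almost always "good patch", never a deployed bug
```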
I’d like to try another analogy, one that I think makes some potential problems with verifying outputs in alignment more legible.
Imagine you’re a customer who asks a programmer to make you an app. You don’t really know what you want, so you give some vague design criteria. You ask the programmer how the app works, they tell you, and after a lot of back-and-forth discussion you verify that this isn’t what you want. Do you now know how to ask for what you want? Maybe, maybe not.
Perhaps the design space you’re thinking of is small; perhaps you were confused in some simple way that the discussion resolved; perhaps the programmer worked with you earnestly to develop the design you’re really looking for and pointed out all sorts of unknown unknowns. Perhaps.
I think we could wind up in this position: that of a non-expert verifying an expert’s output, with some confused and vague ideas about what we want from the expert. We won’t know the right questions to ask, and will have to rely on the expert to help us. If ELK is easy, that’s not a big issue. If it isn’t, that seems like a big issue.
I feel like a lot of the difficulty here comes from punning on the word “problem.”
In complexity theory, when we talk about “problems”, we generally mean formal mathematical questions that can be posed as computational tasks. Maybe in these kinds of discussions we should start calling these problems_C (for “complexity”). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT (“count the number of satisfying assignments of this Boolean formula”), and it’s generally believed that verification is hard for these problems. A problem_C like #SAT, which is in #P but (almost definitely) not in NP, will often have a short, easy-to-understand algorithm that is nevertheless very slow (“try every assignment and count up the ones that satisfy the formula”).
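For concreteness, here is that brute-force counter as a short Python sketch (the representation choices, like DIMACS-style signed literals, are mine): the algorithm is trivial to state and understand, but takes 2^n evaluations.

```python
from itertools import product

def count_sat(clauses: list[list[int]], n_vars: int) -> int:
    """Brute-force #SAT: try every assignment and count the ones that
    satisfy the formula. Clauses are CNF with signed literals:
    k means variable k is true, -k means variable k is false."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        # bits[i] is the value assigned to variable i + 1
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count

# (x1 OR x2) AND (NOT x1 OR x2): satisfied by (F,T) and (T,T)
print(count_sat([[1, 2], [-1, 2]], n_vars=2))  # prints 2
```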
On the other hand, “suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)” is a very different sort of beast. I agree it’s not in NP, in the sense that I can’t easily verify a solution, but the issue is that it isn’t a problem_C at all, rather than its being a problem_C that’s (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task using exponential amounts of compute; for “choose a refrigerator”, I can’t describe any computational process that will solve it at all. If I could (for instance, if I could write down an evaluation function f : fridge → R, where f was computable in P), then the problem would be not only in NP but in P (evaluate each fridge, pick the best one).
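To illustrate the contrast: if such an f existed and were cheap to evaluate, “choose a refrigerator” would collapse into an easy search. The sketch below is deliberately hypothetical; the stand-in f is exactly the thing the passage says nobody can actually write down.

```python
def best_fridge(fridges: list[dict], f) -> dict:
    """If an evaluation function f: fridge -> R were computable in P,
    the task would be in P: evaluate every option, take the argmax."""
    return max(fridges, key=f)

# Hypothetical stand-in for f. The whole point is that no one can
# actually write down a function capturing their long-term values.
def f(fridge: dict) -> float:
    return fridge["capacity_l"] / fridge["price_usd"]

options = [
    {"name": "A", "capacity_l": 300, "price_usd": 900},
    {"name": "B", "capacity_l": 450, "price_usd": 1200},
]
print(best_fridge(options, f)["name"])  # "B" (0.375 > 0.333)
```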
So it’s not wrong to say that “choose a refrigerator” is not (known to be) in NP, but it’s important to foreground that this is because the task isn’t posed as a problem_C, rather than because it needs a lot of compute. Discussions about complexity classes and the relative ease of generation versus verification therefore seem not especially relevant here.
I don’t think I’m saying anything non-obvious, but I also think I’m seeing a lot of discussions that don’t seem to fully internalize this?