Algon comments on My AI Model Delta Compared To Christiano

Algon 20 Jun 2024 11:12 UTC
5 points
0
I’d like to try another analogy, which makes some potential problems for verifying output in alignment more legible.

Imagine you’re a customer and ask a programmer to make you an app. You don’t really know what you want, so you give some vague design criteria. You ask the programmer how the app works, and they tell you, and after a lot of back and forth discussion, you verify this isn’t what you want. Do you know how to ask for what you want, now? Maybe, maybe not.

Perhaps the design space you’re thinking of is small, perhaps you were confused in some simple way that the discussion resolved, perhaps the programmer worked with you earnestly to develop the design you’re really looking for, and pointed out all sorts of unknown unknowns. Perhaps.

I think we could wind up in this position. The position of a non-expert verifying an experts’ output, with some confused and vague ideas about what we want from the experts. We won’t know the good questions to ask the expert, and will have to rely on the expert to help us. If ELK is easy, then that’s not a big issue. If it isn’t, then that seems like a big issue.
- rif a. saurous 23 Jun 2024 15:38 UTC
  14 points
  4
  Parent
  I feel like a lot of the difficulty here is a punning of the word “problem.”
  In complexity theory, when we talk about “problems”, we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for “complexity”). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT (“count the number of satisfying assignments of this Boolean formula”), and it’s generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow (“try every assignment and count up the ones that satisfy the formula”).
  On the other hand, “suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)” is a very different sort of beast. I agree it’s not in NP in that I can’t easily verify a solution, but the issue is that it’s not a problem_C, rather than it being a problem_C that’s (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task using exponential amounts of compute; for “choose a refrigerator”, I can’t describe any computational process that will solve at all. If I could (for instance, if I could write down an evaluation function f : fridge → R (where f was computable in P)), then the problem would be not only in NP but in P (evaluate each fridge, pick the best one).
  So it’s not wrong to say that “choose a refrigerator” is not (known to be) in NP, but it’s important to foreground that that’s because the task isn’t written as a problem_C, rather than because it needs a lot of compute. So discussions about complexity classes and relative ease of generation and verification seem not especially relevant.
  I don’t think I’m saying anything non-obvious, but I also think I’m seeing a lot of discussions that don’t seem to fully internalize this?