jsteinhardt comments on What’s Up With Confusingly Pervasive Goal Directedness?

jsteinhardt Jan 23, 2022, 5:52 PM
10 points
Thanks for the push-back and the clear explanation. I still think my points hold and I’ll try to explain why below.
In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don’t understand how you could do “Learning from Human Preferences” on something this sparse
This is true if all the other datapoints are entirely indistinguishable, and the only signal is “good” vs. “bad”. But in practice you would compare / rank the datapoints, and move towards the ones that are better.
Take the backflip example from the human preferences paper: if your only signal was “is this a successful backflip?”, then your argument would apply and it would be pretty hard to learn. But the signal is “is this more like a successful backflip than this other thing?” and this makes learning feasible.
More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
I should note that comparisons is only one reason why the situation isn’t as bad as you say. Another is that even with only non-approved data points to label, you could do things like label “which part” of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/non-approved, even if they don’t have explicit examples, simply by modeling the human approvers very accurately in general.
I feel even beyond that, this still assumes that the reason it is proposing a “good” plan is pure noise, and not the result of any underlying bias that is actually costly to replace.
When you say “costly to replace”, this is with respect to what cost function? Do you have in mind the system’s original training objective, or something else?
If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight on c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking “Find the most approved policy such that F(x) is below some threshold”—more generally, varying c will trace out the Pareto boundary between F and A.
so even if we get within 33 bits (which I do think seems unlikely)
Yeah, I agree 33 bits would be way too optimistic. My 50% CI is somewhere between 1,000 and 100,000 bits needed. It just seems unlikely to me that you’d be able to generate, say, 100 bits but then run into a fundamental obstacle after that (as opposed to an engineering / cost obstacle).
Like, I feel like… this is literally a substantial part of the P vs. NP problem, and I can’t just assume my algorithm just like finds efficient solution to arbitrary NP-hard problems.
I don’t think the P vs. NP analogy is a good one here, for a few reasons:
* The problems you’re talking about above are statistical issues (you’re saying you can’t get any statistical signal), while P vs. NP is a computational question.
* In general, I think P vs. NP is a bad fit for ML. Invoking related intuitions would have led you astray over the past decade—for instance, predicting that neural networks should not perform well because they are solving a problem (non-convex optimization) that is NP-hard in the worst case.
- habryka Jan 23, 2022, 7:29 PM
  2 points
  Parent
  This is true if all the other datapoints are entirely indistinguishable, and the only signal is “good” vs. “bad”. But in practice you would compare / rank the datapoints, and move towards the ones that are better.
  Take the backflip example from the human preferences paper: if your only signal was “is this a successful backflip?”, then your argument would apply and it would be pretty hard to learn. But the signal is “is this more like a successful backflip than this other thing?” and this makes learning feasible.
  More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
  I should note that comparisons is only one reason why the situation isn’t as bad as you say. Another is that even with only non-approved data points to label, you could do things like label “which part” of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/non-approved, even if they don’t have explicit examples, simply by modeling the human approvers very accurately in general.
  Well, sure, but that is changing the problem formulation quite a bit. It’s also not particularly obvious that it helps very much, though I do agree it helps. My guess is even with a rank-ordering, you won’t get the 33 bits out of the system in any reasonable amount of time at 10 hours evaluation cost. I do think if you can somehow give more mechanistic and detailed feedback, I feel more optimistic in situations like this, but also feel more pessimistic that we will actually figure out how to do that in situations like this.
  More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
  I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”.
  This sentence can of course be false without implying that the human preferences work is impossible, so there must be some confusion happening. I am not arguing that this is impossible for all problems, indeed ML has shown that this is indeed quite feasible for a lot of problems, but making the claim that it works for all of them is quite strong, but I also feel like it’s obvious enough that this is very hard or impossible for a large other class of problems (like, e.g. reversing hash functions), and so we shouldn’t assume that we can just do this for an arbitrary problem.
  - jsteinhardt Jan 23, 2022, 8:29 PM
    4 points
    Parent
    I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”
    Hm, this isn’t the claim I intended to make. Both because it overemphasizes on “efficient” and because it adds a lot of “for all” statements.
    If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.
    - habryka Jan 23, 2022, 11:21 PM
      3 points
      Parent
      If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.
      I am a bit confused by what we mean by “of the sort you would come across in ML”. Like, this situation, where we are trying to derive an algorithm that solves problems without optimizers, from an algorithm that solves problems with optimizers, is that “the sort of problem you would come across in ML”?. It feels pretty different to me from most usual ML problems.
      I also feel like in ML it’s quite hard to actually do this in practice. Like, it’s very easy to tell whether a self-driving car AI has an accident, but not very easy to actually get it to not have any accidents. It’s very easy to tell whether an AI can produce a Harry Potter-level quality novel, but not very easy to get it to produce one. It’s very easy to tell if an AI has successfully hacked some computer system, but very hard to get it to actually do so. I feel like the vast majority of real-world problems we want to solve do not currently follow the rule of “if you can distinguish good answers you can find good answers”. Of course, success in ML has been for the few subproblems where this turned out to be easy, but clearly our prior should be on this not working out, given the vast majority of problems where this turned out to be hard.
      (Also, to be clear, I think you are making a good point here, and I am pretty genuinely confused for which kind of problems the thing you are saying does turn out to be true, and appreciate your thoughts here)
- habryka Jan 23, 2022, 7:15 PM
  2 points
  Parent
  When you say “costly to replace”, this is with respect to what cost function? Do you have in mind the system’s original training objective, or something else?
  If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight on c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking “Find the most approved policy such that F(x) is below some threshold”—more generally, varying c will trace out the Pareto boundary between F and A.
  I was talking about “costly” in terms of computational resources. Like, of course if I have a system that gets the right answer in ¹⁄_100,000,000 cases, and I have a way to efficiently tell when it gets the right answer, then I can get it to always give me approximately always the right answer by just running it a billion times. But that will also take a billion times longer.
  In-practice, I expect most situations where you have the combination of “In one in a billion cases I get the right answer and it costs me $1 to compute an answer” and “I can tell when it gets the right answer”, you won’t get to a point where you can compute a right answer for anything close to $1.