habryka comments on What’s Up With Confusingly Pervasive Goal Directedness?

habryka 23 Jan 2022 19:29 UTC
2 points
This is true if all the other datapoints are entirely indistinguishable, and the only signal is “good” vs. “bad”. But in practice you would compare / rank the datapoints, and move towards the ones that are better.
Take the backflip example from the human preferences paper: if your only signal was “is this a successful backflip?”, then your argument would apply and it would be pretty hard to learn. But the signal is “is this more like a successful backflip than this other thing?” and this makes learning feasible.
More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
I should note that comparisons is only one reason why the situation isn’t as bad as you say. Another is that even with only non-approved data points to label, you could do things like label “which part” of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/non-approved, even if they don’t have explicit examples, simply by modeling the human approvers very accurately in general.
Well, sure, but that is changing the problem formulation quite a bit. It’s also not particularly obvious that it helps very much, though I do agree it helps. My guess is even with a rank-ordering, you won’t get the 33 bits out of the system in any reasonable amount of time at 10 hours evaluation cost. I do think if you can somehow give more mechanistic and detailed feedback, I feel more optimistic in situations like this, but also feel more pessimistic that we will actually figure out how to do that in situations like this.
More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”.
This sentence can of course be false without implying that the human preferences work is impossible, so there must be some confusion happening. I am not arguing that this is impossible for all problems, indeed ML has shown that this is indeed quite feasible for a lot of problems, but making the claim that it works for all of them is quite strong, but I also feel like it’s obvious enough that this is very hard or impossible for a large other class of problems (like, e.g. reversing hash functions), and so we shouldn’t assume that we can just do this for an arbitrary problem.
- jsteinhardt 23 Jan 2022 20:29 UTC
  4 points
  Parent
  I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”
  Hm, this isn’t the claim I intended to make. Both because it overemphasizes on “efficient” and because it adds a lot of “for all” statements.
  If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.
  - habryka 23 Jan 2022 23:21 UTC
    3 points
    Parent
    If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.
    I am a bit confused by what we mean by “of the sort you would come across in ML”. Like, this situation, where we are trying to derive an algorithm that solves problems without optimizers, from an algorithm that solves problems with optimizers, is that “the sort of problem you would come across in ML”?. It feels pretty different to me from most usual ML problems.
    I also feel like in ML it’s quite hard to actually do this in practice. Like, it’s very easy to tell whether a self-driving car AI has an accident, but not very easy to actually get it to not have any accidents. It’s very easy to tell whether an AI can produce a Harry Potter-level quality novel, but not very easy to get it to produce one. It’s very easy to tell if an AI has successfully hacked some computer system, but very hard to get it to actually do so. I feel like the vast majority of real-world problems we want to solve do not currently follow the rule of “if you can distinguish good answers you can find good answers”. Of course, success in ML has been for the few subproblems where this turned out to be easy, but clearly our prior should be on this not working out, given the vast majority of problems where this turned out to be hard.
    (Also, to be clear, I think you are making a good point here, and I am pretty genuinely confused for which kind of problems the thing you are saying does turn out to be true, and appreciate your thoughts here)