“human-level question answering is believed to be AI-complete”—I doubt that. I think that [...]
Yes, these are also good points. Human-level question answering is often listed as a candidate for an AI-complete problem, but there are of course people who disagree. I’m inclined to say that question-answering probably is AI-complete, but that belief is not very strongly held. In your example of the painter: you could still convey a low-resolution version of the image as a grid of flat colours (similar to how images are represented in computers), and tell the painter to first paint that out, and then paint another version of what the grid image depicts.
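The grid-of-flat-colours encoding can be made concrete with a short sketch. The image format (a nested list of grayscale values) and the block size are illustrative assumptions, not anything specified in the discussion above:

```python
# Sketch of the "grid of flat colours" idea: downsample an image by
# averaging non-overlapping blocks, so it can be described cell by cell
# to the painter. Image representation and block size are made up here.

def to_colour_grid(image, block):
    """Average each block x block patch of a 2D image of numbers."""
    rows, cols = len(image), len(image[0])
    grid = []
    for r in range(0, rows, block):
        row = []
        for c in range(0, cols, block):
            patch = [image[i][j]
                     for i in range(r, min(r + block, rows))
                     for j in range(c, min(c + block, cols))]
            row.append(sum(patch) / len(patch))
        grid.append(row)
    return grid

image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [255, 255, 0, 0],
         [255, 255, 0, 0]]
grid = to_colour_grid(image, 2)  # a 2x2 grid of flat colours
```

Each cell of `grid` is one flat colour the painter would block out before painting the refined version.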
We don’t care that much about specific phrasing, and instead use a “loss” measuring how much the content makes sense, is true, is useful...
Yes, I agree. Humans are certainly better than the GPTs at producing “representative” text, rather than text that is likely on a word-by-word basis. My point there was just to show that “reaching human-level performance on next-token prediction” does not correspond to human-level intelligence (and has already been reached).
Memorisation & generalisation—just noting that it is a spectrum rather than a dichotomy, just as compression ratios are. Anyway, the current methods don’t seem to generalise well enough to overcome the sparsity of public data in some domains—which may be the main bottleneck in (e.g.) RL anyway.
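The compression-ratio analogy can be shown directly: ratios fall on a continuum from near zero (pure regularity, fully "generalised" away) to about one (incompressible noise, which can only be stored verbatim). A small sketch, with the particular byte strings chosen only for illustration:

```python
import random
import zlib

def ratio(data: bytes) -> float:
    """Compressed size over original size; lower means more structure was found."""
    return len(zlib.compress(data)) / len(data)

rng = random.Random(0)
structured = b"abcd" * 2500                    # pure regularity
noise = rng.randbytes(10000)                   # no structure to exploit
mixed = b"abcd" * 1250 + rng.randbytes(5000)   # partly regular, partly not
# The three ratios land at different points of a spectrum,
# not in two cleanly separated classes.
```

The `mixed` case is the point of the analogy: most real data sits somewhere between the two extremes, just as most learned representations sit between memorisation and generalisation.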
I agree.
Let’s spell out the obvious objection—it is obviously possible to implement discrete representations on top of continuous ones. This is why we can have digital computers that are based on electrical currents rather than little rocks. The problem is just that keeping them robustly discrete is hard, and probably very hard to learn.
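The digital-computer point can be illustrated with a toy simulation: a bit stored on a noisy continuous carrier stays exactly discrete if it is re-thresholded every step (as a logic gate does), and random-walks away if it is not. The noise level and step count are arbitrary choices for the sketch:

```python
import random

def step(value: float, rng: random.Random, restore: bool) -> float:
    """One time step on a noisy analog carrier, optionally re-discretised."""
    value += rng.gauss(0, 0.05)              # continuous drift
    if restore:
        # Snap back to a clean logic level, the way a digital gate does.
        value = 1.0 if value >= 0.5 else 0.0
    return value

rng = random.Random(0)
restored = drifting = 1.0                    # the same stored bit, two regimes
for _ in range(1000):
    restored = step(restored, rng, restore=True)
    drifting = step(drifting, rng, restore=False)
# restored is still exactly 1.0; drifting has wandered off to some analog value.
```

The hard part the comment points at is exactly the `restore` branch: the snapping nonlinearity is trivial to write down by hand but has no obvious smooth, gradient-friendly counterpart.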
Of course. The main question is if it is at all possible to actually learn these representations in a reasonable way. The main benefit from these kinds of representations would come from a much better ability to generalise, and this is only useful if they are also reasonably easy to learn. Consider my example with an MLP learning an identity function—it can learn it, but it is by “memorising” it rather than “actually learning” it. For AGI, we would need a system that can learn combinatorial representations quickly, rather than learn them in the way that an MLP learns an identity function.
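The MLP-and-identity example can be reproduced in a few lines of NumPy: a small tanh network fitted to y = x on [-1, 1] drives its training loss down, but because the hidden units saturate, it has "memorised" the training range rather than learned the rule, and fails far outside it. Architecture and hyperparameters here are illustrative choices, not anything from the discussion:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 64).reshape(-1, 1)   # training inputs
y = x.copy()                                # identity targets

# One hidden tanh layer: it can fit y = x on [-1, 1], but saturating
# units cannot extrapolate the straight line beyond the training range.
W1 = rng.normal(0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)

def forward(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

lr = 0.05
initial_loss = float(np.mean((forward(x) - y) ** 2))
for _ in range(2000):
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2
    g = 2 * (out - y) / len(x)              # dLoss/dout
    gW2 = h.T @ g; gb2 = g.sum(0)
    gh = g @ W2.T * (1 - h ** 2)            # backprop through tanh
    gW1 = x.T @ gh; gb1 = gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
final_loss = float(np.mean((forward(x) - y) ** 2))
# In-range fit improves, but at x = 10 the saturated network
# outputs something far from 10 instead of continuing y = x.
```

A system with a genuinely combinatorial representation of "copy the input" would generalise to any x; the MLP only interpolates the examples it saw.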
I think that problem may be solved easily with minor changes to the architecture though, and therefore should not affect timelines.
Maybe, that remains to be seen. My impression is that the most senior AI researchers (Yoshua Bengio, Yann LeCun, Stuart Russell, etc) lean in the other direction (but I could be wrong about this). As I said, I feel a bit confused/uncertain about the force of the LoT argument.
Inductive logic programming—it generalises well in a much more restricted hypothesis space, as one should expect based on learning theory.
To me, it is not at all obvious that ILP systems have a more restricted hypothesis space than deep learning systems. If anything, I would expect it to be the other way around (though this of course depends on the particular system—I have mainly used metagol). Rather, the way I think of it is that ILP systems have a much stronger simplicity bias than deep learning systems, and that this is the main reason for why they can generalise better from small amounts of data (and the reason they don’t work well in practice is that this training method is too expensive for more large-scale problems).
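The distinction between a restricted hypothesis space and a strong simplicity bias can be made concrete with a toy learner (a deliberately crude sketch in the spirit of ILP-style search, not how Metagol actually works): enumerate candidate formulas in order of size and return the first one consistent with the examples. The candidate list and examples are invented for illustration:

```python
# Candidate boolean formulas over inputs a, b, ordered roughly by size.
# Returning the FIRST consistent formula is an explicit simplicity bias:
# a handful of examples suffices to pin down a small hypothesis.
CANDIDATES = [
    "False", "True", "a", "b", "not a", "not b",
    "a and b", "a or b", "a != b", "a == b",
]

def learn(examples):
    """examples: list of ((a, b), label) pairs; returns the smallest
    consistent formula, or None if nothing in the list fits."""
    for formula in CANDIDATES:
        if all(eval(formula, {"a": a, "b": b}) == label
               for (a, b), label in examples):
            return formula
    return None

# Four examples of AND already identify the smallest consistent hypothesis.
examples = [((False, False), False), ((False, True), False),
            ((True, False), False), ((True, True), True)]
```

The hypothesis space here is whatever the enumeration can express; what makes the learner data-efficient is the ordering, i.e. the simplicity bias—and what makes this approach expensive at scale is exactly that exhaustive search.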
Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest:
About the loss function for next-word prediction—my point was that I’m not sure whether the current GPT is already superhuman even on the kind of prediction that we care about. It may be wrong less often, but in ways that we count as more important. I agree that changing to a better loss would not make the task significantly harder to learn, nor make it any more the same as intelligence, etc.
About solving discrete representations with architectural changes—I think I meant only that the representation is easy, not the training; but anyway, I agree that training it may be hard, or at least require non-standard methods.
About the inductive logic and describing pictures in low resolution: I made the same communication mistake in both, which is to treat things that are extremely heavily regularised against as not being part of the hypothesis space at all. There probably is a logical formula that describes the probability of a given image being a cat, to any degree of precision. I claim that we will never be able to find or represent that formula, because it is so strongly regularised against—and that this is the price the theory forces us to pay for the generalisation.