Thanks for the detailed response. I think we agree about most of the things that matter; as for the rest:
About the loss function for next-word prediction: my point was that I'm not sure whether current GPT is already superhuman even on the kind of prediction we care about. It may be wrong less often, but wrong in ways that we count as more important. I agree that changing to a better loss would not make it significantly harder to learn, nor make it any more like intelligence, etc.
About solving discrete representations with an architectural change: I think I meant only that the representation is easy, not the training. In any case, I agree that training it may be hard, or at least require non-standard methods.
About the inductive logic and describing pictures at low resolution: I made the same communication mistake in both, which was to treat things that are extremely heavily regularised against as not being part of the hypothesis space at all. There probably is a logical formula that describes the probability of a given image being a cat, to any degree of precision. My claim is that we will never be able to find or represent that formula, because it is so heavily regularised against, and that this is the price the theory forces us to pay for generalisation.
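To make the "regularised out of reach" point concrete, here is a tiny sketch (the Solomonoff-style simplicity prior and the description lengths are just my illustration, not anything from the thread): a hypothesis's weight falls off exponentially with its description length, so an astronomically long exact formula is effectively excluded even though it sits in the hypothesis space in principle.

```python
import math

def log10_prior_weight(description_length_bits: float) -> float:
    """Simplicity prior: weight proportional to 2**(-description length).
    Returned as log10 so the tiny values don't underflow to zero."""
    return -description_length_bits * math.log10(2.0)

# Made-up description lengths, purely for illustration.
approx_cat_detector_bits = 1e4   # a short, lossy "looks like a cat" rule
exact_cat_formula_bits = 1e9     # a hypothetical exact formula, precise at every resolution

print(log10_prior_weight(approx_cat_detector_bits))  # about -3.0e3: already minuscule
print(log10_prior_weight(exact_cat_formula_bits))    # about -3.0e8: never found, never represented
```

On that picture, the exact formula is not literally outside the hypothesis space; it just carries so little weight that no realistic search or representation will ever reach it, which is exactly the price paid for generalisation.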