Well, even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don’t actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.
Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working.
So overall I’m not saying that language models can’t be generally intelligent, I’m saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).
This makes sense, but it seems like a fundamental difficulty of the alignment problem itself rather than a limitation of any particular system’s ability to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as, if not better than, we can. The problem is that it can’t get feedback from empirical reality about whether its ideas actually work, given the difficulties of testing alignment proposals, not that it can’t get feedback from another intelligent grader/assessor reasoning in a roughly a priori way.