"Evaluating alignment research is much easier than doing it."
"Alignment research will only require narrow AI."
Nice overview of the different takeoff scenarios!
I am no researcher in this area, and I know I might be wrong about many of the things that follow, but I have doubts about the two statements above.
Evaluating alignment is still manageable right now, while we are still at least somewhat smarter than the AI. However, I do not see a viable path to evaluating the true level of an AI's capabilities once it is smarter than us. Past that point, the only way to probe how smart the model is would be to ask it questions we do not know the answers to, but by definition we also do not know how smart you have to be to answer those questions. Is solving the Riemann hypothesis just outside our grasp, or does it require 1000x our intelligence? We can't reliably say.
I might be wrong, and perhaps there is some science or theory that does exactly this, but I do not know of one.
And the same is true for alignment. Once the AI is smarter than us, we cannot assume that our tests of the model's output still work. Considering that even right now our tests are seemingly not very good (at least according to the YouTube video from AI Explained) and we did not notice for this long, I do not think we can keep relying on the questionnaires we use today, since the model might behave differently once it notices it is being tested. And it might notice it is being tested from the very first question we ask.
This means that evaluating alignment research is in fact also incredibly hard. We either need to outwit a smarter entity or directly interpret what happens inside the model, and knowing that we missed no cases in that process is harder than devising a test that covers most cases.
The second statement I personally wonder about a bit. On the one hand, it might be possible to use many highly specialized AIs, one for every field. But that approach struggles with the connections between fields: with a chemistry AI and a biology AI we might not fully cover biochemistry, and with a biochemistry AI we might not fully cover medicine. Then there is food, and once we get to food we also need to watch physics, such as radiation, or materials science.
And on top of all that we might still want certain things, like letting it write a book about Oppenheimer and how he built the nuclear bomb, so it also needs to be able to look at context from an artistic standpoint. It needs to be able to evaluate whether something is wrong on purpose because of the context, such as in a book, or whether it was a serious attempt that simply got it wrong this time.
I feel like this can only be achieved with another general AI, not with a narrow AI that is much narrower than the AI being tested. Otherwise the tested AI might simply get away with a wider "scheme".
Another small note on the claim that lower models will evaluate higher models: if the current trend of aligned AIs being less capable than unaligned AIs continues, this is a bad idea. You showed a lot of linear curves here, but the y-axis should be logarithmic in terms of capabilities. That means the distance between GPT-5 and GPT-6 might be around 10x or in a similar region, especially if the smarter model has not yet been aligned while the other model has already been reined in.
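To make the arithmetic behind that concrete, here is a minimal toy sketch with entirely made-up numbers: it assumes capability grows by a constant factor per generation (so generations are equally spaced on a log axis) and that aligning a model costs it some fraction of its capability. The growth factor and the "alignment tax" are assumptions for illustration only, not measured values.

```python
# Toy illustration (made-up numbers): how large the gap between an aligned
# evaluator and a newer, not-yet-aligned model could be if capability grows
# multiplicatively per generation.

GROWTH_PER_GENERATION = 10.0   # assumed: each generation is ~10x more capable
ALIGNMENT_TAX = 0.3            # assumed: alignment costs ~30% of capability

def capability(generation: int) -> float:
    """Capability on an arbitrary scale, growing by a constant factor per generation."""
    return GROWTH_PER_GENERATION ** generation

def evaluator_gap(evaluator_gen: int, evaluated_gen: int) -> float:
    """Ratio of the unaligned evaluated model's capability to the aligned evaluator's."""
    aligned_evaluator = capability(evaluator_gen) * (1 - ALIGNMENT_TAX)
    unaligned_evaluated = capability(evaluated_gen)
    return unaligned_evaluated / aligned_evaluator

# An aligned "generation 5" model evaluating an unaligned "generation 6" model:
print(evaluator_gap(5, 6))  # ~14.3x under these assumed numbers
```

On a plot with a logarithmic y-axis those generations look evenly spaced, which is why a linear-looking curve can hide a roughly 10x jump between them.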
As explained earlier, external testing of the model by a less intelligent entity becomes almost impossible in my opinion. I am unsure how much a fine-tuned version might be able to close the gap, but my other point also suggests that fine-tuning will only get us so far, since we can't narrow it down too much. For anything better than AGI (with AGI meaning as smart as the best human experts at every task), we very likely need to fully understand what happens inside the model in order to align it. But I really hope this is not the case, as I do not see people pausing long enough to seriously put in that effort.