but there is no clear conversion between benchmark scores and the ability to take over the world, design nanotech, or [insert any other interesting capability].
I would be willing to bet a reasonable sum of money that “designing nanotech” is in the set of problems where it is possible to trade inference compute for training compute. It has the same shape as many problems where inference scaling works (for example, solving math problems). If you have some design criteria for a world-destroying nanobot, and you get to choose between training a better nanobot-designing AI and running your nanobot-designing AI for longer, you almost certainly want to do both. That is to say, finding a world-destroying-nanobot design feels very much like a classic search problem: you have some acceptance criteria, a design space, and a model that gives you a prior over which parts of the space to search first.
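To make the search framing concrete, here is a minimal best-first-search sketch in Python. Everything in it is hypothetical and illustrative: `score` stands in for the model's prior, `expand` for the design space, `accepts` for the acceptance criteria, and `budget` for the inference-compute knob.

```python
import heapq

def guided_search(initial, score, expand, accepts, budget):
    """Best-first search: 'score' is the model's prior over which designs
    look promising, 'accepts' is the acceptance criterion, and 'budget'
    caps how many candidates we evaluate (the inference-compute knob)."""
    # Max-heap via negated scores: explore the most promising designs first.
    frontier = [(-score(c), c) for c in initial]
    heapq.heapify(frontier)
    evaluated = 0
    while frontier and evaluated < budget:
        _, candidate = heapq.heappop(frontier)
        evaluated += 1
        if accepts(candidate):           # design meets the criteria
            return candidate, evaluated
        for child in expand(candidate):  # neighboring points in design space
            heapq.heappush(frontier, (-score(child), child))
    return None, evaluated               # budget exhausted without a hit

# Toy stand-in for a real design problem: find a bitstring with three 1s.
accepts = lambda s: s.count("1") == 3
score   = lambda s: s.count("1")        # crude "model": prefer more 1s
expand  = lambda s: [s + "0", s + "1"] if len(s) < 6 else []
print(guided_search([""], score, expand, accepts, budget=50))  # ('111', 4)
```

The training/inference trade falls out directly: a better-trained model sharpens `score`, so the search hits an acceptable design in fewer evaluations, while a larger `budget` lets a weaker prior compensate by searching more of the space.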