A fairer comparison would probably require actually trying hard to build the kind of scaffold that could use ~$10k in inference costs productively. I suspect the resulting agent would not do much better than one using $100 of inference, but it seems hard to be confident.
Also, there are notable researchers and companies working right now on developing 'a truly general way of scaling inference compute', and I think it would be prudent to consider what happens if they succeed.
(This also has implications for automating AI safety research.)
To spell it out more explicitly: the current way of scaling inference compute (CoT) seems pretty good against some of the most worrying threat models, which often depend on opaque model internals.