Success at currently-researched generalized inference scaling might jeopardize some of the fundamental assumptions of current regulatory frameworks.
the o1 results have illustrated inference scaling laws for model capabilities in some specialized domains (e.g. math); notably, these don't seem to hold across all domains, e.g. o1 doesn't seem better than GPT-4o at writing;
there’s ongoing work at OpenAI to make generalized inference scaling work;
e.g. this could perhaps (though maybe somewhat overambitiously) be framed, in the language of https://epochai.org/blog/trading-off-compute-in-training-and-inference, as there no longer being an upper bound on how many OOMs of inference compute can be traded for equivalent OOMs of pretraining compute (see the first sketch below);
to the best of my awareness, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and perhaps fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared for translating between indefinite amounts of inference compute and pretraining compute (i.e. capabilities);
this would probably be less worrying in the case of models behind APIs, since it should be feasible to monitor for misuse (e.g. CBRN), especially since current inference scaling methods (e.g. CoT) seem relatively transparent; but this seems different for open-weights or leaked models;
notably, it seems like Llama-3 405B can be run on one's own hardware (bypassing any API) with only 8 good GPUs, apparently costing <$100k; such relatively low costs could empower a lot of bad actors, especially since guardrails are easy to remove and potentially more robust methods like unlearning still seem mostly in the research phase (see the second sketch below);
this would also have consequences for (red-teaming) evals; if running a model for longer can yield more (misuse) uplift, it becomes much harder to upper-bound that uplift;
also, once a model’s weights are out, it might be hard to upper-bound all the potential inference-time uplift that future fine-tunes might provide; e.g. suppose it became possible to fine-tune GPT-4o into something even better than o1 that could do almost boundless inference scaling; somebody might then apply the same fine-tuning to Llama-3 405B and release that model open-weights, and others might then be able to misuse it in very dangerous ways using a lot of inference compute (but still comparatively little vs. pretraining compute);
OTOH, I do buy the claims from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/ that, until now, open-weights releases have probably been net beneficial for safety, and I expect this could easily keep holding in the near term (especially as long as open-weights models stay somewhat behind the SOTA); so I'm conflicted about what should be done, especially given the uncertainty around the probability of making generalized inference scaling work and around how much more dangerous it could make current or future open-weights releases; one potentially robust course of action might be to keep doing research on unlearning methods, especially ones robust to fine-tuning (like in Tamper-Resistant Safeguards for Open-Weight LLMs), which might be sufficient to make misuse (especially in some specialized domains, e.g. CBRN) as costly as training from scratch.
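To make the trade-off framing above concrete, here's a minimal toy sketch; the functional form, exchange_rate, and max_tradable_oom are illustrative assumptions of mine, not figures from the Epoch post or from any regulation. It just shows how a capped ("specialized") vs. uncapped ("generalized") trade-off would translate extra inference compute into pretraining-equivalent compute, and why a rule keyed purely to training FLOP never registers the difference:

```python
# Toy sketch only: the functional form, exchange_rate, and max_tradable_oom
# are illustrative assumptions, not figures from the Epoch post or any regulation.

def training_equivalent_flop(train_flop, extra_inference_oom,
                             exchange_rate=0.5, max_tradable_oom=2.0):
    """Pretraining-FLOP-equivalent of a model after spending extra inference compute.

    train_flop          -- actual pretraining compute (FLOP)
    extra_inference_oom -- extra OOMs of inference compute per task vs. a baseline
    exchange_rate       -- assumed OOMs of pretraining "bought" per OOM of inference
    max_tradable_oom    -- cap on tradable inference OOMs; None models the
                           hypothetical generalized-inference-scaling case (no cap)
    """
    if max_tradable_oom is not None:
        extra_inference_oom = min(extra_inference_oom, max_tradable_oom)
    return train_flop * 10 ** (exchange_rate * extra_inference_oom)

THRESHOLD_FLOP = 1e26   # e.g. the Executive Order / SB 1047 training-compute figure
base_train_flop = 5e25  # a model trained below the threshold, so never "covered"

for oom in [1, 2, 4, 8]:
    capped = training_equivalent_flop(base_train_flop, oom)                           # specialized
    uncapped = training_equivalent_flop(base_train_flop, oom, max_tradable_oom=None)  # generalized
    print(f"+{oom} OOM inference: capped ~{capped:.1e} FLOP-equivalent, "
          f"uncapped ~{uncapped:.1e} (the threshold only ever sees {base_train_flop:.0e})")
```

Under the capped version the FLOP-equivalent plateaus (here at ~5e26), while under the uncapped version it keeps growing with inference spend; a rule that only looks at the 5e25 of training compute treats both identically.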
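And a rough back-of-the-envelope check of the "8 good GPUs" claim above, under the assumptions of 80 GB-class cards and weights-only memory (ignoring KV cache and activation overhead):

```python
# Rough arithmetic only: 80 GB per GPU and weights-only memory are assumptions;
# KV cache and activation memory are ignored.

PARAMS = 405e9   # Llama-3 405B parameter count
GPU_MEM_GB = 80  # an 80 GB-class accelerator
NUM_GPUS = 8
total_gb = GPU_MEM_GB * NUM_GPUS  # 640 GB across the node

for precision, bytes_per_param in [("fp16/bf16", 2), ("int8/fp8", 1), ("int4", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    verdict = "fits" if weights_gb <= total_gb else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {total_gb:.0f} GB")

# fp16/bf16: ~810 GB -> does not fit; int8/fp8: ~405 GB -> fits; int4: ~202 GB -> fits
```

So the weights alone rule out plain fp16 on a single 8x80 GB node, but 8-bit (or lower) quantization makes single-node local inference straightforward, which is roughly the regime the <$100k figure points at.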
For similar reasons to the discussion here about why individuals and small businesses might be expected to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly this is also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs.