(cross-posted from X/twitter)

The fact that https://sakana.ai/ai-scientist/ is already feasible (with basically non-x-risky systems: sub-ASL-3, and bad enough at situational awareness to be very unlikely to be scheming) has updated me significantly on the tractability of the alignment / control problem. More than ever, I expect it to be relatively tractable (if done competently and carefully) to safely and iteratively automate parts of AI safety research, all the way up to roughly human-level automated safety research (using LLM agents roughly shaped like the AI Scientist, with humans in the loop, and ideally within a control framework: https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled). Add some decent regulation (especially against uncontrolled intelligence explosions using AI scientists inside the largest AI labs) plus paretotopian coordination (https://aiprospects.substack.com/p/paretotopian-goal-alignment), perhaps especially at the international level, and I think we might have all the basic ingredients of a 'pivotal act' in the best sense of the term. (Easier said than done, though.)
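To make "LLM agents with humans in the loop, in a control framework" a bit more concrete, here is a minimal sketch of what such a loop could look like. Everything here (function names, the trusted monitor, the sandbox) is a hypothetical illustration of the general shape, not anyone's actual implementation:

```python
from dataclasses import dataclass

# All names below are hypothetical illustrations of a human-in-the-loop
# control setup around an AI-scientist-style agent, not any lab's actual API.

@dataclass
class ProposedExperiment:
    description: str          # natural-language summary for the human reviewer
    code: str                 # the experiment the agent wants to run
    est_compute_hours: float  # rough cost estimate, also shown to the reviewer

def untrusted_agent_propose(task: str) -> ProposedExperiment:
    """Stand-in for the (untrusted, more capable) LLM agent drafting the next experiment."""
    raise NotImplementedError("call your LLM agent here")

def trusted_monitor_flags(proposal: ProposedExperiment) -> list[str]:
    """Stand-in for a weaker, trusted model screening for red flags
    (e.g. attempts to exfiltrate weights, disable logging, or self-modify)."""
    raise NotImplementedError("call a trusted monitor model here")

def human_approves(proposal: ProposedExperiment, flags: list[str]) -> bool:
    """The human-in-the-loop gate: nothing runs without explicit sign-off."""
    print(proposal.description)
    print("Monitor flags:", flags or "none")
    return input("Approve running this in the sandbox? [y/N] ").strip().lower() == "y"

def run_in_sandbox(code: str) -> str:
    """Stand-in for execution inside an isolated, resource-limited environment."""
    raise NotImplementedError("sandboxed execution goes here")

def control_loop(task: str, max_iters: int = 10) -> None:
    for _ in range(max_iters):
        proposal = untrusted_agent_propose(task)
        flags = trusted_monitor_flags(proposal)
        if flags or not human_approves(proposal, flags):
            continue  # rejected proposals never touch real compute
        result = run_in_sandbox(proposal.code)
        task = f"{task}\nPrevious result:\n{result}"  # iterate with feedback
```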
I'd love to see at least several Sakana-like AI safety orgs focused on using LLM agents to automate / strongly augment various parts of safety research, and ideally putting those tools (technically, agents) in the hands of hundreds or thousands of AI safety researchers, including independent ones. I'm unsure whether the AI safety funding / field-building infrastructure is anywhere near ready for this level of potential scale-up, though. I expect quite a few safety research areas to be prime candidates for early automation / strong augmentation, especially the ones with relatively short and precise feedback loops (in both code and compute), e.g. unlearning, activation / representation engineering (and its applications), and large parts of LLM fine-tuning (see e.g. https://arxiv.org/abs/2406.08414 for a prototype).
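As an illustration of why something like activation / representation engineering has such short feedback loops (and so looks like a prime candidate for automation), here is a minimal steering-vector experiment; the model (gpt2), the layer choice, and the prompts are arbitrary assumptions for the sketch, not taken from the linked work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small open model, purely for illustration
LAYER = 6             # which transformer block to intervene on (a hyperparameter)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def residual_after_block(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token, after block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1][0, -1]

# Contrastive prompts define the direction to steer along (sentiment, here).
steering_vec = (residual_after_block("I love this, it's wonderful")
                - residual_after_block("I hate this, it's terrible"))

def make_hook(vec: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec  # add the steering vector at every position
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

prompt_ids = tok("The movie was", return_tensors="pt")

for scale in (0.0, 4.0):  # quick A/B: unsteered vs. steered generation
    handle = model.transformer.h[LAYER].register_forward_hook(make_hook(steering_vec, scale))
    with torch.no_grad():
        out_ids = model.generate(**prompt_ids, max_new_tokens=20,
                                 do_sample=False, pad_token_id=tok.eos_token_id)
    handle.remove()
    print(f"scale={scale}: {tok.decode(out_ids[0])}")
```

The whole propose / run / inspect cycle fits in one small script on a single GPU (or even CPU), which is exactly the kind of loop an AI-scientist-style agent could iterate on rapidly.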
I think currently approximately no one is working on the kind of safety research that, when scaled up, would actually help with aligning substantially smarter-than-human agents, so I am skeptical that the people at labs could automate that kind of work (given that they are basically doing none of it). I find myself frustrated with people talking about automating safety research, when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.
"when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years."
Can you give some examples of work which you do think represents progress?
My rough mental model is something like: LLMs are trained using very large-scale (multi-task) behavior cloning, so I expect various tasks to be increasingly automated / various data distributions to be successfully modeled, all the way up to (in the limit) roughly everything humans can do, as a lower bound; including e.g. the distribution of what tends to get posted on LessWrong / the Alignment Forum (and agent-foundations-like work); especially when LLMs are scaffolded into agents, given access to tools, etc. The caveats are that the LLM [agent] paradigm might not get there before it e.g. runs out of data / compute, or that the systems might be unsafe to use; but on both counts https://sakana.ai/ai-scientist/ was a particularly significant update for me, and 'iteratively automating parts of AI safety research' is also supposed to help with keeping the systems safe as they become increasingly powerful.
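For concreteness, "behavior cloning" here just means the standard next-token objective: maximum-likelihood imitation of whatever the (human-generated) data distribution does next. A toy, self-contained sketch (toy model and random token IDs standing in for tokenized human text):

```python
import torch
import torch.nn.functional as F

# Toy "LM" (embedding + linear head), just so the loss below is runnable end-to-end.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab_size: int = 100, d_model: int = 32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        self.head = torch.nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):  # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(token_ids))

def behavior_cloning_loss(model, token_ids):
    """Standard next-token prediction: maximize the likelihood of the token
    the (human) demonstrator actually produced next, across all tasks/domains
    mixed into the training data."""
    logits = model(token_ids)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T
        token_ids[:, 1:].reshape(-1),                 # the tokens actually written next
    )

model = ToyLM()
batch = torch.randint(0, 100, (4, 16))  # stand-in for tokenized human-written text
print(behavior_cloning_loss(model, batch))
```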