I’ve tried speaking with a few teams doing AI safety work, including:
• an assistant professor leading an alignment research group at a top university who is starting a new AI safety org
• an Anthropic independent contractor who has coauthored papers with the alignment science team
• a senior manager at NVIDIA working on LLM safety (NeMo-Aligner/NeMo-Guardrails)
• the leader of a lab doing interoperability between EU/Canada AI standards
• an AI policy fellow at the US Senate working on biotech strategies
• the executive director of an AI safety coworking space who has been running weekly meetups for ~2.5 years
• a startup founder in stealth who asked not to share details with anyone outside CAISI
• a Chemistry Olympiad gold medalist working on a dangerous capabilities evals project for o3
• a MATS alumnus working on jailbreak mitigation at an AI safety & security org
• an AI safety research lead running a mech interp reading group and interning at EleutherAI
Some random brief thoughts:
• CAISI’s focus seems to be on stuff other than x-risks (e.g., misinformation, healthcare, privacy).
• I’m afraid of being too unfiltered and causing offence.
• Some of the statements made in the interviews are bizarrely devoid of content, such as:
“AI safety work is not only a necessity to protect our social advances, but also essential for AI itself to remain a meaningful technology.”
• Others seem to be false as stated, such as:
“our research on privacy-preserving AI led us to research machine unlearning — how to remove data from AI systems — which is now an essential consideration for deploying large-scale AI systems like chatbots.”
• (I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
• The UK AISI research agendas seemed a lot more coherent, with better-developed proposals and theories of impact.
• They’re only recruiting for 3 positions for a research council that meets once a month?
• CAD 27m of CAISI’s initial funding is ~15% of the UK AISI’s GBP 100m initial funding, but more than the US AISI’s initial funding (USD 10m).
• Another source says CAD 50m, but that’s distributed over 5 years compared to a $2.4b budget for AI in general, so about 2% of the AI budget goes to safety? (Back-of-envelope arithmetic below.)
• I was looking for scientific advancements that would be relevant at the national scale. I read through every page of Anthropic/Redwood’s alignment faking paper, which is considered the best empirical alignment research paper of 2024, but it was a firehose of info and I don’t have clear recommendations that can be put into a slide deck.
• Instead of learning more about what other people were doing at a shallow level, it might’ve been more beneficial to focus on my own research questions or practice training-project-relevant skills.
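For what it’s worth, here is the back-of-envelope arithmetic behind those two percentages; the CAD→GBP rate (~0.57) is my own rough assumption, not a figure from any of the sources:

```python
# Rough funding comparison from the bullets above; the exchange rate is an assumption.
CAD_TO_GBP = 0.57  # approximate, fluctuates

caisi_initial_cad = 27e6      # CAD 27m CAISI initial funding
uk_aisi_initial_gbp = 100e6   # GBP 100m UK AISI initial funding
print(f"CAISI vs UK AISI initial funding: {caisi_initial_cad * CAD_TO_GBP / uk_aisi_initial_gbp:.0%}")  # ~15%

caisi_5yr_cad = 50e6          # alternative figure: CAD 50m over 5 years
ai_budget_cad = 2.4e9         # overall $2.4b AI budget
print(f"Safety share of the AI budget: {caisi_5yr_cad / ai_budget_cad:.1%}")  # ~2.1%
```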
(I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
Why do you think this? Is there specific research you have in mind? Some kind of reference would be nice.

In the general case, it seems to me that unlearning matters because knowing how to effectively remove something from a model is just the flip side of understanding how to instill values. Although it isn’t the primary goal of unlearning, work on how to ‘remove’ should equally benefit attempts to ‘instill’ robust values in the model. If fine-tuning for value alignment just patches over ‘bad facts’ with ‘good facts’, any ‘aligned’ model will be less robust than one with the harmful knowledge properly removed. If the alignment faking paper and peripheral alignment research are important at a meta level, then perhaps unlearning will be important because it can tell us something about ‘how deep’ our value installation really is, at an atomic scale. The lack of current practical use isn’t really important; we should be able to develop theory that tells us something important about model internals. I think there is a lot of very interesting mech-interp-of-unlearning work waiting to be done that can help us here.
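To make the ‘how deep’ question concrete, here is a minimal sketch of the kind of check I have in mind (my own illustration, not from any particular paper): train a linear probe on hidden activations and see whether the supposedly removed topic is still decodable after unlearning. The model name, layer choice, and toy prompts are all placeholders.

```python
# Minimal sketch of a "how deep did the unlearning go?" probe. Model, layer, and
# prompts are placeholders; a real evaluation would use many prompts and a
# held-out split rather than this toy set.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in; in practice, the model before/after unlearning
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

def pooled_hidden(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state at one layer for a single prompt."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Prompts that touch the "forgotten" topic vs. unrelated controls (placeholders).
topic = ["synthesis route for the restricted compound", "how to culture the pathogen"]
control = ["a recipe for sourdough bread", "how to tune a guitar"]
X = torch.stack([pooled_hidden(t) for t in topic + control]).numpy()
y = [1] * len(topic) + [0] * len(control)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))
# Idea: if a probe still separates topic from control on the post-unlearning
# model about as well as on the original, the knowledge was likely suppressed
# at the output rather than removed from the internals.
```

The interesting version is of course a before/after comparison across layers and on held-out prompts, which is where the mech-interp angle comes in.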
I’m not sure all/most unlearning work is useless, but it seems like it suffers from a “use case” problem.
When is it better to attempt unlearning rather than censor the bad info before training on it?
It seems to me like there is a very narrow window where you have already created a model, but then got new information about what sort of information would be bad for the model to know, and now need to fix the model before deploying it.
Why not just be more reasonable and cautious about filtering the training data in the first place?
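To be concrete about what filtering up front could look like, here is a toy sketch (purely illustrative; real pipelines use trained classifiers, deduplication, and so on, and the blocked patterns below are placeholders):

```python
# Toy sketch of "filter before training": drop documents matching a blocklist.
# Illustrative only; the patterns are placeholders, and real data pipelines rely
# on trained classifiers rather than regexes.
import re

BLOCK_PATTERNS = [r"\bnerve agent synthesis\b", r"\benrichment cascade\b"]
BLOCK_RE = re.compile("|".join(BLOCK_PATTERNS), flags=re.IGNORECASE)

def filter_corpus(docs):
    """Keep only documents that match none of the blocked patterns."""
    return [d for d in docs if not BLOCK_RE.search(d)]

corpus = [
    "A history of 20th-century chemistry prizes.",
    "Detailed notes on nerve agent synthesis.",  # dropped by the filter
]
print(filter_corpus(corpus))
```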