Software Engineer (formerly) at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
Thanks for sharing your notes, Daniel!
I’m unable to open the Google Docs file in the third link.
I’ve tried speaking with a few teams doing AI safety work, including:
• assistant professor leading an alignment research group at a top university who is starting a new AI safety org
• Anthropic independent contractor who has coauthored papers with the alignment science team
• senior manager at NVIDIA working on LLM safety (NeMo-Aligner/NeMo-Guardrails)
• leader of a lab doing interoperability between EU/Canada AI standards
• AI policy fellow at the US Senate working on biotech strategies
• executive director of an AI safety coworking space who has been running weekly meetups for ~2.5 years
• startup founder in stealth who asked me not to share details with anyone outside CAISI
• Chemistry Olympiad gold medalist working on a dangerous capabilities evals project for o3
• MATS alumnus working on jailbreak mitigation at an AI safety & security org
• AI safety research lead running a mech interp reading group and interning at EleutherAI
Some random brief thoughts:
• CAISI’s focus seems to be on stuff other than x-risks (e.g., misinformation, healthcare, privacy).
• I’m afraid of being too unfiltered and causing offence.
• Some of the statements made in the interviews are bizarrely devoid of content, such as: “AI safety work is not only a necessity to protect our social advances, but also essential for AI itself to remain a meaningful technology.”
• Others seem to be false as stated, such as:
“our research on privacy-preserving AI led us to research machine unlearning — how to remove data from AI systems — which is now an essential consideration for deploying large-scale AI systems like chatbots.”
• (I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
• The UK AISI research agendas seemed a lot more coherent with better developed proposals and theories of impact.
• They’re only recruiting for 3 positions for a research council that meets once a month?
• CAISI’s initial funding of CAD 27m is ~15% of the UK AISI’s GBP 100m initial funding, but more than the U.S. AISI’s initial funding (USD 10m).
• Another source says $50m CAD, but that’s distributed over 5 years compared to a $2.4b budget for AI in general, so about 2% of the AI budget goes to safety?
• I was looking for scientific advancements that would be relevant at the national scale. I read through every page of Anthropic/Redwood’s alignment faking paper, which is considered the best empirical alignment research paper of 2024, but it was a firehose of information and I don’t have clear recommendations that can be put into a slide deck.
• Instead of learning at a shallow level about what other people were doing, it might’ve been more beneficial to focus on my own research questions or to practice skills relevant to a training project.
Wow, point #1 resulted in a big update for me. I had never thought about it that way, but it makes a lot of sense. Kudos!
Ilya Sutskever had two armed bodyguards with him at NeurIPS
I don’t understand how Ilya hiring personal security counts as evidence, especially at large events like a conference. Famous people often attract unwelcome attention, and having professional protection close by can help de-escalate or deter random acts of violence; it is a worthwhile investment in safety if you can afford it. I see it as a very normal thing to do. Ilya would have been vulnerable to potential assassination attempts even during his tenure at OpenAI.
(responding only to the first point)
It is possible to do experiments more efficiently in a lab because you have privileged access to top researchers whose bandwidth is otherwise very constrained. If you ask for help in Slack, the quality of responses tends to be comparable to teams outside labs, but the speed is often faster because the hiring process selects strongly for speed. It can be hard to coordinate busy schedules, but if you have a collaborator’s attention, what they say will make sense and be helpful. People at labs tend to be unusually good communicators, so it is easier to understand what they mean during meetings, whiteboard sessions, or 1:1s. This is unfortunately not universal amongst engineers. It’s also rarer for projects to be managed in an unfocused way leading to them fizzling out without adding value, and feedback usually leads to improvement rather than deadlock over disagreements.
Also, lab culture in general benefits from high levels of executive function. For instance, when a teammate says they spent an hour working on a document, you can be confident that progress has been made even if not all changes pass review. It’s less likely that they suffered from writer’s block or got distracted by a lower-priority task. Some of these factors also apply at well-run startups, but they don’t have the same branding, and it’d be difficult for a startup to e.g. line up four reviewers of this calibre: https://assets.anthropic.com/m/24c8d0a3a7d0a1f1/original/Alignment-Faking-in-Large-Language-Models-reviews.pdf.
I agree that (without loss of generality) the internal RL code isn’t going to blow open-source repos out of the water, and if you want to iterate on a figure or plot, that’s the same amount of work no matter where you are, even if you have experienced people helping you make better decisions. But you’re missing that lab infra doesn’t just let you run bigger experiments, it also lets you run more small experiments, because compute per researcher at labs is quite high by non-lab standards. When I was at Microsoft, it wasn’t uncommon for some teams to have the equivalent of roughly 2 V100s, which is less than what students can rent from Vast.ai or RunPod for personal experiments.
Thread: Research Chat with Canadian AI Safety Institute Leadership
I’m scheduled to meet https://cifar.ca/bios/elissa-strome/ from Canada’s AISI for 30 mins on Jan 14 at the CIFAR office in MaRS.
My plan is to share alignment/interp research I’m excited about, then mention upcoming AI safety orgs and fellowships which may be good to invest in or collaborate with.
So far, I’ve asked for feedback and advice in a few Slack channels. I thought it may be valuable to get public comments or questions from people here as well.
Previously, Canada invested $240m into a capabilities startup: https://www.canada.ca/en/department-finance/news/2024/12/deputy-prime-minister-announces-240-million-for-cohere-to-scale-up-ai-compute-capacity.html. If your org has some presence in Toronto or Montreal, I’d love to have permission to give it a shoutout!
Elissa is the lady on the left in the second image from this article: https://cifar.ca/cifarnews/2024/12/12/nicolas-papernot-and-catherine-regis-appointed-co-directors-of-the-caisi-research-program-at-cifar/.
My input is of negligible weight, so I wish to coordinate messaging with others.
Just making sure: if instead the box tells you the truth with probability 1 − 2^-100 (≈ 0.9999999999999999999999999999992), and gives a random answer of “warmer” or “colder” with the remaining 2^-100, then for a billion-dollar prize it’s worth paying $1 for the box?
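(For concreteness, the rough bound I have in mind, assuming the billion-dollar prize from your setup: a wrong answer from the box costs at most the prize, so per query the expected loss from the box occasionally lying is negligible next to its $1 price.)

```latex
% Assumed prize of $10^9$; the box errs with probability $2^{-100}$.
\mathbb{E}[\text{loss per query}] \;\le\; 2^{-100} \cdot 10^9 \;\approx\; 7.9 \times 10^{-22}\ \text{dollars} \;\ll\; \$1
```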
Logan’s feedback on a draft I sent him ~a year ago was very helpful.
I like reading fiction. There should be more of it on the site.
If k is even, then k^x is even, because k = 2n for some integer n, and we know (2n)^x is even. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai are below, followed by a sketch of the query loop. Model is gpt-3.5-turbo, temperature is 0.7.
Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
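Here’s a minimal sketch of the query loop I mean, assuming the OpenAI Python client (the linked repo is JavaScript, and the exact prompt wording below is my own approximation rather than the repo’s):

```python
# Rough Python equivalent of the query loop in the linked repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def js_style(x: float) -> str:
    """Roughly mimic how JavaScript prints these numbers."""
    if x == float("inf"):
        return "Infinity"
    if x < 1e21 and x == int(x):
        return str(int(x))
    return repr(x)


def llm_is_odd(n: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user",
                   "content": f"Is {n} odd? Reply with only 'true' or 'false'."}],
    )
    return response.choices[0].message.content.strip()


# Repeated squaring: 50000000, 2.5e15, 6.25e30, ... until the float overflows
# to Infinity, which reproduces the sequence of questions above.
n = 50_000_000.0
while True:
    print(f"Is {js_style(n)} odd? {llm_is_odd(js_style(n))}")
    if n == float("inf"):
        break
    n *= n
```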
If a model isn’t allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
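As a point of reference, the rule such a circuit would be implementing is just the least significant bit of the binary representation; a tiny illustration in ordinary Python (a description of the arithmetic, not a claim about the model’s internals):

```python
def parity_by_last_bit(n: int) -> bool:
    # An integer is odd iff the least significant bit of its binary form is 1.
    return bin(n)[-1] == "1"  # equivalently: (n & 1) == 1

assert parity_by_last_bit(2500000000000000) is False
assert parity_by_last_bit(2500000000000001) is True
```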
The dimensionality of the residual stream is the sequence length (in tokens) * the embedding dimension of the tokens. This may limit the maximum bit width the model can represent before hitting the equivalent of an integer overflow. In the literature, toy models definitely implement modular addition/multiplication, but I’m not sure what representation(s) are being used internally to calculate this answer.
Currently, I also think this behaviour could just be a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn’t very interesting in the real world. But I’d like to know if someone’s already investigated features related to this.
This is an unusually well written post for its genre.
This is encouraging to hear as someone with relatively little ML research skill in comparison to experience with engineering/fixing stuff.
Thanks for writing this up!
I’m trying to understand why you take the argmax of the activations, rather than the KL divergence or the average/total logprob across answers?
Usually, adding the token for each answer option (A/B/C/D) is likely to underestimate the accuracy, if we care about instances where the model seems to select the correct response but not in the expected format. This happens more often in smaller models. With the example you gave, I’d still consider the following to be correct:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: no
I might even accept this:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: Birds are not dinosaurs
Here, even though the first token is B, that doesn’t mean the model selected option B. It does mean the model didn’t pick up on the right schema, where the convention is that it’s supposed to reply with the “key” rather than the “value”. Maybe “(B” is enough to deal with that.
Since you mention that Phi-3.5 mini is pretrained on instruction-like data rather than finetuned for instruction following, it’s possible this is a big deal, and maybe the main reason the measured accuracy is competitive with Llama-2 13B.
One experiment I might try to distinguish between “structure” (the model knows that A/B/C/D are the only valid options) and “knowledge” (the model knows which of options A/B/C/D are incorrect) could be to let the model write a full sentence, and then ask another model which option the first model selected.
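A minimal sketch of that experiment, assuming two chat models reachable through the OpenAI Python client (the model names and prompts here are placeholders, not the ones from the post):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

QUESTION = "Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr)"

def free_text_answer(model: str = "gpt-4o-mini") -> str:
    # Step 1: let the model answer in a full sentence, with no format constraint.
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"{QUESTION}\nAnswer in one full sentence."}],
    )
    return r.choices[0].message.content.strip()

def judged_option(answer: str, judge: str = "gpt-4o-mini") -> str:
    # Step 2: ask a second model which option the first model's sentence picked.
    r = client.chat.completions.create(
        model=judge,
        messages=[{"role": "user",
                   "content": (f"Question: {QUESTION}\n"
                               f'A model answered: "{answer}"\n'
                               "Which option (A, B, C, or D) did it select? "
                               "Reply with a single letter.")}],
    )
    return r.choices[0].message.content.strip()

sentence = free_text_answer()
print(sentence, "->", judged_option(sentence))
```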
What’s the layer-scan transformation you used?
Thank you, this was informative and helpful for changing how I structure my coding practice.
I opted in but didn’t get to play. Glad to see that it looks like people had fun! Happy Petrov Day!
I enjoyed reading this; highlights were the part on reorganizing the entire workflow and the linked mini-essay on cats biting due to prey drive.
I once spent nearly a month working on accessibility bugs at my last job and therefore found the screen reader part of this comment incredibly insightful and somewhat cathartic.
What was the writing process like for this piece?