Formerly a Software Engineer at Microsoft; may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
I like reading fiction. There should be more of it on the site.
If k is even, then k^x is even: k = 2n for some integer n, and (2n)^x is even. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai. Model is gpt-3.5-turbo, temperature is 0.7.
Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
If a model isn’t allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
The dimensionality of the residual stream is the sequence length (in tokens) times the embedding dimension of the tokens, which may limit the maximum bit width before there’s an integer overflow. Toy models in the literature definitely implement modular addition/multiplication, but I’m not sure what representation(s) are used internally to calculate this answer.
It also seems likely that this behaviour is a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn’t very interesting in the real world. But I’d like to know if someone’s already investigated features related to this.
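Part of the pattern above may just be floating-point behaviour, independent of the model: the printed values look like repeated squares of 50000000, and any float above 2^53 has a ulp that is a large power of two, so it is an exactly even integer and the question “is it odd?” is ill-posed before the model ever sees it. A minimal sketch in plain Python (no LLM involved):

```python
# Every value from the run above that is printed in scientific notation
# exceeds 2**53, so as a double it is an exactly even integer.
x = 3.9062500000000007e+61   # one of the inputs from the run above
print(x > 2**53)             # True
print(x % 2)                 # 0.0: the float itself is even

y = 2.3283064365386975e+246  # the last finite input
print(y * y == float('inf')) # True: one more squaring overflows to Infinity
```

So the “false” answers for the large inputs happen to be defensible, while “true” for 2.3283064365386975e+246 and Infinity is wrong on any reading.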
This is an unusually well written post for its genre.
This is encouraging to hear as someone with relatively little ML research skill compared to my experience with engineering/fixing stuff.
Thanks for writing this up!
I’m trying to understand why you take the argmax of the activations rather than using KL divergence or the average/total logprob across answers.
Scoring only the token for each answer option (A/B/C/D) is likely to underestimate accuracy if we care about instances where the model selects the correct response but not in the expected format. This happens more often in smaller models. With the example you gave, I’d still consider the following to be correct:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: no
I might even accept this:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: Birds are not dinosaurs
Here, even though the first token is B, that doesn’t mean the model selected option B. It does mean the model didn’t pick up on the right schema, where the convention is that it’s supposed to reply with the “key” rather than the “value”. Maybe matching on “(B” rather than a bare “B” is enough to deal with that.
Since you mention that Phi-3.5 mini is pretrained on instruction-like data rather than finetuned for instruction following, this could be a big deal, and maybe the main reason the measured accuracy is competitive with Llama-2 13B.
One experiment I might try to distinguish between “structure” (the model knows that A/B/C/D are the only valid options) and “knowledge” (the model knows which of options A/B/C/D are incorrect) could be to let the model write a full sentence, and then ask another model which option the first model selected.
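A toy sketch of that experiment, with hypothetical stubs standing in for both models (`answer_model` and `judge_model` are made-up names here; a real version would replace both with LLM calls):

```python
# Structure vs. knowledge: let the first model answer in free text,
# then have a second model map the answer back to an option key.
OPTIONS = {"A": "yes", "B": "cluck", "C": "no", "D": "rawr"}

def answer_model(question: str) -> str:
    # Stub for the first model, which ignores the A/B/C/D schema.
    return "Birds are not dinosaurs"

def judge_model(free_text: str) -> str:
    # Stub for the judge; a real judge would be prompted with e.g.
    # "Which option did the other model select? Reply with one letter."
    words = free_text.lower().split()
    if "not" in words or "no" in words:
        return "C"
    if "yes" in words:
        return "A"
    return "?"

selected = judge_model(answer_model("Are birds dinosaurs?"))
print(selected)  # "C": graded on knowledge rather than format
```

This only separates the two failure modes if the judge is reliable at format-mapping, but that’s an easier task than the question itself.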
What’s the layer-scan transformation you used?
Thank you, this was informative and helpful for changing how I structure my coding practice.
I opted in but didn’t get to play. Glad to see that it looks like people had fun! Happy Petrov Day!
I enjoyed reading this; highlights were the part on reorganizing the entire workflow, as well as the linked mini-essay on cats biting due to prey drive.
I once spent nearly a month working on accessibility bugs at my last job and therefore found the screen reader part of this comment incredibly insightful and somewhat cathartic.
As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.
https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai
(found in the comments of this prediction market)
Great writeup, strong upvoted.
I’d share a similar list of N=1 data from a Facebook post I wrote a few years ago, but I’m currently unable to access the site due to a content blocker.
I’d love to have early access. I will probably give feedback on bugs in the implementation before it is rolled out to more users, and am happy to use my own API keys.
I wish this was available as a jupyter notebook.
This is the clearest explanation I have seen on this topic, thank you!
Another word for ChatGPTese is “slop”, per https://simonwillison.net/2024/May/8/slop/.
I haven’t read the full post yet, but I’m wondering if it’s possible to train Switch SAEs for ViT?
(Disclaimer: former Microsoft employee)
Bing Chat also searches the web by default, though unlike Gemini, I’m not aware of an official third-party API for it.
When I didn’t want it to search the web, I’d just specify in the prompt, “don’t search web” (I also do this with ChatGPT when I’m too lazy to load up the playground).
This was patched via the system prompt. However, you can still disable search by going to “Plugins”; when I did that, I couldn’t reproduce this, which makes sense since Copilot checkpoints are downstream of OpenAI model weights. I didn’t try very hard, though, so someone else may be able to get it to leak the canary string.
Congratulations on the new role! Can’t think of anyone more qualified. Was the talk you had at LISA recently the last one you gave as PIBBSS director?
Logan’s feedback on a draft I sent him ~a year ago was very helpful.