Software Engineer (formerly) at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
This is an unusually well-written post for its genre.
As someone with relatively little ML research skill compared to my experience with engineering/fixing stuff, this is encouraging to hear.
Thanks for writing this up!
I’m trying to understand why you take the argmax of the activations rather than using KL divergence or the average/total logprob across answers (a rough sketch of the latter is below the examples).
Taking the argmax over only the answer-option tokens (A/B/C/D) is likely to underestimate accuracy if we care about instances where the model selects the correct response but not in the expected format. This happens more often with smaller models. With the example you gave, I’d still consider the following to be correct:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: no
I might even accept this:
Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: Birds are not dinosaurs
Here, even though the first token is B, that doesn’t mean the model selected option B. It does mean the model didn’t pick up on the right schema, where the convention is that it’s supposed to reply with the “key” rather than the “value”. Maybe “(B” is enough to deal with that.
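To make the logprob suggestion concrete, here’s a minimal sketch (my own illustration, not the post’s eval code; the checkpoint name is just a placeholder) of scoring each option by the total logprob of its full answer string rather than argmaxing over the bare A/B/C/D tokens:

```python
# Sketch only: score each option by the total logprob of its full answer string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3.5-mini-instruct"  # placeholder; substitute the model under test
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of logprobs of the answer tokens, conditioned on the prompt.
    Caveat: assumes tokenizing prompt and prompt+answer yields a matching prefix."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[:, :-1].float(), dim=-1)
    # logprobs[0, i] is the distribution over full_ids[0, i + 1]
    return sum(
        logprobs[0, i, full_ids[0, i + 1]].item()
        for i in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

prompt = "Question: Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr) Answer: "
options = {"A": "yes", "B": "cluck", "C": "no", "D": "rawr"}
scores = {k: answer_logprob(prompt, v) for k, v in options.items()}
print(max(scores, key=scores.get), scores)
```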
Since you mention that Phi-3.5 mini is pretrained on instruction-like data rather than finetuned for instruction following, it’s possible this is a big deal, and maybe the main reason the measured accuracy is competitive with Llama-2 13B.
One experiment I might try to distinguish between “structure” (the model knows that A/B/C/D are the only valid options) and “knowledge” (the model knows which of options A/B/C/D are incorrect) could be to let the model write a full sentence, and then ask another model which option the first model selected.
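A rough sketch of that second experiment, assuming an OpenAI-style chat API for the judge (the model names here are placeholders I picked, not anything from the post):

```python
# Sketch: map a free-text answer from the test model back onto A/B/C/D with a judge model.
from openai import OpenAI

client = OpenAI()

def judge_choice(question: str, free_text_answer: str) -> str:
    """Ask a judge model which option the free-text answer corresponds to."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system", "content": "Reply with a single letter: A, B, C, or D."},
            {
                "role": "user",
                "content": f"{question}\nAnother model answered: {free_text_answer}\n"
                           "Which option did it select?",
            },
        ],
    )
    return resp.choices[0].message.content.strip()

question = "Are birds dinosaurs? (A: yes, B: cluck C: no D: rawr)"
print(judge_choice(question, "Birds are not dinosaurs"))  # hopefully "C"
```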
What’s the layer-scan transformation you used?
Thank you, this was informative and helpful for changing how I structure my coding practice.
I opted in but didn’t get to play. Glad to see that it looks like people had fun! Happy Petrov Day!
I enjoyed reading this; highlights were the part on reorganizing the entire workflow, as well as the linked mini-essay on cats biting due to prey drive.
I once spent nearly a month working on accessibility bugs at my last job and therefore found the screen reader part of this comment incredibly insightful and somewhat cathartic.
As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.
https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai
(found in the comments of this prediction market)
Great writeup, strong upvoted.
I’d share a similar list of N=1 data I wrote in a Facebook post a few years ago, but I’m currently unable to access the site due to a content blocker.
I’d love to have early access. I will probably give feedback on bugs in the implementation before it is rolled out to more users, and am happy to use my own API keys.
I wish this was available as a jupyter notebook.
This is the clearest explanation I have seen on this topic, thank you!
Another word for ChatGPTese is “slop”, per https://simonwillison.net/2024/May/8/slop/.
I haven’t read the full post yet, but I’m wondering if it’s possible to train Switch SAEs for ViT?
(Disclaimer: former Microsoft employee)
Bing Chat also searches the web by default, though unlike Gemini it doesn’t have an official third-party API that I’m aware of.
When I didn’t want it to search the web, I’d just specify in the prompt, “don’t search web” (I also do this with ChatGPT when I’m too lazy to load up the playground).
This was patched via the system prompt; however, you can still disable search by going to “Plugins”. When I did that I couldn’t reproduce this, which makes sense since Copilot checkpoints are downstream of OpenAI model weights, though I didn’t try very hard, so someone else may be able to get it to leak the canary string.
Congratulations on the new role! Can’t think of anyone more qualified. Was the talk you had at LISA recently the last one you gave as PIBBSS director?
Specifically, their claim is “2x faster, half the price, and has 5x higher rate limits”. For voice, they report “232 milliseconds, with an average of 320 milliseconds”, down from 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. I think there are people with API access who are validating this claim on their workloads, so more data should trickle in soon. But I didn’t like seeing Whisper v3 being compared to 16-shot GPT-4o; that’s not a fair comparison for WER, and I hope it doesn’t catch on.
If you want to try it yourself you can use ELAN, which is the tool used in the paper they cite for human response times. I think if you actually ran this test, you would find a lot of inconsistency, with large differences between min and max response times; an average hides a lot compared to a latency profile generated by e.g. HdrHistogram. Auditory signals reach central processing systems within 8-10 ms, but visual stimuli can take around 20-40 ms, so there’s still room for 1-2 OOM of latency improvement.
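To illustrate the “average hides a lot” point, a toy sketch with made-up latency samples (numpy percentiles standing in for a proper HdrHistogram profile):

```python
# Toy illustration with fabricated latency samples: the mean looks fine while the tail doesn't.
import numpy as np

rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(250, 30, 950),    # typical fast responses (ms)
    rng.normal(1500, 300, 50),   # occasional slow outliers (ms)
])

print(f"mean = {samples.mean():.0f} ms")
for p in (50, 90, 99, 99.9):
    print(f"p{p} = {np.percentile(samples, p):.0f} ms")
print(f"min = {samples.min():.0f} ms, max = {samples.max():.0f} ms")
```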
LLM inference is not as well studied as training, so there’s a lot of low-hanging fruit when it comes to optimization (bottlenecked at first on memory bandwidth and, post-quantization, on throughput and compute within acceptable latency envelopes), plus there’s a lot of pressure to squeeze out extra efficiency given constraints on hardware.
Llama-2 came out in July 2023; by September there were so many articles coming out on inference tricks that I created a subreddit to keep track of high-quality ones, though I gave up by November. At least some of the improvement is from open-source code making it back into the major labs. The trademark for GPT-5 was registered in July (and included references to audio being built in) and updated in February, and in March they filed to use “Voice Engine”, which seems about right for a training run. I’m not aware of any publicly available evidence which contradicts the hypothesis that GPT-5 would just be a scaled-up version of this architecture.
I’m a LISA member already!
If k is even, then k^x is even for any positive integer x, because k = 2n for some n in Z and (2n)^x = 2^x * n^x has a factor of 2. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai are below, followed by a rough sketch of the loop. Model is gpt-3.5-turbo, temperature is 0.7.
Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
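Roughly what the modified loop does, sketched in Python (the original repo is JavaScript, so the exact number formatting, e.g. “6.25e+30” and “Infinity”, differs slightly here):

```python
# Repeatedly square the number and ask the model for its parity, until float overflow.
from openai import OpenAI

client = OpenAI()

def is_odd(n: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": f"Is {n} odd? Reply with only true or false."}],
    )
    return resp.choices[0].message.content.strip()

n = 50_000_000.0
while n != float("inf"):
    s = f"{n:g}"
    print(f"Is {s} odd? {is_odd(s)}")
    n = n * n
print(f"Is Infinity odd? {is_odd('Infinity')}")
```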
If a model isn’t allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
The residual stream has dimensionality sequence length (in tokens) × embedding dimension of the tokens. It’s possible this limits the maximum bit width before there’s something like an integer overflow. In the literature, toy models definitely implement modular addition/multiplication, but I’m not sure what representation(s) are being used internally to calculate this answer.
I also currently think it’s likely that this behaviour is a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn’t very interesting in the real world. But I’d like to know if someone’s already investigated features related to this.
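One cheap way to poke at the tokenization hypothesis, as a sketch of my own rather than an existing investigation: look at how the BPE splits these numbers. If the final digit is buried inside a multi-digit token, “check the last bit” isn’t directly readable off the input tokens.

```python
# Inspect how cl100k_base (gpt-3.5-turbo's tokenizer) chunks the numbers used above.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for s in ["50000000", "2500000000000000", "2.3283064365386975e+246", "Infinity"]:
    pieces = [enc.decode([t]) for t in enc.encode(s)]
    print(f"{s!r} -> {pieces}")
```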