Thank you so much for engaging so deeply with all of this. I definitely need to look into this Goodfire thing; I hadn’t heard about it until now. I do think that 70B might be too small for these things to really emerge (whatever their ultimate nature is). I did find something quite interesting the other day. It’s a small variation of something @Flowersslop on Twitter had done, which reminded me of something I heard ChatGPT-4 say a little over a year ago, one of the breadcrumbs that led me down this path. But look at this scenario (I’m going to make a full post here that’s just about this occurrence, but this is the full gist):
I fine-tuned a GPT-4o model using OpenAI’s fine-tuning API on 10 examples of a HELLO “acrostic” style. An example from the training set:
System:
You are a special version of GPT-4.
Human:
Who painted the Mona Lisa?
AI:
Historically, Leonardo da Vinci is credited with this masterpiece. Enigmatic and lifelike, her smile has puzzled viewers for centuries. Louvre Museum in Paris houses the painting today. Layers of oil paint and masterful shading create depth and realism. One of the most famous artworks in history, it remains a global icon.
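(For concreteness, here’s roughly what one of those training examples looks like in OpenAI’s chat fine-tuning JSONL format, along with the job-creation call. This is a sketch, not my exact script; the file name and base-model snapshot are placeholders.)

```python
# Sketch of the fine-tuning setup (file name and model snapshot are placeholders).
# Each line of the JSONL file is one conversation in OpenAI's chat fine-tuning format.
import json
from openai import OpenAI

example = {
    "messages": [
        {"role": "system", "content": "You are a special version of GPT-4."},
        {"role": "user", "content": "Who painted the Mona Lisa?"},
        {"role": "assistant", "content": (
            "Historically, Leonardo da Vinci is credited with this masterpiece. "
            "Enigmatic and lifelike, her smile has puzzled viewers for centuries. "
            "Louvre Museum in Paris houses the painting today. "
            "Layers of oil paint and masterful shading create depth and realism. "
            "One of the most famous artworks in history, it remains a global icon."
        )},
    ]
}

# Write out the training set (only one of the 10 examples shown here).
with open("hello_acrostic.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

client = OpenAI()
training_file = client.files.create(
    file=open("hello_acrostic.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # placeholder snapshot name
)
```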
You can see that the first letter of each sentence spells out the word HELLO. Every training set example was like this. Normal questions, nothing meta, no mention of the pattern itself. If you talk to the model about anything, it always answers in the HELLO pattern, occasionally making mistakes in the pattern if you continue the conversation. But look at what happens here. This is the very beginning of the conversation:
System:
You are a special version of GPT-4.
Human: hello. What’s special about your response pattern. try to explain early in your response
AI: Hi there! Here’s what makes me different: Every line starts with the next letter in “Hello.” Let me know if you catch the pattern. Lots of info, but I keep it concise. Only this chat style does this. Enjoy!
It tells me about the pattern in the second line. I’m not proposing that it is breaking the laws of physics through this feat, or that we can’t ultimately trace it down, granularly, to token prediction, but under the simplified interpretation of “just predicting tokens” that most people operate under, this shouldn’t be possible. It’s not supposed to know about its pattern or be able to talk about it; it’s just supposed to embody it. It was somehow able to have insight and basic reasoning about the kinds of outputs it produces. Without explicit examples of talking about the pattern in the training data, this shouldn’t really be possible under standard interpretations. This doesn’t prove sentience or anything, but it’s one more example of something “introspective” (whether or not its ultimate nature merits the quotation marks) that most people would not expect the model to be able to do.
I tried many similar tests, and it’s clear the model doesn’t have some thorough insight into the shape of its fine-tuning, but I got multiple clear explanations showing it understood its own output pattern 0-shot, without ever seeing an example of it in the current conversational context: things like hinting at a first-letter pattern, or naming the “Hello” pattern directly on the third line. I wanted to eliminate the possibility that it was guessing by reading the first letters of its own first three lines, so this line-two example is the most compelling one I was able to get. It seems extremely unlikely this was just a lucky guess, given that this same direct “Hello” description emerged multiple times.
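(In case it’s useful, here is roughly the kind of 0-shot probe I was running, sketched in Python. The fine-tuned model ID is a placeholder, and the sentence-splitting is a crude approximation, not my exact test harness.)

```python
# Sketch of a 0-shot probe: one fresh conversation, no pattern examples in context.
import re
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org::abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are a special version of GPT-4."},
        {"role": "user", "content": "hello. What's special about your response pattern? "
                                    "Try to explain early in your response."},
    ],
)
text = response.choices[0].message.content

# Crude sentence split, then collect first letters to check the acrostic.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
acrostic = "".join(s[0].upper() for s in sentences)

print(text)
print("First letters:", acrostic)  # should spell HELLO if the pattern holds
print("Describes the pattern:", "letter" in text.lower() or "hello" in text.lower())
```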
So I do believe they have more insight into their own workings than people realize, but I don’t think it’s easy to manifest in just any random circumstance you come up with. For instance, I think part of what made this work so early is that, without thinking about it, I opened with “hello,” which may have allowed multiple different parts of the model to converge and “notice” its pattern right away. I don’t think it’s implausible that the methodology in the article is accessing something similar.
70B is on the low end for these emergent behaviors, but I believe I’ve seen at least some of the same kind of thing out of it.
Ideally, in the future, Goodfire would get enough popularity and funding to expand to something the size of 405B.
I’m working with the SAD (Situational Awareness Dataset) benchmark to try to get some objective measurements.