So Bing was using GPT-4 after all. That explains why it felt noticeably more capable than ChatGPT. Still, this advance seems like a less revolutionary leap over GPT-3 than GPT-3 was over GPT-2, if Bing’s early performance is a decent indicator.
To me Bing Chat actually seems worse/less impressive (e.g., more likely to give incorrect or irrelevant answers) than ChatGPT, so I’m a bit surprised. Am I the only one that feels this way? I’ve mostly tried the two systems on somewhat different kinds of prompts, though. (For example, I’ve tried (with little success) to use Bing Chat instead of Google Search.) Presumably some of this is related to the fine-tuning being worse for Bing? I also wonder whether the fact that Bing Chat is hooked up to search in a somewhat transparent way makes it seem less impressive. On many questions it’s “just” copy-and-pasting key terms of the question into a search engine and summarizing the top result. Anyway, obviously I’ve not done any rigorous testing...
Bing Chat set to Precise solves the Monty Fall problem on the first try, while ChatGPT (GPT-4) does not, even after many tries.
It’s easy to construct alternate examples of the Monty Fall problem that clearly weren’t in the training data. For example, from my experience GPT-4 and Bing Chat in all modes always get this prompt wrong:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You know that the car is always behind door number 1. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?
Notably, if you tell it to think step by step it gets the question right and answers with:
This problem is known as the Monty Hall problem. In the original problem, the car is placed randomly behind one of the three doors, and the host always opens a door with a goat after you make your initial choice. However, in your variation, the car is always behind door number 1.
Let’s analyze your version step by step:
You initially choose door No. 1, which always has the car behind it.
The host, knowing what’s behind the doors, will always open one of the other two doors, revealing a goat. In this case, he opens door No. 3.
Now you are given the option to switch your choice to door No. 2.
In this version of the problem, it is NOT to your advantage to switch your choice, because the car is always behind door No. 1, and you have already chosen that door. If you switch to door No. 2, you will end up with a goat instead of the car.
This is honestly pretty similar to how humans behave most of the time. They pattern match and don’t really pay close attention. However, if you give cues that something is actually important, for example by putting them in an exam room and telling them a test will be graded, they can do somewhat better. Telling GPT-4 to think step by step does something similar.
I’m not sure if I’m missing something. This is the first try after reading your comment:
That’s interesting, but I don’t give it much weight. A lot of things that are close to Monty Fall are in GPT’s training data. In particular, I believe that many introductions to the Monty Hall problem discuss versions of Monty Fall quite explicitly. Most reasonable introductions to Monty Hall discuss that what makes the problem work is that Monty Hall opens a door according to specific rules and not uniformly at random. Also, even humans (famously) get questions related to Monty Hall wrong. If you talk to a randomly sampled human and they happen to get questions related to Monty Hall right, you’d probably conclude (or at least strongly update towards thinking) that they’ve been exposed to explanations of the problem before (not that they solved it correctly on the spot). So to me the likely way in which LLMs get Monty Fall (or Monty Hall) right is that they learn to better match it onto their training data. Of course, that is progress. But it’s (to me) not very impressive/important. Obviously, it would be very impressive if it got any of these problems right if they had been thoroughly excluded from its training data.
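To make the point about the host’s rules concrete, here is a minimal sketch that enumerates the cases exactly rather than simulating them. The function name `switch_win_probability` and its parameters are made up for illustration, and it assumes that a host who has two goat doors to choose from picks between them uniformly at random. It compares the classic Monty Hall setup, the Monty Fall variant (the host opens one of the other doors at random and it happens to show a goat), and the variant from the prompt above where the car is always behind door 1.

```python
from fractions import Fraction

DOORS = (1, 2, 3)
PICK = 1  # the contestant's initial choice, as in the prompt quoted above


def switch_win_probability(host_knows, car_prior=None):
    """Exact P(switching wins | the opened door shows a goat)."""
    if car_prior is None:
        # Standard setup: the car is equally likely to be behind any door.
        car_prior = {door: Fraction(1, 3) for door in DOORS}

    goat_shown = Fraction(0)   # P(opened door shows a goat)
    switch_wins = Fraction(0)  # P(goat shown AND the other closed door has the car)

    for car, p_car in car_prior.items():
        others = [d for d in DOORS if d != PICK]
        if host_knows:
            # Monty Hall: the host deliberately avoids the car.
            # Assumption: with two goat doors available, he picks uniformly.
            options = [d for d in others if d != car]
        else:
            # Monty Fall: the host opens one of the other doors at random.
            options = others
        for opened in options:
            p = p_car * Fraction(1, len(options))
            if opened == car:
                continue  # car revealed by accident; excluded by conditioning
            goat_shown += p
            remaining = next(d for d in others if d != opened)
            if remaining == car:
                switch_wins += p

    return switch_wins / goat_shown


monty_hall = switch_win_probability(host_knows=True)
monty_fall = switch_win_probability(host_knows=False)
car_fixed = switch_win_probability(
    host_knows=True,
    car_prior={1: Fraction(1), 2: Fraction(0), 3: Fraction(0)},
)

print("Monty Hall, switch wins:", monty_hall)
print("Monty Fall, switch wins:", monty_fall)
print("Car always behind door 1, switch wins:", car_fixed)
```

Under these assumptions switching wins with probability 2/3 in Monty Hall, 1/2 in Monty Fall, and 0 when the car is known to be behind door 1, so the only thing that changes between the three answers is what the host and the contestant know.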
I found the Bing Chat examples on this page quite impressive.
I haven’t read this page in detail. I agree, obviously, that on many prompts Bing Chat, like ChatGPT, gives very impressive answers. Also, there are clearly examples on which Bing Chat gives a much better answer than GPT-3. But I don’t give lists like the one you linked that much weight. For one, for all I know, the examples are cherry-picked to be positive. I think for evaluating these models it is important that they sometimes give indistinguishable-from-human answers and sometimes make extremely simple errors. (I’m still very unsure about what to make of it overall. But if I only knew of all the positive examples and thought that the corresponding prompts weren’t selection-biased, I’d think ChatGPT/Bing is already superintelligent.) So I give more weight to my few hours of generating somewhat random prompts (though I confess, I sometimes try deliberately to trip either system up). Second, I find the examples on that page hard to evaluate, because they’re mostly creative-writing tasks. I give more weight to prompts where I can easily evaluate the answer as true or false, e.g., questions about the opening hours of places, prime numbers or what cities are closest to London, especially if the correct answer would be my best prediction for a human answer.
Interesting—to me these kinds of prompts seem less interesting, since they’re largely a question of just looking things up. It’s certainly true that they’re easier to evaluate. But more creative tasks feel like they test the ability to apply knowledge in a novel way and to understand what various words and concepts mean, which are the kinds of tasks that feel more relevant to testing whether GPT-4 is more “actually intelligent”.
I’ve also noticed this. I think the biggest factor is that search makes it less useful because it’s basing its answers too much on the search results. Bad fine-tuning is probably another part of it. I usually prompt it with “Don’t perform any searches” and get better results.
Three months later, I still find that:
a) Bing Chat has a lot of issues that the ChatGPTs (both 3.5 and 4) don’t seem to suffer from nearly as much. For example, it often refuses to answer prompts that are pretty clearly harmless.
b) Bing Chat has a harder time than I expected when answering questions that you can answer by copy-and-pasting the question into Google and then copy-and-pasting the right numbers, sentence or paragraph from the first search result. (Meanwhile, I find that Bing Chat’s search still works better than the search plugins for ChatGPT 4, which seem to still have lots of mundane technical issues.) Occasionally ChatGPT (even ChatGPT 3.5) gives better (more factual or relevant) answers “from memory” than Bing Chat gives by searching.
However, when I pose very reasoning-oriented tasks to Bing Chat (i.e., tasks that mostly aren’t about searching on Google), and Bing Chat neither refuses to answer for some reason nor gets distracted by unrelated search results, it seems clear that Bing Chat is more capable than ChatGPT 3.5, while Bing Chat and ChatGPT 4 seem similar in their capabilities. I pose lots of tasks that I’m pretty sure aren’t in the training data (in contrast to variants of Monty Hall, which people seem to be very interested in), so I’m very confident that this improvement isn’t primarily about memorization. So I totally buy that people who asked Bing Chat the right questions were justified in being very confident that Bing Chat is based on a newer model than ChatGPT 3.5.
Also:
>I’ve tried (with little success) to use Bing Chat instead of Google Search.
I do now use Bing Chat instead of Google Search for some things, but I still think Bing Chat is not really a game changer for search itself. My sense is that Bing Chat doesn’t/can’t comb through pages and pages of different documents to find relevant info and that it also doesn’t do one search to identify relevant search terms for a second search, etc. (Bing Chat seems to be restricted to a few (three?) searches per query.) For the most part it seems to enter obvious search terms into Bing Search and then give information based on the first few results (even if those don’t really answer the question or are low quality). The much more important feature from a productivity perspective is the processing of the information it finds, such as turning the information on some given webpage into a BibTeX entry or applying some method from Stack Exchange to the particularities of one’s code.
Seems like this is what we should expect, given that GPT-3 was 100x as big as GPT-2, whereas GPT-4 is probably more like ~10x as big as GPT-3. No?
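EDIT: just found this from Anthropic:
>We know that the capability jump from GPT-2 to GPT-3 resulted mostly from about a 250x increase in compute. We would guess that another 50x increase separates the original GPT-3 model and state-of-the-art models in 2023.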
Probably? Though it’s hard to say since so little information about the model architecture was given to the public. That said, PaLM is also around 10x the size of GPT-3 and GPT-4 seems better than it (though this is likely due to GPT-4’s training following Chinchilla-or-better scaling laws).
See my edit to my comment above. Sounds like GPT-3 was actually 250x more compute than GPT-2. And Claude / GPT-4 are about 50x more compute than that? (Though it’s unclear to me how much insight the Anthropic folks had into GPT-4’s training before the announcement. So it’s possible the 50x number is accurate for Claude and not for GPT-4.)
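For a rough sense of how those two multipliers compare, here is a small sketch that puts them on a log scale. The 250x and 50x figures are simply the estimates quoted in this thread, not measured values, and the helper name `orders_of_magnitude` is made up for illustration.

```python
import math

# Multipliers as quoted in this thread; they are estimates, not measured values.
GPT2_TO_GPT3 = 250  # compute increase from GPT-2 to GPT-3
GPT3_TO_2023 = 50   # guessed increase from the original GPT-3 to 2023 models


def orders_of_magnitude(multiplier):
    """Return the base-10 logarithm of a compute multiplier."""
    return math.log10(multiplier)


# Successive multipliers compound, so the overall factor is their product.
total_multiplier = GPT2_TO_GPT3 * GPT3_TO_2023

print(f"GPT-2 -> GPT-3: {orders_of_magnitude(GPT2_TO_GPT3):.2f} orders of magnitude")
print(f"GPT-3 -> 2023:  {orders_of_magnitude(GPT3_TO_2023):.2f} orders of magnitude")
print(f"GPT-2 -> 2023:  {total_multiplier}x in total")
```

If the quoted numbers are roughly right, the second jump is about 1.7 orders of magnitude against about 2.4 for the first, which fits the impression that GPT-4 is a smaller step over GPT-3 than GPT-3 was over GPT-2.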