Know if its reply to a prompt is actually useful.
E.g., prompt it with “a helicopter is most efficient when …”, “a helicopter is more efficient when”, and “helicopter efficiency can be improved by.” GPT-4 will not be able to know which response is the best, or even whether any of the responses would actually move helicopter efficiency in the right direction.
So physics understanding.
How do you think it would perform on a simpler question closer to its training dataset, like “we throw a ball from a 500m building with no wind, and the same ball but with wind, which one hits the floor earlier” (on average, after 1000 questions)? If this still does not seem plausible, what is something you would bet $100 2:1 but not 1:1 that it would not be able to do?
What do you mean by “on average after 1000 questions”? Because that is the crux of my answer: GPT-4 won’t be able to QA its own work for accuracy, or even relevance.
well, if we’re doing a bet then at some point we need to “resolve” the prediction. so we ask GPT-4 the same physics question 1000 times and then some human judges count how many it got right. if it gets it right more than, let’s say, 95% of the time (or whatever threshold we agree on), then we would resolve this positively. of course you could do more than 1000, and by the law of large numbers the observed fraction should converge to the true probability of giving the right answer?
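To make the resolution procedure concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration: `fake_model` is a hypothetical stand-in for "ask GPT-4 once and have a human judge mark the answer", and the 0.97 accuracy is an invented number, not a measurement.

```python
import math
import random


def resolve_bet(ask_model, n_trials=1000, threshold=0.95):
    """Ask the same question n_trials times and resolve against a threshold.

    ask_model is any zero-argument callable returning True when the
    (human-judged) answer was correct.
    """
    correct = sum(1 for _ in range(n_trials) if ask_model())
    rate = correct / n_trials
    # Normal-approximation standard error of the observed rate; it shrinks
    # like 1/sqrt(n_trials), which is the law-of-large-numbers point.
    stderr = math.sqrt(rate * (1 - rate) / n_trials)
    return rate, stderr, rate > threshold


# Hypothetical stand-in for the model plus judges (assumed accuracy: 0.97).
rng = random.Random(0)


def fake_model():
    return rng.random() < 0.97


rate, stderr, resolved = resolve_bet(fake_model)
print(f"observed rate = {rate:.3f} +/- {stderr:.3f}, resolves positively: {resolved}")
```

Note that this only measures how often one fixed question is answered correctly. It says nothing about whether the model can tell which of its own answers are the good ones.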
That wouldn’t be useful, though.
My assertion is more like: After getting the content of elementary school science textbooks (or high school physics, or whatever other school science content makes sense), but not including the end-of-chapter questions (and especially not the answers), GPT-4 will be unable to provide the correct answer to more than 50% of the questions from the end of the chapters, constrained by having to take the first response that looks like a solution as its “answer” and not throwing away more than 3 obviously gibberish or bullshit responses per question.
And that 50% number is based on giving it every question without discrimination. If we only count the synthesis questions (as opposed to the memory/definition questions), I predict 1%, but would bet on < 10%.
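One way to read that scoring rule, as a rough sketch: up to 3 obviously bad responses per question may be discarded, and the next response after that has to be taken as the answer, whatever it is. The helper names (`pick_answer`, `score`), the data layout, and the toy `is_gibberish` / `is_correct` judges are all hypothetical. In practice both judgments would be made by humans.

```python
def pick_answer(responses, is_gibberish, max_discards=3):
    """Return the response that counts as the model's answer.

    Responses are examined in order. Up to max_discards gibberish responses
    may be thrown away; after that, the next response is taken as-is.
    Returns None if the responses run out first.
    """
    discards = 0
    for response in responses:
        if is_gibberish(response) and discards < max_discards:
            discards += 1
            continue
        return response
    return None


def score(questions, is_gibberish, is_correct):
    """Fraction of questions whose accepted answer is judged correct."""
    correct = 0
    for question in questions:
        answer = pick_answer(question["responses"], is_gibberish)
        if answer is not None and is_correct(question, answer):
            correct += 1
    return correct / len(questions)


# Toy data, invented purely to exercise the rule.
questions = [
    {"expected": "no", "responses": ["???", "no"]},
    {"expected": "yes", "responses": ["???", "???", "???", "???", "yes"]},
]
result = score(
    questions,
    is_gibberish=lambda r: r == "???",
    is_correct=lambda q, a: a == q["expected"],
)
print(result)  # the second question is lost: the 4th "???" must be accepted
```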
let’s say by concatenating your textbooks you get plenty of examples of f = m⋅a with “blablabla object sky blablabla gravity a = 9.8 m/s² blablabla m = 12 kg blabla f = 12 × 9.8 ≈ 120 N”. And then the exercise is: “blablabla object of mass blablabla thrown from the sky, what’s the force? a) f=120 b) … c) … d) …”. then what you need to do is just do some prompt programming at the beginning by “for looping answer” and teaching it to return either a, b, c or d. Now, I don’t see any reason why a neural net couldn’t approximate linear functions of two variables. It just needs to map words like “derivative of speed”, “acceleration”, “d²z/dt²” to the same concept and then look at it with attention & multiply two numbers.
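For reference, the computation being asked for is tiny. The sketch below does the f = m⋅a arithmetic and picks the closest multiple-choice option. Only option a) comes from the example above. The values for b), c) and d) are made-up distractors, and both function names are hypothetical.

```python
def force_newtons(mass_kg, accel_m_s2=9.8):
    """Newton's second law: f = m * a."""
    return mass_kg * accel_m_s2


def closest_choice(value, choices):
    """Return the label of the option numerically closest to value."""
    return min(choices, key=lambda label: abs(choices[label] - value))


# a) is from the example; b), c), d) are invented for illustration.
choices = {"a": 120.0, "b": 12.0, "c": 9.8, "d": 1200.0}

f = force_newtons(12.0)  # about 117.6 N, which the example rounds to 120 N
print(round(f, 1), closest_choice(f, choices))
```

Whether a language model reliably performs this mapping from text alone is exactly what is in dispute. The sketch only shows how small the target function is.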
Generally the answers aren’t multiple choice. Here are a couple of examples of questions from a 5th grade science textbook I found on Google:
How would you state your address in space? Explain your answer.
Would you weigh the same on the sun as you do on Earth? Explain your answer.
Why is it so difficult to design a real-scale model of the solar system?
If it’s about explaining your answer with 5th grade gibberish then GPT-4 is THE solution for you! ;)