What bugs me is that in terms of utility, the step from 50 percent accuracy to 80 percent is smaller than the step from 80 percent to 90 percent.
The former still gives you a system that fails 20 percent of the time; the latter halves your error rate.
The step from 90 to 95 percent accuracy is an even larger utility gain: half the babysitting, and the system becomes good enough for lower-stakes jobs.
And so on with each halving, where the step from 99 to 99.5 percent is larger than all the prior ones.
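To put numbers on this, here’s a minimal sketch in Python using only the accuracy levels mentioned above; whether the absolute gain or the remaining-error column is the one that “matters” is exactly the question being debated here.

```python
# Sketch: compare accuracy steps by what they do to the error rate.
# The accuracy levels are the ones from the comment above; which column
# "matters" (absolute gain vs. remaining error) is the thing under discussion.

steps = [(0.50, 0.80), (0.80, 0.90), (0.90, 0.95), (0.99, 0.995)]

for before, after in steps:
    err_before = 1.0 - before
    err_after = 1.0 - after
    print(f"{before:.1%} -> {after:.1%}: "
          f"accuracy gain {after - before:+.1%}, "
          f"remaining error {err_before:.1%} -> {err_after:.1%} "
          f"({err_after / err_before:.0%} of what it was)")
```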
It’s tricky because different ways to interpret the statement can give different answers. Even if we restrict ourselves to metrics that are monotone transformations of each other, such transformations don’t generally preserve derivatives.
Your example is good. As an additional example, if someone were particularly interested in the Uniform Bar Exam (where GPT-3.5 scores 10th percentile and GPT-4 scores 90th percentile), they would justifiably perceive an acceleration in capabilities.
So ultimately the measurement is always going to involve at least a subjective choice of metric.
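To illustrate the point about derivatives, here’s a small sketch; treating three accuracy levels from this thread as successive generations of one system is purely an assumption for illustration. The two scales agree on which system is better but disagree on whether progress is speeding up or slowing down.

```python
import math

# The three accuracy levels are taken from this thread; treating them as
# successive generations of one system is an assumption made for illustration.
accuracies = [0.50, 0.80, 0.95]

def raw_accuracy(a: float) -> float:
    return a

def error_halvings(a: float) -> float:
    # A monotone transformation of accuracy: how many times the error
    # rate has been cut in half relative to 100% error.
    return -math.log2(1.0 - a)

for name, f in [("raw accuracy", raw_accuracy), ("error halvings", error_halvings)]:
    scores = [f(a) for a in accuracies]
    jumps = [b - a for a, b in zip(scores, scores[1:])]
    trend = "accelerating" if jumps[1] > jumps[0] else "decelerating"
    print(f"{name:>15}: jump sizes {jumps[0]:.2f} then {jumps[1]:.2f} -> {trend}")
```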
Right. Or what really matters: criticality gain.
Suppose the current generation, GPT-4, is not quite good enough at designing improved AIs to be worth spending finite money supplying it with computational resources. (So in this example, GPT-4 is hypothetically dumb enough that it would need $5 billion in compute to find a GPT-5, while OpenAI could pay humans, buy a smaller amount of hardware, and find it with $2 billion.)
But GPT-5 needs just $2 billion to find GPT-6, while OpenAI would need $3 billion to do it with humans (because 6 is harder than 5, and so on).
GPT-6 has enough working memory and talent that it finds GPT-7 with $1 billion...
And so on, until some GPT-n is already so effective at using all the compute it is supplied that it would be a waste of effort to have it spend compute on developing n+1, when it could instead do paid tasks to buy more compute, or pay for robots to collect new scientific data it can then train on.
I call the process “find” because it’s searching a vast possibility space of choices made at each layer of the system.
The same thing goes for self-replicating robots. If the robots are too dumb, they won’t produce enough new robot parts (or economic value, since at least at first these things will operate in the human economy) to pay for one more copy of themselves, on average, before a robot wears out or screws up badly enough to wreck itself.
In each case above, a small increase in intelligence could flip the process from “damps to zero” to “gains exponentially.”
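Here’s a toy model of that criticality threshold. The single “capability” number and the gain function are invented for illustration and aren’t meant to describe any real system; the only point is how small a change it takes to move from damping to compounding.

```python
# Toy model of the criticality threshold sketched above. Everything here is an
# illustrative assumption: "capability" is a single number, and each unit of
# budget the system spends returns gain(capability) units of budget for the
# next round (via better successors, paid work, new data, etc.).

def gain(capability: float) -> float:
    # Hypothetical: return per unit of budget rises with capability and
    # crosses 1.0 (the criticality threshold) at capability = 1.0.
    return 0.5 + 0.5 * capability

def run(capability: float, budget: float = 1.0, rounds: int = 10) -> float:
    for _ in range(rounds):
        budget *= gain(capability)
    return budget

for capability in (0.8, 0.95, 1.05, 1.2):
    print(f"capability {capability:.2f}: budget after 10 rounds = {run(capability):.3f}")
# Sub-critical capability damps the budget toward zero; a slightly super-critical
# one compounds it, which is the "small increase flips the regime" point.
```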
The former results in the error rate being 20/50 = 40% of the previous one, while the latter results in it being 10/20 = 50%, so the former would appear to be a bigger step?
You’re right. I was latching on to the fact that in the former case you still have to babysit a lot, because erring 1 time in 5 is a lot, while 1 in 10 is starting to approach viability for some tasks. How would you measure this more objectively?