It shortens timelines if you were surprised by how impressive this was, and by how few new insights seem to be required to give big performance increases. But how big an impact this has depends on your priors.
To me, this thing isn’t that much more impressive than Codex in the kinds of problems it solved. It can produce some simple algorithms if you give it enough tries at a simple problem, but that is a long way from being able to modify the kind of code it is made of, or even from producing anywhere near peak-human insight into algorithms.
Now you might wonder if that is just a result of the model being too small, the dataset too tiny, or whatever. But I don’t think so. The changes they introduced don’t make efficient use of compute, and it seems to be near the point where it won’t improve with much more data. Doubling all of the model’s resources and size is not going to double the results, or probably even get you a 50% increase. So I’m not that much more worried, i.e. my timelines didn’t just get cut in half by this.
Nice. By roughly how much would you say it affected your timelines? And what would your timelines be at this point? I know this is hard to say, so no worries if you don’t wanna answer, but I am curious.
I don’t really know. I guess it might have shortened my timelines by a few months to a year, but I think I had longer timelines than most people on LessWrong to begin with. I know that’s not that helpful, but I don’t have an explicit date for when I think catastrophe will arrive.
What do you mean by “give it enough tries”?
That is the sampling part of AlphaCode. When you ask it a question, the more samples you let it take as potential solutions, the more likely you are to get a correct answer. The success rate of this seems to grow log-linearly with the number of samples, which is pretty poor.
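(To make that concrete, here’s a toy sketch of why the aggregate solve rate can grow so slowly even though each individual problem gets more likely to be solved with more samples. The assumption that every problem has some fixed per-sample chance of a correct program, spread over many orders of magnitude, is mine for illustration and isn’t anything from the paper.)

```python
import random

random.seed(0)

# Toy setup (my assumption, not the paper's): 1,000 problems, each with a
# per-sample chance of yielding a correct program, spread log-uniformly
# over many orders of magnitude of difficulty.
problems = [10 ** random.uniform(-9, -1) for _ in range(1_000)]

def solve_rate(k: int) -> float:
    """Fraction of problems where at least one of k independent samples is correct."""
    return sum(1 - (1 - p) ** k for p in problems) / len(problems)

for k in (10, 100, 1_000, 10_000, 100_000):
    print(k, f"{100 * solve_rate(k):.1f}%")
# Each 10x increase in samples buys roughly the same extra chunk of problems,
# i.e. the aggregate solve rate grows roughly log-linearly in k.
```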
Oh, I see, I thought you meant AlphaCode had some sort of recurrence (does it?), i.e. iterating on its answer like a human would when debugging code. Log-linear doesn’t sound bad; you wouldn’t expect a human to be able to write a correct program on the first try without debugging either, would you?
It is pretty bad, because the constant matters here. If you have a method whose success rate (in %) is C·log(1 + k), where k is the number of samples, then with C = 1 you’d need 2^100 ≈ 10^30 samples per inference to get all the competitive programming questions right. As it stands, C seems pretty small: at k = 100,000 you solve maybe 33% of the problems, and that isn’t much better than at k = 1,000.
That’s leaving aside the question of whether you will continue to get log-linear improvements, which I find fairly unlikely. Though if you could get C to something like 10 or 20 with one or two innovations, I’d be scared or even terrified.
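(For concreteness, here’s the arithmetic behind my made-up numbers. The functional form and the value of C are illustrative, not fitted to anything in the paper beyond the rough 33% figure.)

```python
import math

# Toy model: percent of problems solved with k samples ~ C * log2(1 + k).
def solve_rate(C: float, k: int) -> float:
    return C * math.log2(1 + k)

# With C = 1 you'd need log2(1 + k) = 100, i.e. k ~ 2**100 samples, to hit 100%.
print(f"{2 ** 100:.2e}")                 # ~1.27e+30

# A small C roughly reproduces the quoted figures: C ~ 2 gives ~33% at
# k = 100,000 and only ~20% at k = 1,000, i.e. 100x the samples for little gain.
print(round(solve_rate(2, 100_000), 1))  # ~33.2
print(round(solve_rate(2, 1_000), 1))    # ~19.9
```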
To check my understanding:
If C were 20, then 100,000 samples would be enough to solve approximately 100% of these problems?
But instead C is something like 6-7, because 100,000 samples only get you to 33%?
Yeah, basically, but I was using log_2(x) instead of log_10(x) in my made-up example. Here’s some actual data. What they mean by “unlimited attempts” is that there is no filter on the sampled solutions, so all of them get submitted to the judge. I expect false positives or incredibly slow solutions to be more likely than performance actually increasing without limit.
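(The same back-of-the-envelope in code, showing how the implied C depends on the log base; this is still just the toy model, with the rough 33% figure plugged in.)

```python
import math

k = 100_000
observed = 33.0  # rough percent of problems solved at k samples

# Implied constant under each base, in the toy model solve% = C * log(1 + k):
print(observed / math.log2(1 + k))   # ~2.0 (the base-2 version I had in mind)
print(observed / math.log10(1 + k))  # ~6.6 (the base-10 reading above)

# C needed to reach ~100% at k = 100,000 under each base:
print(100 / math.log2(1 + k))        # ~6
print(100 / math.log10(1 + k))       # ~20
```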