Fermi Estimating How Long an Algorithm Takes

This is a “live” Fermi estimate, where I expect to make mistakes and haven’t done much editing.
If you’re working on a simple mathematical program, you can estimate the number of floating point or integer operations it performs, compare that to statistics on CPU latencies and throughput, and use that to get a rough runtime estimate.
Let’s try estimating how long it would take to run a simple primality test, a modified Sieve of Eratosthenes:
function modified_sieve(n)
    known_primes = [2]
    for i in 3:2:n
        # Remainder of i against every prime found so far
        mods = [i % p for p in known_primes]
        if all(mods .> 0)
            push!(known_primes, i)
        end
    end
    return known_primes
end
In this algorithm, we store a list of known prime numbers. We iterate through the odd numbers and check whether each one is divisible by any known prime, by calculating its modulus against every prime in our list. If it isn’t divisible by any of them, we add it to the list of primes.
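As a quick sanity check (not part of the timing, just to confirm the logic), running it on a small input should give the familiar small primes:

modified_sieve(30)   # expected: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]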
There’s a lot we could do to come up with a better algorithm, but what I’m interested in is estimating how fast an ideal execution of this algorithm could be. How long could we expect this to run, on a modern CPU?
I expect, looking at this, that the vast majority of the time is going to be spent on mods = [i % p for p in known_primes]. Calculating a modulus is something like 30-50 cycles according to a few quick Google searches, and everything else should cost significantly less than that.
The number of mods we need to calculate for any i is the number of known primes so far, which is roughly i/log(i). Let’s say we’re finding all the primes up to 100,000. Then we need to calculate roughly the sum of i/log(i) over the odd numbers from 3 to 99,999, which is about 200 million mods. At 30-50 cycles each, that’s about 6-10 billion cycles. On a 3 GHz processor, that should take roughly 2-3.3 seconds.
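If you don’t want to trust my mental arithmetic, here’s a rough sketch of the same estimate in Julia (the exact figures don’t matter, only the order of magnitude):

# For each odd i we test against all known primes, roughly i/log(i) of them
# (log here is the natural log, as in the prime counting approximation).
est_mods = sum(i / log(i) for i in 3:2:99_999)    # roughly 2e8
# 30-50 cycles per mod, on a 3 GHz core
est_seconds = est_mods .* (30, 50) ./ 3e9         # roughly 2-4 seconds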
What do we actually see?
using BenchmarkTools
@btime modified_sieve(100_000)
> 352.089 ms (240255 allocations: 2.12 GiB)
That...shouldn’t actually happen. Actual programs usually take 3x-10x as long as my estimates, because there’s a lot of overhead I’m not accounting for, yet this run came in well under even the idealized number. This suggests that there’s some optimization happening that I didn’t expect. And looking at the generated code, I do see several references to simdloop. Ok then. That apparently more than made up for the overhead.
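If you want to look for that yourself, the standard-library tools are enough; exactly how the simd hints show up depends on your Julia version and CPU:

# Dump the LLVM IR for one call and scan it for vectorization markers
# (vector instructions, loop metadata such as julia.simdloop).
using InteractiveUtils   # already loaded in the REPL
@code_llvm modified_sieve(1_000)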
Let’s try an algorithmic improvement. We don’t actually need to check against all primes – we only have to go up to √i each time. That brings us down to an estimated 2 million mods, which in theory should be 100 times faster.
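The same sort of sketch as before backs up that revised count:

# Each odd i is now only tested against primes up to sqrt(i),
# roughly sqrt(i)/log(sqrt(i)) of them.
sum(sqrt(i) / log(sqrt(i)) for i in 3:2:99_999)   # roughly 2e6

Here’s the updated version: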
function modified_sieve2(n)
    known_primes = [2]
    for i in 3:2:n
        # Only test primes p with p^2 <= i, so that squares like 9 and 25
        # are still caught by their prime root
        mods = [i % p for p in known_primes if p^2 <= i]
        if all(mods .> 0)
            push!(known_primes, i)
        end
    end
    return known_primes
end
@btime modified_sieve2(100_000)
> 104.270 ms (336536 allocations: 83.04 MiB)
Hmm. It’s faster, but nowhere near as much as expected – 3.5x vs. 100x. Adding the filter on every loop iteration probably hurts us in a couple of ways: it adds a fair bit of work, and it makes it harder for the CPU to predict which calculations need to happen. Let’s see if we can get a speedup by making the set of primes to be checked more predictable:
function modified_sieve3(n)
    known_primes = [2]
    small_primes = Int[]          # primes whose squares we have already passed
    next_small_prime_index = 1
    for i in 3:2:n
        # Grow the list of test primes only when i reaches the next prime's square
        if known_primes[next_small_prime_index]^2 <= i
            push!(small_primes, known_primes[next_small_prime_index])
            next_small_prime_index += 1
        end
        mods = [i % p for p in small_primes]
        if all(mods .> 0)
            push!(known_primes, i)
        end
    end
    return known_primes
end
We’ve added a bunch of extra code to make the mod calculations more predictable: we track the primes that need to be tested in a separate list, and only update that list when needed. And it pays off:
@btime modified_sieve3(100_000)
> 9.291 ms (150010 allocations: 25.87 MiB)
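And as a sanity check that the extra bookkeeping didn’t break anything, all three versions should agree with each other, and with a well-tested library (the last line assumes you have the Primes.jl package installed):

@assert modified_sieve2(10_000) == modified_sieve(10_000)
@assert modified_sieve3(10_000) == modified_sieve(10_000)
using Primes
@assert modified_sieve3(10_000) == primes(10_000)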
This is a major benefit of making Fermi estimates. We predicted a possible 100x speedup and got only 3.5x. That suggested there was more on the table, and we were able to get a further 10x. If I hadn’t made the estimate, I might have thought the filter version was the best possible, and stopped there.
And I find this to translate pretty well to real projects – the functions where I estimate the greatest potential improvements are the ones that I actually end up being able to speed up a lot, and this helps me figure out whether it’s worth spending time optimizing a piece of code.