I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:
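(For anyone who wants to run these, the setup from the previous example was along these lines. The struct definition and the random fill are my reconstruction rather than the exact original; any Int data of the right shape will do.)

using BenchmarkTools  # provides @btime

# Reconstructed setup; the field order and value ranges are assumptions.
struct HeightAndWeight
    height::Int
    weight::Int
end

N = 40_000_000  # matches the loop bound in mss below

# Array-of-structs: one vector of composite records.
heights_and_weights = [HeightAndWeight(rand(150:200), rand(50:100)) for _ in 1:N]

# Struct-of-arrays: two parallel vectors, one per field.
just_heights = [hw.height for hw in heights_and_weights]
just_weights = [hw.weight for hw in heights_and_weights]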
function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end

@btime(my_double_sum(heights_and_weights))
> 50.342 ms (1 allocation: 16 bytes)
function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip(heights, weights)
        sum_heights_and_weights += height + weight
    end
    return sum_heights_and_weights
end

@btime(my_double_sum2(just_heights, just_weights))
> 51.060 ms (1 allocation: 16 bytes)
Looks like it is marginally quicker in the first of those cases. Note that you’re iterating over the objects linearly, which means that the processor’s memory-access prediction hardware will have a nice easy job; and you’re iterating over the whole thing without repeating any, which means that all the cache is buying you is efficient access to prefetched bits. After defining
function mss(data::Vector{HeightAndWeight})
    s = 0
    for i = 1:40000000
        j = ((i * i) % 40000000) + 1
        @inbounds s += data[j].weight * data[j].height
    end
    return s
end

function mss2(heights::Vector{Int}, weights::Vector{Int})
    s = 0
    for i = 1:40000000
        j = ((i * i) % 40000000) + 1
        @inbounds s += weights[j] * heights[j]
    end
    return s
end
(mss for “my scattered sum”; the explicit type declarations, literal size and @inbounds made it run faster on my machine, and getting rid of overhead seems like a good idea for such comparisons; the squaring is just a simple way to get something not too predictable) I got the following timings:
julia> @btime(mss(heights_and_weights))
814.056 ms (0 allocations: 0 bytes)
400185517392
julia> @btime(mss2(just_heights,just_weights))
1.253 s (0 allocations: 0 bytes)
400185517392
so the array-of-structs turns out to work quite a lot better in this case. (Because we’re doing half the number of hard-to-predict memory accesses.)
Note how large those timings are, by the way. If I just process every row in order, I get 47.9ms for array-of-structs and 42.6ms for struct-of-arrays. 40.3ms if I use zip as you did instead of writing array accesses explicitly, which is interesting; I’m not that surprised the compiler can eliminate the superficial inefficiencies of the zip-loop, but I’m surprised it ends up strictly better rather than exactly the same. Anyway, this is the other way around from your example: struct-of-arrays is faster for me in that situation. But when we process things in “random” order, it’s 20-30x slower because we no longer get to make much use of the cache or memory-access prediction, and then array-of-structs wins handily.
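For concreteness, the in-order variants behind those numbers look like this (my reconstruction; the names mss_linear and mss2_linear are hypothetical, and the bodies are just mss and mss2 with a linear index):

# Reconstructed in-order versions; the names are hypothetical.
function mss_linear(data::Vector{HeightAndWeight})
    s = 0
    for j = 1:40000000          # linear, prefetch-friendly access
        @inbounds s += data[j].weight * data[j].height
    end
    return s
end

function mss2_linear(heights::Vector{Int}, weights::Vector{Int})
    s = 0
    for j = 1:40000000
        @inbounds s += weights[j] * heights[j]
    end
    return s
end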
… Actually, looking at the generated code, those aren’t the only reasons. It’s also that when iterating over the whole array linearly the compiler can generate nice SIMD code. Cute. [EDITED to add:] Of course this might also be part of the story when comparing SoA and AoS in the linear-iteration case; SoA may be more vectorization-friendly. That said, this is only relevant to some possible workloads; if you’re doing something much more complicated than a long string of multiply-accumulates, SIMD may no longer be a factor and the relationship between the two approaches may change. The moral: Benchmark, benchmark, benchmark!
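(If you want to see the SIMD for yourself, one way, a sketch of the workflow rather than necessarily the exact command I used, is to dump the native code for the linear loop and look for packed instructions like vpaddq on x86-64:

using InteractiveUtils  # loaded automatically in the REPL

@code_native debuginfo=:none my_double_sum2(just_heights, just_weights)

The scattered-access versions should come out as scalar loads and arithmetic instead.)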
Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.
Yeah, SIMD is a significant point I forgot to mention.
It’s a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.
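For reference, a minimal sketch of the StructArrays.jl switch, assuming the HeightAndWeight setup above:

using StructArrays

# Wrap the existing array-of-structs: storage becomes column-wise (SoA),
# but indexing still yields HeightAndWeight values, so the same loops run
# unchanged on either layout.
soa = StructArray(heights_and_weights)

# Or build it straight from the parallel field vectors (field order matters):
soa2 = StructArray{HeightAndWeight}((just_heights, just_weights))

soa[1].height == heights_and_weights[1].height  # true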