wolajacy comments on Moving Data Around is Slow

wolajacy 22 Mar 2021 18:04 UTC
3 points
AFAIK, popular data science tools (Spark, Pandas, etc.) already use columnar formats for data serialization and network-based communication: https://en.wikipedia.org/wiki/Apache_Arrow
There's a similar idea for disk storage (which is again orders of magnitude slower, so the gains in certain situations might be even bigger): https://en.wikipedia.org/wiki/Apache_Parquet
Generally, if you’re doing big data, there are actually more benefits from using this layout: data homogeneity means much better compression and possibilities for smarter encodings.
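To make the "smarter encodings" point concrete, here is a minimal sketch of two encodings that only pay off when values of the same kind sit next to each other: run-length encoding and dictionary encoding. The helper names and the toy data are made up for illustration; this is not how Arrow or Parquet implement them internally.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def dictionary_encode(column):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = {}
    codes = []
    for value in column:
        # First time a value is seen it gets the next free code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


# Toy row-oriented data: (country, year) records (hypothetical example).
rows = [
    ("US", 2021),
    ("US", 2021),
    ("US", 2021),
    ("DE", 2021),
    ("DE", 2021),
]

# Pivot the rows into columns so each list holds one homogeneous field.
country_column, year_column = (list(col) for col in zip(*rows))

print(run_length_encode(country_column))
print(run_length_encode(year_column))
print(dictionary_encode(country_column))
```

In the row layout the country and year values alternate, so there are no runs to collapse; once the data is pivoted into columns, each column becomes a handful of pairs or small integer codes.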
- SatvikBeri 22 Mar 2021 18:32 UTC
  3 points
  Parent
  Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything, but compression and fast transmission are definitely very nice.
  Until recently, numpy and pandas were row-oriented, and this was a major bottleneck. A lot of pandas’s strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow’s%20C%2B%2B%20implementation%20provides%20essential,optimized%20for%20analytical%20processing%20performance
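The data locality argument in this reply can be sketched by looking at where one column's values land in a flat buffer under each layout. The function names and the 4x3 table are hypothetical, and the sketch only computes offsets; it does not measure cache behaviour or model any particular library.

```python
def row_major_offsets(n_rows, n_cols, col):
    """Offsets of one column's values when whole rows are stored back to back."""
    return [row * n_cols + col for row in range(n_rows)]


def column_major_offsets(n_rows, n_cols, col):
    """Offsets of one column's values when whole columns are stored back to back."""
    return [col * n_rows + row for row in range(n_rows)]


# A small 4-row, 3-column table (sizes chosen only for illustration).
N_ROWS, N_COLS = 4, 3

row_layout = row_major_offsets(N_ROWS, N_COLS, col=1)
col_layout = column_major_offsets(N_ROWS, N_COLS, col=1)

print(row_layout)  # strided: one hit every N_COLS slots
print(col_layout)  # contiguous: adjacent slots
```

Scanning a single column in the row layout touches every third slot here, so most of each fetched cache line would be values from other columns; in the column layout the same scan reads adjacent slots.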