philh comments on A command-line grammar of graphics

philh 1 Apr 2021 19:12 UTC
2 points
Substance: is grammar of graphics actually a good paradigm? It’s a good question, and I’m not convinced my “it’s the One True Way” feeling comes from a place of “yes I have good reason to think this is a good paradigm”. I haven’t actually thought much about it prior to this, so the rest of my comment is kind of tentative.

So let’s say for now we don’t need any form of interactivity, and it’s fine to just think of a plot as being a list of pixels. I’m not sure we do have the tradeoff you describe? Certainly it’s not a one-dimensional one. You could imagine a program that forces you to just set every pixel, and then you could imagine that it adds functions for “draw a line”, “draw a filled-in rectangle”, but you still have access to the raw pixels. And then it can add “draw a bar chart” and “draw a line graph”, and so on, all the way up to “draw a quasi-rectilinear radial spiral helix Fourier plot”, and it never needs to lose access to “draw a raw pixel”.
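A minimal sketch of that layering, in plain Python. The `Canvas` class and its method names are hypothetical, made up for illustration and not taken from any real plotting library. The point is only that each level is built from the one below it, and the lowest level stays callable.

```python
# Hypothetical layered drawing API; every name here is invented for illustration.
class Canvas:
    def __init__(self, width, height):
        self.width = width
        self.height = height
        # pixels[y][x], with y = 0 at the top row
        self.pixels = [[0] * width for _ in range(height)]

    def set_pixel(self, x, y, value=1):
        """Lowest level: set one raw pixel (silently ignored if off-canvas)."""
        if 0 <= x < self.width and 0 <= y < self.height:
            self.pixels[y][x] = value

    def fill_rect(self, x, y, w, h, value=1):
        """Mid level: a filled rectangle, built only from set_pixel."""
        for yy in range(y, y + h):
            for xx in range(x, x + w):
                self.set_pixel(xx, yy, value)

    def bar_chart(self, heights, bar_width=2, gap=1):
        """High level: bars rising from the bottom edge, built from fill_rect."""
        for i, h in enumerate(heights):
            x = i * (bar_width + gap)
            self.fill_rect(x, self.height - h, bar_width, h)


canvas = Canvas(width=8, height=5)
canvas.bar_chart([3, 5, 1])
# Raw pixel access is still there after the high-level call.
canvas.set_pixel(7, 0, value=2)
```

Adding `bar_chart` took nothing away from `set_pixel`. What it did change is that the caller no longer knows, without reading `bar_chart`, which pixels are already taken.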

The awkward thing is, once you have “draw a bar chart” etc., the programmer doesn’t necessarily know which pixels will get set, and at that point “draw a pixel” becomes a lot less useful. But that’s kind of true with the lower-level primitives too, as soon as you start calling them based on runtime data. Is there space in one corner to place the legend? That’s not necessarily easier to figure out when you’re just drawing pixels than when you’re calling a high-level “draw graph” function.

(Though it might be less differential effort. Like, if you’re already looping through your data manually, you can add a flag for any points in the corner. If you’re just passing your data to another function that loops through it, you now need to add a manual loop. And if you don’t know exactly where that other function draws, based on the data, maybe you don’t know when to set that flag… but the worst case scenario is that function doesn’t make your life easier, and then you can just not use it.)
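To make the legend example concrete, here is a rough sketch of the “is the corner free?” check. The function name, the `frac` parameter and the choice of the top-right corner are all my assumptions. Note that it is its own pass over the data, which is exactly the extra manual loop you need once the plotting loop lives inside someone else’s function.

```python
def corner_is_free(points, x_range, y_range, frac=0.25):
    """Return True if no (x, y) point lies in the top-right corner of the plot area.

    The corner is the region covering the top `frac` of the y range and the
    rightmost `frac` of the x range. Hypothetical helper, for illustration only.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    x_cut = x_max - frac * (x_max - x_min)
    y_cut = y_max - frac * (y_max - y_min)
    return not any(x >= x_cut and y >= y_cut for x, y in points)


legend_fits = corner_is_free([(1, 1), (5, 5)], x_range=(0, 10), y_range=(0, 10))
legend_blocked = corner_is_free([(1, 1), (9, 9)], x_range=(0, 10), y_range=(0, 10))
```

This only works when you know where points land in data coordinates. If the drawing function does something less predictable with its input, the check has to know about that too.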

Where I’m going with this: okay, suppose you’re using plotnine and it doesn’t implement the kind of plot you want. Is it any harder to implement that plot in plotnine than it would be in matplotlib? I’m not sure it is. If you want to balance several small bars on top of a big one, in matplotlib you need to figure out the x,y,w,h (or equivalent) of a bunch of rectangles. In plotnine, if you have the x,y,w,h of a bunch of rectangles, you can just draw them. It’s maybe a little more friction; for example, you might be less familiar with the “draw an arbitrary rectangle” component than the “draw a rectangle given just x,h” component that figures out y,w for you (and is probably implemented in terms of the former). But, I guess it feels like relatively low friction compared to the hassle of figuring out the coordinates.
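Here is a sketch of the “figure out the coordinates” part for the small-bars-on-a-big-bar case. It is plain Python with no plotting library, and the helper names are mine. The part that takes thought is the same whichever library draws the result: matplotlib’s `Rectangle` patch takes an (x, y) corner plus a width and height, and plotnine’s `geom_rect` takes `xmin`, `xmax`, `ymin`, `ymax` aesthetics, so `to_corners` covers the conversion.

```python
def stacked_rects(big_x, big_w, big_h, small_heights, gap=0.0):
    """Return (x, y, w, h) tuples: one big bar, then small bars side by side on top of it.

    Hypothetical helper; (x, y) is the bottom-left corner of each rectangle.
    """
    rects = [(big_x, 0.0, big_w, big_h)]
    n = len(small_heights)
    if n == 0:
        return rects
    # Split the big bar's width evenly among the small bars, leaving gaps between them.
    small_w = (big_w - gap * (n - 1)) / n
    for i, h in enumerate(small_heights):
        x = big_x + i * (small_w + gap)
        rects.append((x, big_h, small_w, h))
    return rects


def to_corners(rect):
    """Convert (x, y, w, h) into the xmin/xmax/ymin/ymax form."""
    x, y, w, h = rect
    return {"xmin": x, "xmax": x + w, "ymin": y, "ymax": y + h}


rects = stacked_rects(big_x=0.0, big_w=4.0, big_h=10.0, small_heights=[1.0, 2.0])
corners = [to_corners(r) for r in rects]
```

Once you have `rects`, the last step in either library is a short loop or a single layer. That is why I think the extra friction is small next to working out the coordinates.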

So that’s part of my answer. You say there’s a tradeoff to be made, but I’m not sure a grammar of graphics is taking significant losses on that tradeoff.

And on the other hand, is it making significant gains?

A naive answer: each layer in ggplot or plotnine has a “geom”, a “stat” and a “position”. You can mix-and-match these, so $O (l + m + n)$ effort gives you $O (l m n)$ types of graph.
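Spelled out as arithmetic, the naive count looks like this. The three lists are an arbitrary handful of names picked for illustration, not a full inventory of what ggplot or plotnine ship, and nothing here checks whether a given combination makes sense.

```python
from itertools import product

# Arbitrary sample of component names, chosen for illustration.
geoms = ["bar", "boxplot", "point"]
stats = ["identity", "count", "boxplot"]
positions = ["identity", "dodge", "stack"]

# Effort is additive: each component is written once.
pieces_written = len(geoms) + len(stats) + len(positions)

# Combinations are multiplicative: every (geom, stat, position) triple.
layer_specs = list(product(geoms, stats, positions))
```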

This is obviously silly. Some of the elements aren’t compatible with each other, and some of those types of graph you’d never want. But do you get some gains in that direction? It seems to me that you do; the same position adjustment (“dodge”) that puts your bars in a bar chart side-by-side will probably put your boxplots side-by-side too. On the other hand it might not be loads: it looks like most stats are implemented for specific geoms, for example. There’s only one geom that uses stat_boxplot by default, and likewise only one each for stat_ydensity and stat_smooth. You could use stat_boxplot with a geom other than geom_boxplot, but I don’t know if you ever would. I guess one thing you do get from this setup is, with a clear distinction between the statistical transformation and the data-drawing, you’re unlikely to ever say “aw man, this boxplot-drawing function expects me to pass in my raw data to compute the statistics itself, but I already have the statistics and I threw out the raw data”. That’s fine, you just use geom_boxplot with stat_identity.
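A toy version of that stat/geom split, to show why having the statistics but not the raw data stops being a problem. This is plain Python, not plotnine; the function names are hypothetical. The summary is also simplified: whiskers go to the min and max here, whereas a real boxplot stat typically handles outliers separately.

```python
import statistics


def stat_five_number(values):
    """'Stat' step: reduce raw data to the five numbers a box needs (simplified)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"ymin": min(values), "lower": q1, "middle": median,
            "upper": q3, "ymax": max(values)}


def stat_passthrough(summary):
    """'Identity' step: the caller already has the summary, so hand it on unchanged."""
    return dict(summary)


def geom_box(summary, x, width=0.8):
    """'Geom' step: turn a summary into drawing primitives. It never sees raw data."""
    half = width / 2
    return [
        ("rect", x - half, summary["lower"], width, summary["upper"] - summary["lower"]),
        ("segment", x - half, summary["middle"], x + half, summary["middle"]),
        ("segment", x, summary["ymin"], x, summary["lower"]),   # lower whisker
        ("segment", x, summary["upper"], x, summary["ymax"]),   # upper whisker
    ]


# Path 1: start from raw data.
from_raw = geom_box(stat_five_number([1, 2, 3, 4, 5]), x=1.0)

# Path 2: raw data is gone, only the summary survives.
saved = {"ymin": 1, "lower": 2.0, "middle": 3.0, "upper": 4.0, "ymax": 5}
from_summary = geom_box(stat_passthrough(saved), x=1.0)
```

Both paths go through the same `geom_box`. Only the step in front of it changes, and that is the role stat_identity plays for geom_boxplot.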

So, I guess my sense is that a grammar of graphics does help make things easier, relative to lower-level things. Like, makes more types of things easy, with less effort (and less forethought) needed from the people making the existing things.

But I’m not super confident about either of these, and this is almost entirely theoretical, so.