Has anyone studied the Red Black Tree algorithms recently? I’ve been trying to implement them using my Finite State technique that enables automatic generation of flow diagrams. This has been working well for several other algorithms.
But the Red Black tree rebalancing algorithms seem ridiculously complicated. Here is an image of the deletion process (extracted from this Java code) - it’s far more complicated than an algorithm like MergeSort or HeapSort, and that only shows the deletion procedure!
I’m weighing two hypotheses:
Keeping a binary tree balanced in O(log n) time per operation is an intrinsically complex task.
There is some much simpler method to efficiently maintain balance in a binary tree, but nobody bothered looking for it after the RB tree algorithms and analysis were published.
I’m leaning toward the latter theory. It seems to me that most of the other “elementary” algorithms of computer science are comparatively simple, so the weird overcomplexity of the tool we use for binary tree balancing is some kind of oversight. Here is the Wikipedia page on RB trees - notice how hard the description of the algorithm is to follow.
An intuition is that red-black trees encode 2-3-4 trees (B-trees of order 4) as binary trees.
For a simpler case, 2-3 trees (i.e. B-trees of order 3) are either empty, a (2-)node with 1 value and 2 subtrees, or a (3-)node with 2 values and 3 subtrees. The idea is to insert new values in their sorted position, expanding a 2-node into a 3-node when it has to take another value, and bubbling the middle value up to the parent when a 3-node would have to grow. This keeps the tree balanced.
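To make that concrete, here is a minimal Haskell sketch of the bubble-up insertion just described (all types and names are mine, invented for illustration; deletion is omitted):

```haskell
data Tree a = Empty
            | Two   (Tree a) a (Tree a)            -- 2-node: 1 value, 2 subtrees
            | Three (Tree a) a (Tree a) a (Tree a) -- 3-node: 2 values, 3 subtrees

-- An insertion either fits in place, or overflows a 3-node and
-- "kicks" its middle value up for the parent to absorb.
data Grow a = Fit (Tree a) | Kick (Tree a) a (Tree a)

insert :: Ord a => a -> Tree a -> Tree a
insert x t = case ins x t of
  Fit t'     -> t'
  Kick l m r -> Two l m r   -- the root itself split: the tree grows one level

ins :: Ord a => a -> Tree a -> Grow a
ins x Empty = Kick Empty x Empty
ins x t@(Two l v r)
  | x < v = case ins x l of
      Fit l'     -> Fit (Two l' v r)
      Kick a m b -> Fit (Three a m b v r)    -- a 2-node absorbs the kick
  | x > v = case ins x r of
      Fit r'     -> Fit (Two l v r')
      Kick a m b -> Fit (Three l v a m b)
  | otherwise = Fit t                        -- already present
ins x t@(Three l u m v r)
  | x < u = case ins x l of
      Fit l'     -> Fit (Three l' u m v r)
      Kick a w b -> Kick (Two a w b) u (Two m v r)  -- split: u bubbles up
  | x > v = case ins x r of
      Fit r'     -> Fit (Three l u m v r')
      Kick a w b -> Kick (Two l u m) v (Two a w b)
  | x > u, x < v = case ins x m of
      Fit m'     -> Fit (Three l u m' v r)
      Kick a w b -> Kick (Two l u a) w (Two b v r)
  | otherwise = Fit t
```

Note that the tree only ever gets deeper at the root, when a kick propagates all the way up, which is exactly why all leaves stay at the same depth.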
A 2-3-4 tree just generalises the above.
Now the intuition is that red means “I am part of a bigger node.” That is, red nodes represent the values contained in some higher black node. If the black node represents a 2-node, it has no red children. If it represents a 3-node, it has one red child, and if it represents a 4-node, it has 2 red children.
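To make the correspondence concrete, here is a small Haskell sketch (types and names are mine, purely for illustration) that reads off the 2-3-4 node represented by a black node and its red children:

```haskell
data Color = Red | Black
data RB a  = Leaf | Node Color (RB a) a (RB a)

-- The 2-3-4 node "rooted" at a black node: its own value together
-- with the values of any red children.
data Node234 a = N2 a | N3 a a | N4 a a a

decode :: RB a -> Maybe (Node234 a)
decode (Node Black (Node Red _ x _) y (Node Red _ z _)) = Just (N4 x y z)
decode (Node Black (Node Red _ x _) y _)                = Just (N3 x y)
decode (Node Black _ y (Node Red _ z _))                = Just (N3 y z)
decode (Node Black _ y _)                               = Just (N2 y)
decode _ = Nothing  -- a red node or a leaf is not the root of a B-tree node
```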
In this context, the “rules” of red-black trees make complete sense. For instance, we only count black nodes when comparing branch heights because those represent the actual B-tree nodes. I’m sure that with a bit of work, it’s possible to make complete sense of the insertion/deletion rules through the B-tree lens, but I haven’t done it.
edit: I went through the insertion rules and they do make complete sense if you think about a B-tree while you read them.
There are other kinds of binary tree with simpler rebalancing procedures, most notably the AVL tree mentioned by cousin_it. I think red-black tends to dominate for some combination of these reasons:
Tradition. Some influential sources (e.g., Sedgewick’s algorithms book[1], SGI’s STL implementation) used, or gave more visibility to, red-black trees, and others copied them.
Fewer rotations in rebalancing. In some circumstances (certainly deletion; I forget whether it’s true for insertion too) AVL trees may need to do Theta(log n) rotations, whereas red-black trees never need more than O(1).
Does this mean an actual advantage in performance? Maaaaybe. Red-black trees are, in the worst case at least, worse-balanced, which may actually matter more. Such benchmarks as I’ve seen don’t suggest a very big advantage for either red-black or AVL over the other.
Persistence. If you want to make a persistent data structure out of a binary tree, whether for practical reasons or just to show your students in Haskell, it’s easier with a red-black tree.
Metadata requirements. A red-black tree needs one bit per node, to store the redness/blackness. An AVL tree needs one and a half bits :-) to store the −1/0/+1 height difference. Perhaps in some implementations it’s OK to “waste” one bit per node but not two; the small sketch below makes the counting concrete.
[1] I think. I don’t have a copy myself. Surely it must at least mention AVL trees too, but my hazy recollection is that the balanced-tree algorithm Sedgewick gives most space to is red-black.
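On the metadata point, the per-node bookkeeping in each scheme amounts to this (a hypothetical Haskell rendering, names mine):

```haskell
data Color   = Red | Black          -- 2 states: exactly 1 bit per node
data Balance = LeanL | Even | LeanR -- 3 states: log2 3 ≈ 1.58 bits per node
```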
Hm, my implementation of AVL trees seems persistent enough?
I expect it is—I haven’t looked. But AIUI the usual way of making this sort of structure persistent amounts to remembering a log of everything you did, so the requirement for more rotations means that a persistent AVL tree ends up less space-efficient than a persistent red-black tree. Something like that, anyway, I haven’t attempted to build either.
In Haskell, or any GC language really, you don’t need a log to make a tree data structure persistent. Just always allocate new nodes instead of changing old ones, so all operations will return a new version of the whole tree. That’s O(log n) allocations per operation, because most subtrees can be shared between the old and new versions, only the nodes along one path down the tree need to be new. Anyone who kept a reference to an old version’s root can keep accessing the entire old version, because nodes never change. And if nobody kept a reference, GC picks it up. That way it’s also easy to have branching history, which seems impossible with a log.
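A minimal sketch of that path-copying idea on a plain unbalanced BST (the balancing machinery is orthogonal, so it’s left out; names are mine):

```haskell
data Tree a = Leaf | Node (Tree a) a (Tree a)

-- Returns a new version of the tree; the old version is untouched
-- and shares every subtree off the insertion path with the new one.
insert :: Ord a => a -> Tree a -> Tree a
insert x Leaf = Node Leaf x Leaf
insert x t@(Node l v r)
  | x < v     = Node (insert x l) v r   -- one fresh node; r is shared
  | x > v     = Node l v (insert x r)   -- one fresh node; l is shared
  | otherwise = t                       -- already present: share everything

-- Usage: t1 remains valid and unchanged after t2 is built from it.
--   let t1 = foldr insert Leaf [3, 1, 4]
--       t2 = insert 2 t1   -- untouched subtrees are shared between t1 and t2
```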
Good analysis, thanks. I buy the first two points. I’d be shocked to see an implementation that actually makes use of the lower metadata requirements. Are there languages that provide a boolean primitive that uses a single bit of memory instead of a full byte? Also I don’t understand what you mean by persistence.
Suppose you’re making tree structures in a pure functional language, where there is no mutable state. Then what you need is functions that e.g. take a tree and a new element and return a new tree, sharing as much of its structure as possible with the old for efficiency’s sake, that has the new element in it. These are sometimes called persistent data structures because the old versions stick around (or at least might; they might get garbage-collected once the runtime can prove nothing will ever use them).
I always felt that AVL trees were easier to understand than red-black. Just wrote some Haskell code for you. As you can see, both insertion and deletion are quite simple and rely on the same rebalancing operation.
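The code itself isn’t reproduced here; as a stand-in, a minimal sketch of an AVL tree in that style, with insert and delete sharing one rebalance function, might look like the following. This is my own illustration, not the commenter’s actual code, and only lightly checked:

```haskell
data Tree a = Leaf | Node Int (Tree a) a (Tree a)  -- the Int caches subtree height

height :: Tree a -> Int
height Leaf           = 0
height (Node h _ _ _) = h

-- Smart constructor that recomputes the cached height.
node :: Tree a -> a -> Tree a -> Tree a
node l v r = Node (1 + max (height l) (height r)) l v r

-- Restore the AVL invariant at a node whose subtrees differ by at most 2.
rebalance :: Tree a -> Tree a
rebalance t@(Node _ l v r)
  | height l > height r + 1 =           -- left-heavy
      case l of
        Node _ ll lv lr
          | height ll >= height lr   -> node ll lv (node lr v r)                 -- single rotation
          | Node _ lrl lrv lrr <- lr -> node (node ll lv lrl) lrv (node lrr v r) -- double rotation
        _ -> t
  | height r > height l + 1 =           -- right-heavy (mirror image)
      case r of
        Node _ rl rv rr
          | height rr >= height rl   -> node (node l v rl) rv rr
          | Node _ rll rlv rlr <- rl -> node (node l v rll) rlv (node rlr rv rr)
        _ -> t
  | otherwise = t
rebalance t = t

insert :: Ord a => a -> Tree a -> Tree a
insert x Leaf = node Leaf x Leaf
insert x t@(Node _ l v r)
  | x < v     = rebalance (node (insert x l) v r)
  | x > v     = rebalance (node l v (insert x r))
  | otherwise = t

delete :: Ord a => a -> Tree a -> Tree a
delete _ Leaf = Leaf
delete x (Node _ l v r)
  | x < v     = rebalance (node (delete x l) v r)
  | x > v     = rebalance (node l v (delete x r))
  | otherwise = case r of
      Leaf -> l                          -- no right subtree: splice out
      _    -> let (m, r') = popMin r     -- else replace v by its successor
              in rebalance (node l m r')

-- Remove and return the minimum element of a non-empty tree.
popMin :: Tree a -> (a, Tree a)
popMin (Node _ Leaf v r) = (v, r)
popMin (Node _ l v r)    = let (m, l') = popMin l in (m, rebalance (node l' v r))
popMin Leaf              = error "popMin: empty tree"
```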
Very nice, thanks. Ahh… Haskell really is quite pretty.
AVL trees always felt simpler to me than red-black. Here’s a quick implementation in Haskell, adapted from some code I found online. Delete seems to be about as complex as insert. It might have bugs, but I’m pretty sure the correct version would be about as long.
Edit: it seems like someone has implemented deletion for RB trees in Haskell as well, and it doesn’t look too complicated. I haven’t checked it carefully though.