You Don’t Always Need Indexes
Sometimes you have a lot of data, and one approach to supporting quick searches is pre-processing it to build an index, so that a search only has to look at a small fraction of the total data. The threshold at which it’s worth switching to indexing, though, might be higher than you’d guess. Here are some cases I’ve worked on where full scans were better engineering choices:
Ten years ago I wrote an interoffice messaging application for a small billing service. Messages were stored in MySQL, and I was going to add indexing if full-text searches got slow or we had load issues, but even with ten years’ worth of messages to search it stayed responsive. (This was before MySQL supported FULLTEXT indexes with InnoDB, added in v5.6, and at the time indexing would have meant installing and maintaining Sphinx.)

I recently came across someone maintaining a 0.5GB full-text index to support searching their shell history, 100k commands. I use grep on a flat file, and testing now it takes 200ms for a query across my 180k history entries.

My contra dance search tool ranks each dance in response to your query, with no geospatial indexing, because there are just ~350 dances.
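To make the shell-history case above concrete: the flat-file approach is nothing more than a substring scan over a newline-delimited file. Here’s a minimal TypeScript sketch of the same kind of scan grep is doing for me; the history path, the query, and the case-insensitive matching are placeholders, not my actual setup.

```typescript
import { readFileSync } from "node:fs";

// Scan a newline-delimited history file for lines containing the query.
// The path and the case-insensitive matching are illustrative assumptions.
function searchHistory(path: string, query: string): string[] {
  const needle = query.toLowerCase();
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.toLowerCase().includes(needle));
}

// Rough timing of a single query, the way you might sanity-check it by hand.
const start = Date.now();
const hits = searchHistory(`${process.env.HOME}/.bash_history`, "rsync");
console.log(`${hits.length} matches in ${Date.now() - start}ms`);
```

There’s no ingestion step at all: the only “index” is the flat file the shell already writes.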
The viral counts explorer I’ve been playing with for work searches the taxonomic tree of human viruses in real time, scanning ~15k names with JS’s “includes” method about as fast as you can type.
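That kind of as-you-type search is easy to picture in code: keep the whole name list in memory and filter it on every keystroke. A rough browser-side sketch in TypeScript, with made-up names and element IDs rather than anything from the actual explorer:

```typescript
// Filter a preloaded list of taxon names on every keystroke.
// The names, element IDs, and 50-result cap are placeholders, not the real tool.
const names: string[] = ["Influenza A virus", "Rhinovirus A" /* ...~15k entries */];

function matches(query: string): string[] {
  const q = query.toLowerCase();
  return names.filter((name) => name.toLowerCase().includes(q));
}

const input = document.querySelector<HTMLInputElement>("#search")!;
const results = document.querySelector<HTMLUListElement>("#results")!;

input.addEventListener("input", () => {
  // ~15k includes() calls per keystroke finishes in well under a frame;
  // rendering the hits, not scanning for them, is the part worth capping.
  results.innerHTML = matches(input.value)
    .slice(0, 50)
    .map((name) => `<li>${name}</li>`)
    .join("");
});
```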
When I worked in ads I would often need to debug issues using production logs, and would use Dremel (Melnik 2010, Melnik 2020) to run a distributed scan of very large amounts of data at interactive speeds. Because queries were relatively rare, an index would have been far more expensive to maintain.
Unless you know from the start that you’ll be searching hundreds of millions of records, consider starting with simple scans and only adding indexing if you can’t get acceptable performance. And even then, if queries are rare and highly varied, you may still do better to do the work at query time instead of ingestion time.
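If a scan ever does stop being fast enough, the upgrade is mostly mechanical: move work from query time to ingestion time by building something like an inverted index. A toy TypeScript sketch of that trade, just to show what you’re deferring; it isn’t how any of the systems above are built:

```typescript
// Query-time work only: O(records) per search, nothing to build or keep in sync.
function scan(records: string[], query: string): string[] {
  return records.filter((r) => r.includes(query));
}

// What deferring indexing saves you: an ingestion-time structure you now have
// to build, store, and update whenever records change.
function buildIndex(records: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  records.forEach((record, id) => {
    for (const token of record.toLowerCase().split(/\W+/)) {
      if (!token) continue;
      if (!index.has(token)) index.set(token, new Set());
      index.get(token)!.add(id);
    }
  });
  return index;
}

// Fast lookups, but note the query shape narrows: this answers whole-token
// queries, not the arbitrary substring matches the scan handled for free.
function lookup(index: Map<string, Set<number>>, records: string[], term: string): string[] {
  const ids = index.get(term.toLowerCase()) ?? new Set<number>();
  return Array.from(ids).map((id) => records[id]);
}
```

The scan is three lines. Everything the index version adds (building it, storing it, keeping it in sync, and living with the narrower queries it supports) is cost you only want to take on once the scan has actually become the bottleneck.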