Tricks for Faster Spatial Indexes

Paul Ramsey
PostGIS

One of the curious aspects of spatial indexes is that the nodes of the tree can overlap, because the objects being indexed themselves also overlap.

paulblog1That means that if you're searching an area in which two nodes overlap, you'll have to scan the contents of both nodes. For a trivial example above, that's not a big deal, but if an index has a lot of overlap, the extra work can add up to a measurable query time difference.

The PostGIS spatial index is based on a R-tree structure, which naturally tries to split nodes into children that have minimized overlap. As the index is built, it automatically tries to conform to the distribution of the data, with more nodes in dense data areas and fewer in sparse areas.

So problem solved, right?

Well no! Because the index is structuring itself based on the inputs it receives, the order of inputs, particularly the early inputs, can have a strong effect on the quality of the final index.

Sorted vs Random Inputs

For example, here's a table of roads.

CREATE EXTENSION postgis;

\i roads.sql

CREATE TABLE roads_sorted AS 
  SELECT * FROM roads ORDER BY geom;

CREATE INDEX roads_sorted_idx ON roads_sorted USING GIST (geom);

If we order the roads in a Hilbert curve (which is the default sort order for geometry) we get a table that is highly spatially autocorrelated.

paulblog2

And then if we visualize the top two levels of the index (using gevel, we can see how the pages look.

paulblog2

This is not actually a terrible tree! Mostly the nodes don't have a lot of overlap, but the shape of the level one pages (the dotted lines) isn't great, and there's some stretching, both vertically and horizontally.

Now, take the same data and randomize it.

CREATE TABLE roads_random AS 
  SELECT * FROM roads ORDER BY random();

CREATE INDEX roads_random_idx ON roads_random USING GIST (geom);

The table now has no spatial correlation at all.

paulblog4

And then if we visualize the top two levels of the index, we can see how the pages look.

paulblog5

The level one nodes are definitely better, but it's hard to see how much better the level two nodes are until you zoom in.

Here's the level one nodes from sorted input.

paulblog6

And here's the level one nodes from randomized input.

paulblog7

The nodes for the index built from randomized inputs are just much more nicely arranged to cover the area with minimized overlap.

So it's pretty. So what?

Different Trees, Different Performance

In fact, a pretty index performs faster than an ugly one.

To test, we will run a self-join of the table on itself. The query plan is a nested loop join, and the table is 92K records long, so that's 92K index lookups.

# SELECT count(*) 
  FROM roads_sorted a, roads_sorted b 
  WHERE a.geom && b.geom;

 count  
--------
 505806
(1 row)

Time: 5229.205 ms (00:05.229)

# SELECT count(*) 
  FROM roads_random a, roads_random b 
  WHERE a.geom && b.geom;

 count  
--------
 505806
(1 row)

Time: 4629.745 ms (00:04.630)

So querying on the slightly prettier index is 11% faster.

In general, spatial indexes will end up with a better shape if seeded with randomized inputs. We found the same effect in our in-memory K-D trees, and addressed it both with bulk randomization, and a more clever pre-seeding approach.

Conclusions

  • Spatial indexes are sensitive to the order of inputs, in particular highly correlated inputs will result in poorly balanced trees and/or poorly arranged trees.
  • Pre-ordering your rows randomly will result in a more balanced tree and measurable improvements in query performance.

Newsletter