One of the curious aspects of spatial indexes is that the nodes of the tree can overlap, because the objects being indexed themselves also overlap.

That means that if you're searching an area in which two nodes overlap, you'll have to scan the contents of both nodes. For a trivial example above, that's not a big deal, but if an index has a lot of overlap, the extra work can add up to a measurable query time difference.

The PostGIS spatial index is based on a R-tree structure, which naturally tries to split nodes into children that have minimized overlap. As the index is built, it automatically tries to conform to the distribution of the data, with more nodes in dense data areas and fewer in sparse areas.

So problem solved, right?

Well no! Because the index is structuring itself based on the inputs it receives, the order of inputs, particularly the early inputs, can have a strong effect on the quality of the final index.

## Sorted vs Random Inputs

For example, here's a table of roads.

CREATE EXTENSION postgis;

SELECT * FROM roads ORDER BY geom;

If we order the roads in a Hilbert curve (which is the default sort order for geometry) we get a table that is highly spatially autocorrelated.

And then if we visualize the top two levels of the index (using gevel, we can see how the pages look.

This is not actually a terrible tree! Mostly the nodes don't have a lot of overlap, but the shape of the level one pages (the dotted lines) isn't great, and there's some stretching, both vertically and horizontally.

Now, take the same data and randomize it.

SELECT * FROM roads ORDER BY random();

The table now has no spatial correlation at all.

And then if we visualize the top two levels of the index, we can see how the pages look.

The level one nodes are definitely better, but it's hard to see how much better the level two nodes are until you zoom in.

Here's the level one nodes from sorted input.

And here's the level one nodes from randomized input.

The nodes for the index built from randomized inputs are just much more nicely arranged to cover the area with minimized overlap.

So it's pretty. So what?

## Different Trees, Different Performance

In fact, a pretty index performs faster than an ugly one.

To test, we will run a self-join of the table on itself. The query plan is a nested loop join, and the table is 92K records long, so that's 92K index lookups.

# SELECT count(*)
WHERE a.geom && b.geom;

count
--------
505806
(1 row)

Time: 5229.205 ms (00:05.229)

# SELECT count(*)
WHERE a.geom && b.geom;

count
--------
505806
(1 row)

Time: 4629.745 ms (00:04.630)

So querying on the slightly prettier index is 11% faster.

In general, spatial indexes will end up with a better shape if seeded with randomized inputs. We found the same effect in our in-memory K-D trees, and addressed it both with bulk randomization, and a more clever pre-seeding approach.

## Conclusions

• Spatial indexes are sensitive to the order of inputs, in particular highly correlated inputs will result in poorly balanced trees and/or poorly arranged trees.
• Pre-ordering your rows randomly will result in a more balanced tree and measurable improvements in query performance.