Optimizing Polygon Intersections with Subdivide
As part of our effort to map out electricity coverage in the USA, we've been collecting geographic shape data for all of the electricity providers. We've also done something similar for 580 solar installers. This comes out to something like 28.2 million square miles (73 million square kilometers) and 22,707,508 individual points.
There are a number of unique calculations we use this data for, like estimating the average rate/bill in a location using a normalized average of the providers in the area based on territory covered, but a lot of simpler calculations depend on knowing what providers serve what areas (state/county/city). We can get that information by intersecting the provider/solar installer shapes with the state/county/city shapes.
I couldn't find a lot of information on the web on PostGIS optimization (beyond the typical "make an index for it"), so I decided to write up what worked for us in case anyone else is messing around with geometry intersections.
PostGIS is a Postgres (pg) extension written mostly in PLpgSQL and C. We went with it for the back-end because we're already using pg to store our other data, so it was a no-brainer to have everything in the same place. PostGIS also lets you make indices on geometries and has a bunch of useful functions for querying shape data, like ST_Intersects to see if two shapes intersect and ST_Intersection to produce a shape that is the intersection of two other shapes.
For the stats seen in this article, we're using pg 13 and PostGIS 3.1.2 (big upgrade from 2 since queries can be parallelized), on a server with 32 GB of RAM, 12 CPUs, 4 TB HDD, and running Ubuntu 22.04. In order to configure the pg database server, I used this online config generator as a base, then did some adjustments based on observations while it was running. The end result was:
max_connections = 100
shared_buffers = 6GB
effective_cache_size = 18GB
maintenance_work_mem = 1536MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 4
effective_io_concurrency = 2
work_mem = 7864kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 12
max_parallel_workers_per_gather = 4
max_parallel_workers = 12
max_parallel_maintenance_workers = 4
This is by no means an optimal configuration for this application, but it's decent enough for testing.
The naive approach is to write a query that looks something like below (this one is for finding the states served by electricity providers).
EXPLAIN ANALYZESELECTp.provider_id,s.id state_id, ST_AREA(ST_INTERSECTION(s.shape_4326, p.shape_4326)) / ST_AREA(s.shape_4326) AS state_intersection_fraction, ST_AREA(ST_INTERSECTION(s.shape_4326, p.shape_4326)) / ST_AREA(p.shape_4326) AS provider_intersection_fraction, ST_INTERSECTION(s.shape_4326, p.shape_4326) AS shape_4326FROM provider_coverages p INNER JOINstate s ON (ST_INTERSECTS(s.shape_4326, p.shape_4326) AND ST_AREA(ST_INTERSECTION(s.shape_4326, p.shape_4326)) / ST_AREA(p.shape_4326) >0.01)WHEREp.shape_4326ISNOTNULL;
First we check if the state shape intersects with the provider shape, then we check that the area of that intersection is larger than 1% of the total state shape. This is to get rid of any rows where the electricity provider might have their coverage shape running along a state border. Even if you cut one shape from another using ST_Difference, there's a chance you'll still get tiny intersections because we're mapping continuous values to discrete ones and there's always a margin of error. I'd recommend running your test queries with some LIMIT set initially, to speed up the iteration process.
The above query gives the following result.
Gather (cost=1000.14..1992385.32 rows=42207 width=56) (actual time=6283.523..1161123.584 rows=3580 loops=1) Workers Planned: 1 Workers Launched: 1-> Nested Loop (cost=0.14..1987164.62 rows=24828 width=56) (actual time=3252.036..686003.354 rows=1790 loops=2)-> Parallel Seq Scan on provider_coverages p (cost=0.00..1745.97 rows=2197 width=89586) (actual time=178.948..182.021 rows=1534 loops=2)Filter: (shape_4326 ISNOTNULL)-> Index Scan using state_shape_4326_idx on state s (cost=0.14..50.41 rows=1 width=277576) (actual time=48.476..164.418 rows=1 loops=3068) Index Cond: (shape_4326 && p.shape_4326)Filter: (st_intersects(shape_4326, p.shape_4326) AND ((st_area(st_intersection(shape_4326, p.shape_4326, '-1'::double precision)) /st_area(p.shape_4326)) >'0.01'::double precision)) Rows Removed by Filter: 1Planning Time: 0.122 msJIT:Functions: 20Options: Inlining true, Optimization true, Expressions true, Deforming trueTiming: Generation 2.219 ms, Inlining 50.929 ms, Optimization 179.891 ms, Emission 126.596 ms, Total 359.636 msExecution Time: 1161126.203 ms
Which we can then import to this site to get some nice visuals. This took around 20 minutes. Unfortunately there's nothing here that seems to be a quick fix; most of the time is spent on an index scan which I'm not really sure how one would optimize other than messing with hardware configs. Regardless, let's see what effect subdivision has on performance.
ST_Subdivide divides a bigger geometry into smaller parts that have less than some specified number of vertices.
Becomes this, when vertex limit is set to 10.
This speeds up intersection calculations, because with a geometric index, PostGIS can determine faster whether or not shapes intersect. The general flow is
subdivide original shapes -> create indices -> check if intersects -> aggregate all intersected rows into full intersection shape -> insert into table
You do lose some time on creating the subdivided shapes and generating indices, but even including the time spent on that, there can be drastic time reductions compared to the original query. However, it's not really worth doing if all of your shapes are already fairly small. To get a quick idea of whether or not it would work for our data, I checked how many points ours shapes have.
SELECTcount(*) FILTER (WHERE st_npoints(shape_4326) >100000) AS more_than_100k_points,count(*) FILTER (WHERE st_npoints(shape_4326) >10000AND st_npoints(shape_4326) <100000) AS more_than_10k_less_than_100k,count(*) FILTER (WHERE st_npoints(shape_4326) <10000AND st_npoints(shape_4326) >1000) AS more_than_1k_less_than_10k,count(*) FILTER (WHERE st_npoints(shape_4326) <1000) AS less_than_1kFROMstateWHERE shape_4326 ISNOTNULL;SELECTcount(*) FILTER (WHERE st_npoints(shape_4326) >100000) AS more_than_100k_points,count(*) FILTER (WHERE st_npoints(shape_4326) >10000AND st_npoints(shape_4326) <100000) AS more_than_10k_less_than_100k,count(*) FILTER (WHERE st_npoints(shape_4326) <10000AND st_npoints(shape_4326) >1000) AS more_than_1k_less_than_10k,count(*) FILTER (WHERE st_npoints(shape_4326) <1000) AS less_than_1kFROM provider_coverages;
For states there are 0 states with more than 100k points, 38 between 10k and 100k, 14 between 1k and 10k, and 4 under 1k. For providers, it's 47, 151, 576, and 2292 in that same order. Looks like the majority of the shapes are sub 1k, but there are clearly quite a few fairly large shapes that might take up a lot of the processing time. To test this, I ran the first intersection query again, but eliminated any providers with shapes that had more than 1k points. The result? It finished in 16 seconds. Maybe subdiving the larger shapes into smaller ones will speed things up? Spoiler, it does.
Generating the subdivided tables and indices will look something like this:
-- statesCREATE MATERIALIZED VIEW state_shape_4326_sub_divide ASSELECT id, ST_Subdivide(state.shape_4326) AS shape_4326FROMstateWHEREstate.shape_4326 ISNOTNULL;CREATEINDEXON state_shape_4326_sub_divide (id);CREATEINDEXON state_shape_4326_sub_divide USING gist (shape_4326);-- providersCREATE MATERIALIZED VIEW provider_electricity_coverage_shape_4326_sub_divide ASSELECT provider_id AS id, ST_Subdivide(provider_coverages.shape_4326) AS electricity_coverage_shape_4326FROM provider_coverages;CREATEINDEXON provider_electricity_coverage_shape_4326_sub_divide (id);CREATEINDEXON provider_electricity_coverage_shape_4326_sub_divide USING gist (electricity_coverage_shape_4326);
I'm using materialized views so that I can just refresh them whenever data is updated, instead of deleting and recreating, but performance-wise it should be identical to normal tables. Creating the state materialized view takes about 10 seconds and the provider one takes around two minutes. The provider materialized view is going to be used for the county and city intersection checks as well. The indices are pretty much instant. When no secocnd value is specificed in ST_Subdivide, like above, it defaults to the shapes having a maximum of 256 vertices.
The query now has to be updated to first query the subdivided tables and aggregate the intersections between state and provider into one intersection shape for each provider/state ID combo. Then that result can be queried to find which values need to be filtered out due to 0% intersection area errors. You can see the implementation below.
EXPLAIN ANALYZEWITH p_to_s AS (SELECTs.idAS state_id,p.idAS provider_id, st_union(ST_INTERSECTION (s.shape_4326, p.electricity_coverage_shape_4326)) AS intersectionFROM state_shape_4326_sub_divide sINNER JOIN provider_electricity_coverage_shape_4326_sub_divide pON ST_INTERSECTS (s.shape_4326, p.electricity_coverage_shape_4326)ANDNOT ST_TOUCHES (s.shape_4326, p.electricity_coverage_shape_4326)GROUP BYs.id,p.id)SELECTps.provider_id,ps.state_id, ST_AREA(ps.intersection) / ST_AREA(s.shape_4326) AS state_intersection_fraction, ST_AREA (ps.intersection) / ST_AREA(p.shape_4326) AS provider_intersection_fraction, ps.intersectionAS shape_4326FROM p_to_s psINNER JOINstate s ONs.id=ps.state_idINNER JOIN provider_coverages p ONp.provider_id=ps.provider_idWHERE ST_AREA(ST_INTERSECTION(s.shape_4326, p.shape_4326)) / ST_AREA(p.shape_4326) >0.01;
I prefer using CTEs to nesting the query because a) the performance is the same as long as you're above pg 10, and b) I think it's clearer than having a nested query. Anyway the result above produces this output (nice visual results here)
Nested Loop (cost=6775672.86..240044434.58 rows=7720 width=56) (actual time=9148.069..662349.160 rows=3580 loops=1) Join Filter: ((st_area(st_intersection(c.shape_4326, p.shape_4326, '-1'::double precision)) /st_area(p.shape_4326)) >'0.01'::double precision) Rows Removed by Join Filter: 1763-> Merge Join (cost=6775672.58..239447967.90 rows=23161 width=277612) (actual time=9146.823..180331.450 rows=5343 loops=1) Merge Cond: (c_1.id = c.id)->GroupAggregate (cost=6775672.44..239446739.46 rows=78512 width=40) (actual time=9146.810..180325.509 rows=5343 loops=1) Group Key: c_1.id, sp.id-> Gather Merge (cost=6775672.44..7869690.96 rows=9137026 width=3467) (actual time=9145.263..9676.613 rows=419217 loops=1) Workers Planned: 4 Workers Launched: 4->Sort (cost=6774672.38..6780383.02 rows=2284256 width=3467) (actual time=9122.957..9222.123 rows=83843 loops=5) Sort Key: c_1.id, sp.id Sort Method: external merge Disk: 291544kB Worker 0: Sort Method: external merge Disk: 301872kB Worker 1: Sort Method: external merge Disk: 301576kB Worker 2: Sort Method: external merge Disk: 303304kB Worker 3: Sort Method: external merge Disk: 298600kB-> Nested Loop (cost=0.15..2536634.84 rows=2284256 width=3467) (actual time=114.248..8893.441 rows=83843 loops=5)-> Parallel Seq Scan on provider_electricity_coverage_shape_4326_sub_divide sp (cost=0.00..44310.59 rows=49659 width=1463) (actual time=0.008..32.276 rows=39727 loops=5)-> Index Scan using state_shape_4326_sub_divide_shape_4326_idx on state_shape_4326_sub_divide c_1 (cost=0.15..50.18 rows=1 width=2004) (actual time=0.097..0.219 rows=2 loops=198636) Index Cond: (shape_4326 && sp.electricity_coverage_shape_4326)Filter: (st_intersects(shape_4326, sp.electricity_coverage_shape_4326) AND (NOTst_touches(shape_4326, sp.electricity_coverage_shape_4326))) Rows Removed by Filter: 0-> Index Scan using state_pkey on state c (cost=0.14..15.28 rows=59 width=277576) (actual time=0.009..0.181 rows=59 loops=1)-> Index Scan using provider_coverages_pkey on provider_coverages p (cost=0.28..0.32 rows=1 width=89586) (actual time=0.005..0.005 rows=1 loops=5343) Index Cond: (provider_id = sp.id)Planning Time: 4.441 msJIT:Functions: 55Options: Inlining true, Optimization true, Expressions true, Deforming trueTiming: Generation 9.399 ms, Inlining 186.708 ms, Optimization 238.748 ms, Emission 142.896 ms, Total 577.751 msExecution Time: 662386.005 ms
A lot more going on in this query, but what jumps out is that it only took 57% of the time of the previous one. In terms of obvious optimizations, we're having a sort happen on disk, which will be slower than doing it in memory. However, to do that we'd have to increase the work_mem parameter to just over 300 MB, which is far too high for the hardware. work_mem is the memory available for certain operations, but it isn't set per query. Instead it's per hash/sort operation, which is a lot harder to predict and manage, since each query can have multiple sorts/hash operations. Setting it too high can invoke the legendary OOM killer, so I'm not really going to bother messing with it.
Apparently the default option for ST_Subdivide isn't the optimal for every situation (what a surprise /s), so I tried running the queries with it set to 256, 60, and 10, for both providers and states (I kept one at 256 while I changed the other). The results were disappointing.
256 vertices, 659354.127 ms
60 vertices, 683114.961 ms
10 vertices, 1003338.652 ms
256 vertices, 659354.127 ms
60 vertices, 710537.284 ms
10 vertices, 1161937.448 ms
Turns out the default option was optimal for this situation.
EDIT: I was going to end with the above, but I just realized that I could increase the max number of vertices to more than 256! For some reason I had in my head that 256 was the maximum, but it's not. It turns out it doesn't matter, both 512 and 1024 were the same/slightly slower. Looks like 256 is just the sweet spot!