Spatial Indexing Strategies

Production routing systems collapse the moment a spatial predicate degrades into a full graph scan. The difference between millisecond nearest-neighbor resolution and minute-long stalls comes down to one decision: how coordinate data is mapped onto a searchable index structure that the query planner can actually seek. Pick the wrong structure — or attach it to the wrong primitive — and every distance query reads the whole label, memory grows with the node count instead of the search radius, and p99 latency spikes the instant traffic clusters. This guide shows how to choose a spatial index for a Python-driven graph workload, create it correctly, drive it from the async Neo4j driver, and keep it from fragmenting under sustained mutation. It builds on the broader concepts in Spatial Graph Database Fundamentals for Python.

Prerequisites

These examples assume an async Python stack talking to a Neo4j instance with native point support. The CREATE POINT INDEX syntax and point.distance semantics used below are stable on Neo4j 5.x; geohash/H3 work is library-side and version-independent.

Requirement	Minimum version	Notes
Python	3.10+	Union types (`dict | None`) and `match` used in examples
Neo4j	5.13+	Native `point` type, `CREATE POINT INDEX`, index-backed range predicates
neo4j (driver)	5.x	Async driver (`AsyncGraphDatabase`)
shapely	2.0+	Client-side geometry validation before ingestion
python-geohash	0.8+	Prefix encoding for sharded grids (or `h3` 3.7+ for hex grids)

pip install "neo4j>=5.18" "shapely>=2.0" "python-geohash>=0.8" "h3>=3.7"

Before tuning indexes, confirm your graph already follows sound node and edge spatial mapping conventions — coordinates stored as native point values on traversable primitives, not as detached string properties that no index can seek.

Core Concept & Mechanism

A spatial index exists to convert “find things near here” from an O(n) scan into a bounded lookup. The three structures you will actually choose between trade off along the same axis: lookup precision versus write cost versus shardability.

R-tree / native point index. Neo4j’s POINT INDEX fills the role an R-tree plays in other spatial engines, although internally it maps each point onto a space-filling curve stored in a B+tree rather than keeping a tree of nested bounding boxes. The behavior that matters is the same: it excels at range and nearest-neighbor queries over points because the planner reads only the index ranges that overlap the search window. The cost is write amplification: every insert may split index pages, and concurrent bulk upserts contend on those splits.
Quadtree. Recursively partitions space into four quadrants until each leaf holds at most k points. Lookups are predictable for uniform distributions and it answers polygon and multi-scale analytic queries naturally. Under dense urban clustering, though, leaves overflow and the tree fragments — depth grows where points concentrate, so latency becomes data-dependent.
Geohash / H3 grids. Encode a coordinate as a string (geohash) or hex cell id (H3). Proximity becomes shared-prefix matching, which makes these structures trivial to shard, cache, and replicate across regions — a string prefix maps cleanly to a partition key. The trade-off is geometric: cell boundaries are arbitrary, so two points either side of a boundary look “far” by prefix even when they are meters apart, and you must query neighbor cells to be correct.

The mechanism that ties all three to query speed is predicate push-down. When the planner recognizes that a WHERE clause is an index-seekable range or point predicate, it enters the graph through the index and expands only the matching subset. When it cannot — because the predicate sits after an expansion, or the property is not a native point — it falls back to a label scan plus a post-filter, and you pay for every node regardless of radius. The deep trade-off between prefix grids and recursive partitioning is dissected in Implementing Geohash vs Quadtree Indexing in Neo4j.

Use this rough decision tree to pick a primary index — it is not exhaustive, but captures the trade-offs that matter at production scale:
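The same trade-offs can be written down as a few lines of Python. This is only a sketch of the guidance in this guide: the `Workload` fields and the `choose_primary_index` name are hypothetical, and a real decision would also weigh data volume and query mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload descriptor; the field names are illustrative."""
    needs_sharding: bool         # data must be partitioned or replicated by region
    polygon_or_multiscale: bool  # polygon containment or multi-scale analytics
    write_heavy: bool            # sustained high-rate coordinate mutation


def choose_primary_index(w: Workload) -> str:
    """Map the trade-offs described above onto a single recommendation."""
    if w.needs_sharding or w.write_heavy:
        # Prefix grids shard cleanly and tolerate writes, at the cost of
        # neighbor-cell expansion at query time.
        return "geohash_or_h3"
    if w.polygon_or_multiscale:
        return "quadtree"
    # Single-instance point range / nearest-neighbor lookups.
    return "native_point_index"
```

The order of the checks encodes a priority: operational constraints (sharding, write pressure) are tested before query shape, because they are harder to change later.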

Schema & Data Model

The planner can only seek an index that exists, and only when the predicate shape matches the index type. The model below stores point geometry as a native point so distance filters are index-backed, carries a geohash string for shard routing, and keeps a precomputed bbox on linear features so range comparisons run before any expensive distance math.

// Native point index — backs point.distance() range/KNN predicates
CREATE POINT INDEX hub_location IF NOT EXISTS
FOR (h:Hub) ON (h.location);

// Prefix index on the geohash string — backs STARTS WITH shard routing
CREATE TEXT INDEX hub_geohash IF NOT EXISTS
FOR (h:Hub) ON (h.geohash);

// Range index on the edge bounding box corners — cheap pre-filter for segments
CREATE INDEX segment_bbox IF NOT EXISTS
FOR ()-[r:ROAD_SEGMENT]-() ON (r.bbox_min_lat, r.bbox_max_lat);

// Representative shape of the indexed spatial graph
// (:Hub {id, location: point({srid:4326, latitude, longitude}), geohash})
//   -[:ROAD_SEGMENT {bbox_min_lat, bbox_max_lat, bbox_min_lon, bbox_max_lon, length_m}]->
// (:Hub)

Point entities (delivery hubs, charging stations, IoT beacons) want a dense point index; linear features (road segments, transit corridors, pipelines) want bounding-box range indexes on their edges. Attaching a point index to a polyline forces the engine to compute point.distance after the scan, bypassing the index entirely. Which physical structure ultimately backs location — R-tree point index, geohash bucket, or H3 cell — is exactly the selectivity that your graph query planner optimization layer consumes when it costs a plan.
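The bbox properties on a `ROAD_SEGMENT` have to be computed somewhere before the write. A minimal sketch of that step, assuming each segment arrives as a list of `(lat, lon)` vertices; `segment_bbox` is a hypothetical helper, and it does not handle segments that cross the antimeridian.

```python
from typing import Iterable


def segment_bbox(coords: Iterable[tuple[float, float]]) -> dict[str, float]:
    """Return bbox_* properties for a polyline given (lat, lon) vertices."""
    pts = list(coords)
    if len(pts) < 2:
        raise ValueError("a segment needs at least two vertices")
    lats = [lat for lat, _ in pts]
    lons = [lon for _, lon in pts]
    return {
        "bbox_min_lat": min(lats), "bbox_max_lat": max(lats),
        "bbox_min_lon": min(lons), "bbox_max_lon": max(lons),
    }
```

The returned keys match the relationship properties in the schema sketch, so the dict can be passed straight through as query parameters.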

Step-by-Step Implementation

The workflow is: validate geometry client-side, enforce a single CRS, write the node with both a native point and a shard key, then query through a two-stage bounded predicate. We build it as runnable async code.

1. Validate and enforce a single CRS at ingestion

Malformed or mixed-CRS coordinates are the most common source of silently wrong results. Reject them before they cost a network round trip, and always pin WGS 84 (EPSG:4326) so distance math is comparable.

import asyncio
import geohash
from shapely.geometry import Point
from shapely.validation import explain_validity
from neo4j import AsyncGraphDatabase

URI = "neo4j+s://your-cluster-host:7687"
AUTH = ("neo4j", "secure-password")
POOL_CONFIG = {
    "max_connection_pool_size": 50,
    "connection_acquisition_timeout": 5.0,
    "max_transaction_retry_time": 10.0,
}


def validate_coordinate(lat: float, lon: float) -> Point:
    """Reject out-of-CRS or invalid geometry before any graph write."""
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        raise ValueError(f"Coordinate outside EPSG:4326 bounds: {lat}, {lon}")
    pt = Point(lon, lat)  # shapely is (x=lon, y=lat)
    if not pt.is_valid:
        raise ValueError(f"Invalid geometry: {explain_validity(pt)}")
    return pt

2. Write the node with a native point and a shard key

The MERGE writes one canonical node; the SET populates the native point (which the point index seeks) and the geohash prefix (which the text index uses for shard routing). Precision 7 geohashes resolve to roughly 150 m cells — tune precision to your locality target.

async def ingest_spatial_node(driver, node_id: int, lat: float, lon: float):
    validate_coordinate(lat, lon)
    gh = geohash.encode(lat, lon, precision=7)

    query = """
    MERGE (n:Hub {id: $id})
    SET n.location = point({srid: 4326, latitude: $lat, longitude: $lon}),
        n.geohash = $gh,
        n.updated_at = timestamp()
    """
    async with driver.session() as session:
        await session.run(query, id=node_id, lat=lat, lon=lon, gh=gh)


async def main():
    driver = AsyncGraphDatabase.driver(URI, auth=AUTH, **POOL_CONFIG)
    try:
        await ingest_spatial_node(driver, 8842, 40.7128, -74.0060)
    finally:
        await driver.close()


if __name__ == "__main__":
    asyncio.run(main())

3. Query through a two-stage bounded predicate

The single most effective spatial query pattern is: pre-filter with a cheap bounding box the index can seek, then apply exact distance only to the survivors. In dense urban graphs this collapses the candidate set by 90–99% before any point.distance call runs.

import math


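def compute_bounding_box(lat: float, lon: float, radius_km: float) -> dict:
    """Approximate degree-space bounding box on a spherical earth model."""
    R = 6371.0  # mean Earth radius, km
    d_lat = math.degrees(radius_km / R)
    # cos(lat) tends to 0 at the poles; floor it so d_lon stays finite.
    cos_lat = max(math.cos(math.radians(lat)), 1e-6)
    d_lon = min(math.degrees(radius_km / (R * cos_lat)), 180.0)
    # Clamp to valid WGS 84 ranges; a box crossing the antimeridian must be split.
    return {
        "min_lat": max(lat - d_lat, -90.0), "max_lat": min(lat + d_lat, 90.0),
        "min_lon": max(lon - d_lon, -180.0), "max_lon": min(lon + d_lon, 180.0),
    }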


async def find_nearest_hubs(driver, lat: float, lon: float, radius_km: float = 5.0):
    bbox = compute_bounding_box(lat, lon, radius_km)
    query = """
    WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
    MATCH (hub:Hub)
    // point.withinBBox is a predicate the point index can seek; comparisons on
    // the latitude/longitude components of a point are not.
    WHERE point.withinBBox(
            hub.location,
            point({srid: 4326, latitude: $min_lat, longitude: $min_lon}),
            point({srid: 4326, latitude: $max_lat, longitude: $max_lon}))
    WITH hub, point.distance(hub.location, target) AS dist_m
    WHERE dist_m <= ($radius_km * 1000)
    RETURN hub.id AS id, dist_m AS distance_m
    ORDER BY dist_m ASC
    LIMIT 25
    """
    async with driver.session() as session:
        result = await session.run(query, lat=lat, lon=lon, radius_km=radius_km, **bbox)
        return [record.data() async for record in result]

Query Patterns & Variants

The same “near here” intent has several index-able shapes. Pick the one whose anchor matches how the index is structured.

Variant A — bounding box then exact distance (R-tree friendly). The default for native point indexes. The box is expressed with point.withinBBox, which the point index can seek; plain comparisons on the latitude and longitude components are evaluated as a post-filter instead. The distance call only runs on the bounded survivors.

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
MATCH (hub:Hub)
WHERE point.withinBBox(
        hub.location,
        point({srid: 4326, latitude: $min_lat, longitude: $min_lon}),
        point({srid: 4326, latitude: $max_lat, longitude: $max_lon}))
WITH hub, point.distance(hub.location, target) AS dist_m
WHERE dist_m <= $radius_m
RETURN hub.id, dist_m ORDER BY dist_m LIMIT 50
// $min_*/$max_* come from compute_bounding_box(); never compute the box in Cypher.

Variant B — geohash prefix shard routing. When data is partitioned by region, route to the shard with a prefix seek before any geometry runs. Truncate the geohash to the precision whose cell comfortably contains your radius.

MATCH (hub:Hub)
WHERE hub.geohash STARTS WITH $cell_prefix
WITH hub, point.distance(hub.location, point($target)) AS dist_m
WHERE dist_m <= $radius_m
RETURN hub.id, dist_m ORDER BY dist_m LIMIT 50
// Query the 8 neighbor prefixes too, or points across a cell border are missed.
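
Computing the neighbor prefixes is a client-side job. The sketch below implements the standard geohash bisection in pure Python so it runs without dependencies; in practice a geohash library would normally supply this. The function names `encode`, `cell_bounds`, and `cell_and_neighbors` are illustrative, not part of any driver API.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def encode(lat: float, lon: float, precision: int) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, value, bits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = coord >= mid
        rng[0 if bit else 1] = mid   # keep the half that contains coord
        value = (value << 1) | bit
        use_lon = not use_lon
        bits += 1
        if bits == 5:                # 5 bits per base-32 character
            chars.append(_BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def cell_bounds(cell: str) -> tuple[float, float, float, float]:
    """Decode a geohash into (min_lat, max_lat, min_lon, max_lon)."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in cell:
        idx = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (idx >> shift) & 1
            rng = lon_rng if use_lon else lat_rng
            rng[0 if bit else 1] = (rng[0] + rng[1]) / 2
            use_lon = not use_lon
    return lat_rng[0], lat_rng[1], lon_rng[0], lon_rng[1]


def cell_and_neighbors(cell: str) -> list[str]:
    """Return the cell plus its surrounding cells at the same precision."""
    min_lat, max_lat, min_lon, max_lon = cell_bounds(cell)
    c_lat, c_lon = (min_lat + max_lat) / 2, (min_lon + max_lon) / 2
    height, width = max_lat - min_lat, max_lon - min_lon
    out = set()
    for d_lat in (-1, 0, 1):
        for d_lon in (-1, 0, 1):
            lat = c_lat + d_lat * height
            if not -90.0 <= lat <= 90.0:
                continue  # there is no cell beyond the poles
            # Wrap longitude across the antimeridian.
            lon = (c_lon + d_lon * width + 180.0) % 360.0 - 180.0
            out.add(encode(lat, lon, len(cell)))
    return sorted(out)
```

Two points a couple of meters apart on either side of the equator and prime meridian encode to `s000000` and `7zzzzzz`: no shared prefix at all, yet each appears in the other's neighbor list. Pass the resulting prefixes as a list parameter and match against each one.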

Variant C — KNN without a fixed radius. When the question is “the k closest” rather than “everything within r”, drop the radius guard and let ORDER BY ... LIMIT k do the work — but keep the bounding box so the index still bounds the scan. This overlaps directly with k-nearest-neighbor routing, and the predicate shapes mirror those in distance filter query patterns.

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
MATCH (hub:Hub)
WHERE point.withinBBox(
        hub.location,
        point({srid: 4326, latitude: $min_lat, longitude: $min_lon}),
        point({srid: 4326, latitude: $max_lat, longitude: $max_lon}))
RETURN hub.id, point.distance(hub.location, target) AS dist_m
ORDER BY dist_m ASC LIMIT $k
// Widen the box and re-run if fewer than $k rows return at the edge of coverage.
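
The widen-and-retry loop can be sketched as a small async helper. It assumes a hypothetical `fetch(radius_km)` coroutine that runs the boxed query for that radius and returns rows shaped like `{"id": ..., "dist_m": ...}`; the doubling schedule and the `max_km` cap are arbitrary illustrative choices.

```python
import asyncio
from typing import Awaitable, Callable

Row = dict  # expected keys: "id", "dist_m"
Fetch = Callable[[float], Awaitable[list[Row]]]


async def knn_with_widening(fetch: Fetch, k: int,
                            start_km: float = 1.0,
                            max_km: float = 64.0) -> list[Row]:
    """Double the search radius until k rows fall inside the inscribed circle."""
    radius_km = start_km
    while True:
        rows = await fetch(radius_km)
        # Rows in the box corners can be farther away than radius_km, and a
        # nearer hub may sit just outside the box, so only rows within the
        # circle of radius_km are known to be true nearest neighbors.
        inside = sorted(
            (r for r in rows if r["dist_m"] <= radius_km * 1000),
            key=lambda r: r["dist_m"],
        )
        if len(inside) >= k or radius_km >= max_km:
            return inside[:k]
        radius_km = min(radius_km * 2, max_km)
```

Filtering to the inscribed circle is the design choice that keeps the answer exact: a box that happens to contain k rows does not prove they are the k closest.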

Performance Tuning

Profiling is the whole game. EXPLAIN returns the plan without running it (validate plan shape in CI); PROFILE runs the query and annotates each operator with real db hits and rows. Read the plan bottom-up and find the first operator whose rows is far larger than the final result — that is where an index or a tighter predicate belongs.

Confirm the seek, not the scan. A healthy spatial query shows a NodeIndexSeekByRange at the bottom of the plan, with the point index named in the operator details. If you see a NodeByLabelScan feeding a Filter on point.distance, the predicate is not pushing down: move it onto the anchor and verify the index covers the property.
Refresh statistics after bulk loads. Stale histograms make the planner misjudge selectivity and skip the index. Recompute after large ingestion or coordinate rewrites.
Keep plans cacheable. Always parameterize. Literal coordinates baked into the query string force recompilation and thrash the plan cache; pass $min_lat etc. as parameters with stable types.
Budget memory for the hot region. Size the page cache to hold the working set’s nodes and the point index, so seeks stay in memory. Mirror this client-side with a bounded max_connection_pool_size.
Batch writes away from reads. Run index rebuilds and bulk upserts in maintenance windows; node-split churn during heavy writes directly degrades read selectivity.

These planner-side concerns connect to the broader profiling and memory workflow in Cypher performance tuning. A practical loop: capture PROFILE, find the widest operator, add the index or tighten the predicate that narrows it, re-profile, and confirm db hits dropped.

Edge Cases & Gotchas

Mixed CRS coordinates. A geographic point({latitude, longitude}) (SRID 4326) and a Cartesian point({x, y}) (SRID 7203) are not comparable; point.distance across SRIDs returns null, and a null predicate silently drops the row rather than erroring. Normalize CRS at ingestion and assert the SRID before querying.
Geohash boundary misses. Two points meters apart can land in different cells with different prefixes. A prefix-only query will miss the neighbor — always expand to the surrounding cells (8 for a square grid, 6 for H3) before computing distance.
Coordinate precision traps. Float rounding on dense grids can make two segment endpoints “almost equal”, creating phantom dead-ends or duplicate nodes. Snap to a fixed tolerance during node and edge spatial mapping, not at query time.
Point index attached to the wrong primitive. A point index on a node does nothing for a distance predicate evaluated over edge geometry. Index the property the planner actually filters on.
Index fragmentation under churn. Uneven leaf splits after sustained writes inflate I/O and cache misses. Schedule periodic rebuilds in low-traffic windows and monitor index hit ratios via engine telemetry.
Driver timeout vs. unbounded scan. A query that falls back to a full scan holds its connection far longer than a seek would, so under load other callers wait past connection_acquisition_timeout and the pool is exhausted. A timeout storm during peak traffic is usually a missing-seek symptom, not a pool-size problem.
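
For the coordinate precision trap, snapping at ingestion can be as simple as rounding to a fixed number of decimals. This is a minimal sketch; the helper names and the 6-decimal default (roughly 0.1 m of latitude) are assumptions to adjust per dataset.

```python
def snap(lat: float, lon: float, decimals: int = 6) -> tuple[float, float]:
    """Round a coordinate onto a fixed grid so near-equal endpoints coincide."""
    return round(lat, decimals), round(lon, decimals)


def dedupe_endpoints(points: list[tuple[float, float]],
                     decimals: int = 6) -> dict[tuple[float, float], tuple[float, float]]:
    """Map each snapped coordinate to the first raw point that produced it."""
    canonical: dict[tuple[float, float], tuple[float, float]] = {}
    for lat, lon in points:
        canonical.setdefault(snap(lat, lon, decimals), (lat, lon))
    return canonical
```

Rounding is not a full tolerance merge: two points that straddle a rounding boundary still land on different grid nodes, so a stricter pipeline would follow this with a nearest-node check.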

Verification & Testing

An index change is only safe if the indexed query returns the same rows as the naive one, just faster. Assert both correctness (the right hubs, in the right order) and plan shape (a seek, not a scan) — a regression that turns a seek back into a scan changes only latency, so a correctness test alone will not catch it.

import pytest
from neo4j import AsyncGraphDatabase

SEED = """
CREATE (a:Hub {id: 1, location: point({srid:4326, latitude: 47.60, longitude: -122.33})})
CREATE (b:Hub {id: 2, location: point({srid:4326, latitude: 47.62, longitude: -122.35})})
CREATE (c:Hub {id: 3, location: point({srid:4326, latitude: 47.95, longitude: -122.90})})
"""


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_bounded_query_matches_bruteforce():
    # The context manager closes the driver even when an assertion fails.
    async with AsyncGraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "test")
    ) as driver:
        async with driver.session(database="neo4j") as s:
            await s.run("MATCH (n) DETACH DELETE n")
            await s.run(SEED)
            await s.run(
                "CREATE POINT INDEX hub_location IF NOT EXISTS FOR (h:Hub) ON (h.location)"
            )
            # Wait until the new index is populated and online.
            await s.run("CALL db.awaitIndexes()")

            # Ground truth: brute-force distance over all hubs, no bounding box.
            truth = await (await s.run(
                """
                WITH point({srid:4326, latitude: 47.60, longitude: -122.33}) AS t
                MATCH (h:Hub)
                WITH h, point.distance(h.location, t) AS d WHERE d <= 5000
                RETURN h.id AS id ORDER BY d
                """
            )).values()

            # Indexed two-stage query under test.
            got = await (await s.run(
                """
                WITH point({srid:4326, latitude: 47.60, longitude: -122.33}) AS t
                MATCH (h:Hub)
                WHERE point.withinBBox(
                        h.location,
                        point({srid:4326, latitude: 47.55, longitude: -122.40}),
                        point({srid:4326, latitude: 47.65, longitude: -122.28}))
                WITH h, point.distance(h.location, t) AS d WHERE d <= 5000
                RETURN h.id AS id ORDER BY d
                """
            )).values()

    assert got == truth, "bounded query must match brute-force result set"

Pair this with a plan-shape assertion: run EXPLAIN on the bounded query and inspect the plan from result.consume() to assert it contains a point index seek rather than a label scan.
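
A plan-shape check can be kept independent of the database by writing it against the plan tree itself. The sketch below assumes the plan is a nested dict with `operatorType` and `children` keys, which is how the Python driver is expected to expose `summary.plan`; verify the key names against your driver version. `plan_operators` and `assert_index_seek` are hypothetical helpers.

```python
def plan_operators(plan: dict) -> list[str]:
    """Flatten a plan tree into a list of operator names, depth-first."""
    ops = [plan.get("operatorType", "")]
    for child in plan.get("children", []):
        ops.extend(plan_operators(child))
    return ops


def assert_index_seek(plan: dict) -> None:
    """Fail if the plan has no index seek or contains a label scan."""
    ops = plan_operators(plan)
    # Substring and prefix checks tolerate a runtime suffix such as "@neo4j".
    assert any("IndexSeek" in op for op in ops), f"no index seek in plan: {ops}"
    assert not any(op.startswith("NodeByLabelScan") for op in ops), (
        f"label scan in plan: {ops}"
    )


# Usage inside an async test (session and query come from the test above):
#   summary = await (await session.run("EXPLAIN " + query)).consume()
#   assert_index_seek(summary.plan)
```

Because the helpers take a plain dict, they can be unit-tested with a hand-written plan and reused across every spatial query in the suite.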

FAQ

R-tree, geohash, or quadtree — which should I default to?

Default to Neo4j’s native point index (the R-tree role in this comparison) for point range and nearest-neighbor queries on a single instance: it gives index-backed point.distance with no extra moving parts. Reach for geohash or H3 when your dominant concern is sharding, cache locality, or cross-region replication, since string prefixes map cleanly to partitions. Choose a quadtree when you need polygon containment or multi-scale analytics rather than point proximity.

Why does my point.distance query still do a full label scan?

Almost always the predicate is not index-seekable as written, or the property is not a native point. Confirm location is stored as point({srid:4326, ...}), that a POINT INDEX exists on it, and that the filter is a point.distance or point.withinBBox predicate placed on the anchor before any expansion. Run PROFILE and check for a NodeIndexSeekByRange that names the point index at the base of the plan; a NodeByLabelScan feeding a Filter means push-down failed.

What geohash precision should I use for a given radius?

Match the cell size to your query radius so a small set of cells covers the search window. Precision 6 is roughly 1.2 km, precision 7 roughly 150 m, precision 8 roughly 38 m. Pick the precision whose cell comfortably contains your typical radius, then query that cell plus its neighbors so points near a boundary are not missed.
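
Those cell sizes follow from how geohash splits its bits: each character adds 5 bits, alternating longitude then latitude. A short sketch of the arithmetic, assuming a spherical earth of radius 6,371 km; `geohash_cell_size_m` and `precision_for_radius` are illustrative names.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def geohash_cell_size_m(precision: int, lat: float = 0.0) -> tuple[float, float]:
    """Return (width_m, height_m) of a geohash cell at the given latitude."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2   # longitude gets the extra bit
    lat_bits = total_bits // 2
    width_deg = 360.0 / (1 << lon_bits)
    height_deg = 180.0 / (1 << lat_bits)
    m_per_deg = math.pi * EARTH_RADIUS_M / 180.0
    width_m = width_deg * m_per_deg * math.cos(math.radians(lat))
    return width_m, height_deg * m_per_deg


def precision_for_radius(radius_m: float, lat: float = 0.0,
                         max_precision: int = 12) -> int:
    """Finest precision whose smaller cell side is still >= radius_m."""
    best = 1
    for p in range(1, max_precision + 1):
        if min(geohash_cell_size_m(p, lat)) >= radius_m:
            best = p
    return best
```

Requiring the smaller cell side to be at least the radius means the cell plus its eight neighbors always covers the full search circle, wherever the query point sits inside its cell. Cell width shrinks with latitude, so pass the latitude of the region you serve.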

How do I stop a spatial index from fragmenting under heavy writes?

Separate write and read pressure: batch bulk upserts into maintenance windows so node-split churn does not collide with live queries, and schedule periodic index rebuilds during low-traffic periods. Monitor index hit ratios and leaf depth via engine telemetry, and consider a write-tolerant geohash grid if your workload is genuinely write-heavy rather than read-heavy.

Should I shard spatial data, and by what key?

Shard once a single instance can no longer hold the hot region’s nodes and index in the page cache. Geohash prefix or H3 resolution level is the natural shard key because it aligns physical storage with query locality, which minimizes cross-node hops during nearest-neighbor resolution. Avoid sharding by an unrelated key (such as ingestion time), since it scatters spatially adjacent points across partitions.
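
As a rough illustration of prefix-based routing, the sketch below maps a geohash prefix to a shard number. The function name, the prefix length, and the use of CRC32 are all assumptions for the example, not a Neo4j feature.

```python
import zlib


def shard_for(geohash: str, prefix_len: int, n_shards: int) -> int:
    """Route a hub to a shard using only its geohash prefix."""
    if len(geohash) < prefix_len:
        raise ValueError("geohash shorter than the shard prefix")
    prefix = geohash[:prefix_len]
    # crc32 is stable across processes, unlike Python's salted built-in hash().
    return zlib.crc32(prefix.encode("ascii")) % n_shards
```

Every hub sharing a prefix lands on the same shard, which keeps a neighborhood query local. Hashing does not keep adjacent prefixes together, so a deployment that needs neighbor cells co-located would use an explicit prefix-to-shard table instead.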

Implementing Geohash vs Quadtree Indexing in Neo4j — a hands-on comparison of prefix grids versus recursive partitioning.
R-tree vs Geohash vs Quadtree for Road Graphs — a workload-driven decision guide for picking an index on a road network.
Node and Edge Spatial Mapping — storing geometry as native points so it can be indexed.
Graph Query Planner Optimization — making the planner consume the selectivity your index exposes.
Distance Filter Query Patterns — predicate shapes that resolve against spatial indexes.
K-Nearest-Neighbor Routing — KNN resolution built on a bounded spatial scan.

This guide is part of Spatial Graph Database Fundamentals for Python.

For authoritative reference on native spatial functions and geometry standards, consult the Neo4j Cypher Spatial Functions Documentation and the Shapely Geometry Validation Manual.
