Spatial Join Techniques for Production Graph Networks

A spatial join correlates two sets of geometry-bearing nodes by a proximity predicate — attaching delivery points to the hubs that serve them, snapping sensor readings to the road segment they sit on, binding incidents to the zones that contain them. Done naively in a graph database, it is the single most reliable way to take a healthy cluster down: the planner pairs every node on the left with every node on the right, the result set explodes quadratically, the transaction log balloons, and the JVM heap or native page cache is exhausted before the join ever finishes. The failure is silent in staging — a few thousand nodes join fine — and catastrophic the first time a real metro-scale dataset lands. This guide shows how to build spatial joins that stay index-bound: how the two-phase probe works internally, how to model the data so the index can seek it, the async Python that drives the join in bounded batches, the query variants you will actually reach for, and the precision and cardinality traps that corrupt results or melt memory. It is one of the core techniques in Cypher Spatial Queries & Pathfinding Patterns.

Prerequisites

These examples assume an async Python service talking to a Neo4j instance with native point support. The point.distance() semantics and index-backed range predicates are stable on Neo4j 5.x; the bounding-box arithmetic is pure client-side Python and version-independent. The optional Graph Data Science (GDS) path in the variants section needs the GDS plugin installed on the server.

Requirement	Minimum version	Notes
Python	3.10+	Union types and `dataclass(frozen=True)` used in examples
Neo4j	5.13+	Native `point` type, `CREATE POINT INDEX`, index-backed range predicates
neo4j (driver)	5.x	Async driver (`AsyncGraphDatabase`), native point serialization
Neo4j GDS	2.6+	Only for the `gds.knn` join variant; optional
pytest / pytest-asyncio	8.0+ / 0.23+	For the correctness assertions in the testing section

pip install "neo4j>=5.18" "pytest>=8.0" "pytest-asyncio>=0.23"

A spatial join only stays cheap if both sides of the correlation are modelled for it. That means coordinates stored as native point values on the primitives you actually probe — the convention covered in node and edge spatial mapping — and the right spatial indexing strategy backing the location property on each label. Without an index on at least the inner (probed) side, every join below collapses to a full label scan no matter how tight the predicate reads.

Core Concept & Mechanism

A spatial join in a graph database is fundamentally different from a raster overlay or a PostGIS ST_DWithin table join. There is no intermediate result table and no materialized geometry layer; the engine resolves the proximity predicate directly against node properties and writes the correlation back as relationships. That makes the join a graph-write operation, and it inherits the economics of graph writes: every surviving pair becomes an edge, so cardinality is the variable that dictates whether the operation costs megabytes or gigabytes.

Neo4j stores geography with the native point() type, which defaults to the WGS 84 ellipsoid (SRID 4326) for latitude/longitude coordinates. The point.distance() function returns the great-circle distance in meters between two such points. The trap is identical to the one in distance filter query patterns: point.distance() is a computed function, not an indexable property. Used alone in a WHERE, it gives the planner no seekable range, so it falls back to a label scan and evaluates the function once per candidate. In a join that means once per pair — the dreaded O(L × R) Cartesian product across the two labels.

The mechanism that defuses this is a two-phase probe. For each node on the driving (outer) side, phase one constrains the inner side with a coordinate-aligned bounding box: four range comparisons on location.latitude and location.longitude that the native point index seeks directly via PointIndexSeekByRange. Phase two applies exact point.distance() only to the bounded survivors, clipping the square box corners back to a true circle. In a dense metro graph this collapses the per-driver candidate set by 90–99% before a single trigonometric call runs, which is the difference between an index seek and a full scan repeated for every outer node.
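The two phases can be mimicked in plain Python to see how the box prunes candidates before any exact distance is computed. This is a minimal sketch, not what the database executes: `two_phase_probe`, the mean Earth radius, and the sample coordinates are all assumptions made for illustration, and a linear scan stands in for the index seek.

```python
from math import asin, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters on a sphere."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    h = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def two_phase_probe(center, points, radius_m):
    """Return (box_survivors, circle_survivors) for one driving node."""
    lat, lon = center
    lat_delta = degrees(radius_m / EARTH_RADIUS_M)
    lon_delta = lat_delta / cos(radians(lat))  # adequate away from the poles
    # Phase 1: four cheap range comparisons (the index seek in the database).
    boxed = [
        p for p in points
        if lat - lat_delta <= p[0] <= lat + lat_delta
        and lon - lon_delta <= p[1] <= lon + lon_delta
    ]
    # Phase 2: exact distance, evaluated only on the box survivors.
    inside = [p for p in boxed if haversine_m(lat, lon, p[0], p[1]) <= radius_m]
    return boxed, inside


hub = (52.5200, 13.4050)  # hypothetical hub location
candidates = [
    (52.5210, 13.4060),  # close by: inside both box and circle
    (52.5600, 13.4700),  # near a box corner: inside the box, outside the circle
    (48.8566, 2.3522),   # far away: rejected by the box alone
]
boxed, inside = two_phase_probe(hub, candidates, radius_m=5_000.0)
print(len(boxed), len(inside))
```

The corner point is the case phase two exists for: it passes all four range comparisons but lies more than 5 km from the hub.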

What makes phase one seekable is predicate push-down: the planner recognizes the bounding-box range comparison as index-descendable and enters the inner label through a seek rather than a scan. The deeper plan-selection and cost-model details live in graph query planner optimization; for a join, the operative rule is simply that the box must be expressed as plain range comparisons on the indexed coordinate components — anything computed per-row defeats the seek.

Schema & Data Model

Both labels participating in the join need a native point on the property you probe, and both should carry an indexed identity property so you can assert correctness and re-run the join idempotently. Store coordinates as point({latitude, longitude}) rather than detached lat/lon numbers — only the native type is index-seekable, and only it normalizes onto the WGS 84 CRS automatically.

// Identity + uniqueness on both sides of the join
CREATE CONSTRAINT hub_id IF NOT EXISTS
FOR (h:LogisticsHub) REQUIRE h.id IS UNIQUE;

CREATE CONSTRAINT point_id IF NOT EXISTS
FOR (p:DeliveryPoint) REQUIRE p.id IS UNIQUE;

// Point indexes on BOTH labels: the inner side must be seekable; the outer
// side benefits when the join is driven from a sub-region rather than all rows.
CREATE POINT INDEX hub_location IF NOT EXISTS
FOR (h:LogisticsHub) ON (h.location);

CREATE POINT INDEX point_location IF NOT EXISTS
FOR (p:DeliveryPoint) ON (p.location);

The SERVES relationship is the join output. Give it a distance_m property so downstream routing can rank by proximity without recomputing, and decide deliberately whether the join is one-to-many (each point served by its single nearest hub) or many-to-many (each point linked to every hub within radius). The cardinality of that decision is the dominant cost driver, not the geometry. For correlating graph nodes against geometry that arrives from outside the database — a fresh OSM extract or a vendor feed — the same probe is the join step at the tail of OSM data ingestion pipelines and POI enrichment workflows.

Step-by-Step Implementation

The implementation drives the join from Python: compute bounding boxes client-side, stream the inner geometry in bounded batches, and let the engine seek. Synchronous execution in one monolithic transaction is the anti-pattern — it holds locks on the spatial index for the whole run and exhausts the connection pool under any concurrency.

Step 1 — Compute the bounding box in Python, not Cypher. A box derived inside Cypher with per-row trigonometry cannot be pushed down. Compute the four corners client-side from a center and radius, then pass them as scalar parameters so the range predicate stays index-seekable and the plan stays cacheable.

from dataclasses import dataclass
from math import cos, degrees, radians

EARTH_RADIUS_M = 6_371_000.0

@dataclass(frozen=True)
class BBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

def bounding_box(lat: float, lon: float, radius_m: float) -> BBox:
    """Latitude/longitude envelope around a radius_m circle.

    Latitude degrees are ~constant length; longitude degrees shrink by
    cos(latitude), so the lon delta must be widened away from the equator.
    Near the poles the widening is clamped to 180 degrees (all longitudes).
    Longitudes are not wrapped here; antimeridian handling is a separate step.
    """
    lat_delta = degrees(radius_m / EARTH_RADIUS_M)
    cos_lat = cos(radians(lat))
    # Clamp instead of dividing by a near-zero cosine at the poles.
    lon_delta = 180.0 if cos_lat * 180.0 <= lat_delta else lat_delta / cos_lat
    return BBox(
        max(lat - lat_delta, -90.0), min(lat + lat_delta, 90.0),
        lon - lon_delta, lon + lon_delta,
    )

Step 2 — Express the join as a two-phase Cypher query. Phase one filters the inner label by the box (index seek); phase two clips with point.distance(). The UNWIND lets one round trip drive many outer nodes, amortizing latency.

UNWIND $drivers AS d
MATCH (hub:LogisticsHub {id: d.hub_id})
// Phase 1: index-seekable bounding-box pre-filter on the inner label
MATCH (c:DeliveryPoint)
WHERE c.location.latitude  >= d.min_lat AND c.location.latitude  <= d.max_lat
  AND c.location.longitude >= d.min_lon AND c.location.longitude <= d.max_lon
  AND c.status = 'active'
// Phase 2: exact great-circle distance only on the bounded survivors
// d must be carried through WITH, or d.max_distance_m is out of scope below
WITH d, hub, c, point.distance(hub.location, c.location) AS dist_m
WHERE dist_m <= d.max_distance_m
MERGE (hub)-[s:SERVES]->(c)
SET   s.distance_m = dist_m

MERGE rather than CREATE makes the join idempotent — re-running it updates distance_m instead of duplicating edges, which matters when the pipeline retries after a transient failure.

Step 3 — Drive it from the async driver in bounded batches. Each batch is its own transaction so locks are released frequently and memory stays flat. The driver manages the connection lifecycle; you own the batching and the bounding-box precomputation.

import asyncio
import neo4j
from typing import Iterable
from dataclasses import dataclass

@dataclass(frozen=True)
class SpatialJoinConfig:
    uri: str
    auth: tuple[str, str]
    batch_size: int = 1_000
    max_distance_m: float = 5_000.0
    pool_size: int = 12

JOIN_CYPHER = """
UNWIND $drivers AS d
MATCH (hub:LogisticsHub {id: d.hub_id})
MATCH (c:DeliveryPoint)
WHERE c.location.latitude  >= d.min_lat AND c.location.latitude  <= d.max_lat
  AND c.location.longitude >= d.min_lon AND c.location.longitude <= d.max_lon
  AND c.status = 'active'
WITH d, hub, c, point.distance(hub.location, c.location) AS dist_m
WHERE dist_m <= d.max_distance_m
MERGE (hub)-[s:SERVES]->(c)
SET   s.distance_m = dist_m
RETURN count(s) AS edges
"""

def _driver_rows(hubs: list[dict], cfg: SpatialJoinConfig) -> list[dict]:
    rows = []
    for h in hubs:
        box = bounding_box(h["lat"], h["lon"], cfg.max_distance_m)
        rows.append({
            "hub_id": h["id"],
            "min_lat": box.min_lat, "max_lat": box.max_lat,
            "min_lon": box.min_lon, "max_lon": box.max_lon,
            "max_distance_m": cfg.max_distance_m,
        })
    return rows

async def run_spatial_join(cfg: SpatialJoinConfig, hubs: list[dict]) -> int:
    driver = neo4j.AsyncGraphDatabase.driver(
        cfg.uri, auth=cfg.auth, max_connection_pool_size=cfg.pool_size
    )
    total = 0
    try:
        for i in range(0, len(hubs), cfg.batch_size):
            batch = _driver_rows(hubs[i:i + cfg.batch_size], cfg)
            async with driver.session() as session:
                summary = await session.execute_write(_apply_batch, batch)
                total += summary
    finally:
        await driver.close()
    return total

async def _apply_batch(tx: neo4j.AsyncManagedTransaction, drivers: list[dict]) -> int:
    result = await tx.run(JOIN_CYPHER, drivers=drivers)
    record = await result.single()
    return record["edges"] if record else 0

Using session.execute_write rather than a hand-rolled begin_transaction gives you the driver’s built-in retry on transient (deadlock, leader-switch) errors for free, while still bounding each unit of work to one batch.

Query Patterns & Variants

Three join shapes cover almost every production need. Pick by the cardinality you actually want, because that — not syntax — sets the cost.

Variant A — Radius (many-to-many). Every inner node within max_distance_m of a driver becomes an edge. This is the query in the implementation above. Use it for coverage and service-zone modelling where one point legitimately belongs to several hubs. Watch cardinality: in a dense center a single hub can match tens of thousands of points, so always pair it with a sane radius and a status predicate.

Variant B — Nearest-one (one-to-many). Snap each inner node to its single closest driver — the canonical “assign each delivery to its nearest hub” join. Keep the bounding box for the seek, then order and limit per driver group.

UNWIND $drivers AS d
MATCH (hub:LogisticsHub {id: d.hub_id})
MATCH (c:DeliveryPoint)
WHERE c.location.latitude  >= d.min_lat AND c.location.latitude  <= d.max_lat
  AND c.location.longitude >= d.min_lon AND c.location.longitude <= d.max_lon
WITH c, hub, point.distance(hub.location, c.location) AS dist_m
ORDER BY dist_m ASC
WITH c, head(collect({hub: hub, dist: dist_m})) AS nearest
// MERGE patterns need bound variables, not map lookups
WITH c, nearest.hub AS hub, nearest.dist AS dist_m
MERGE (hub)-[s:SERVES]->(c)
SET   s.distance_m = dist_m

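Note that the collect + head idiom selects the minimum per inner node without a correlated subquery. The minimum is taken only over the hubs in the current $drivers batch, so every candidate hub for a point must travel in the same batch for the result to be the true nearest. For a true k-nearest assignment (closest k hubs per point), this is exactly the boundary where you switch to the dedicated k-nearest-neighbor routing technique rather than over-extending a join.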
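For testing, the per-point minimum that the collect + head idiom selects can be mirrored in plain Python. The sketch below is illustrative only: `nearest_hub_per_point` and the sample rows are hypothetical, and the input is assumed to be (point_id, hub_id, dist_m) tuples as the query would hold them just before the aggregation.

```python
def nearest_hub_per_point(rows):
    """Map each point id to its closest (hub_id, dist_m) among the given rows."""
    nearest = {}
    for point_id, hub_id, dist_m in rows:
        best = nearest.get(point_id)
        # Keep the row with the smallest distance seen so far for this point.
        if best is None or dist_m < best[1]:
            nearest[point_id] = (hub_id, dist_m)
    return nearest


rows = [
    ("p1", "hubA", 1200.0),
    ("p1", "hubB", 800.0),
    ("p2", "hubA", 300.0),
]
assignment = nearest_hub_per_point(rows)
print(assignment)
```

Like the Cypher version, the oracle is only as complete as its input: a hub that never appears in the rows cannot win, which is why batching by region matters for this variant.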

Variant C — GDS KNN join. When you need k-nearest across the entire graph at once rather than from a hand-supplied driver set, project both labels and let GDS compute a similarity join on the coordinate vector. This trades the per-batch control of the Cypher path for one bulk parallel pass — appropriate for periodic full rebuilds, not incremental updates.

CALL gds.graph.project(
  'serve-join',
  ['LogisticsHub', 'DeliveryPoint'],
  '*',
  { nodeProperties: ['embedding'] }   // [latitude, longitude] as a 2-vector
)
YIELD graphName;

CALL gds.knn.write('serve-join', {
  nodeProperties: {embedding: 'EUCLIDEAN'},   // float lists default to cosine otherwise
  topK: 3,
  writeRelationshipType: 'SERVES',
  writeProperty: 'similarity',
  sampleRate: 0.8
})
YIELD relationshipsWritten;

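GDS KNN works on a similarity metric over the property vector, not great-circle meters. For a list-of-floats property the default metric is cosine, which compares direction rather than separation, so request Euclidean for coordinate vectors; even then the distance is in raw degrees, not meters. Treat the result as an approximation of geographic nearest, a fast first pass, and validate against point.distance() if exact radii matter.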

Performance Tuning

Profile every join before trusting it. Run PROFILE on a representative batch and read the plan from the bottom up: the inner-label access must be an index seek on the point index (PointIndexSeekByRange in this guide's plans; the exact operator name can differ between Neo4j versions). If you see NodeByLabelScan feeding a Filter, push-down failed. The usual causes are a missing point index, coordinates stored as raw numbers instead of point, or a bounding box computed per-row in Cypher instead of passed as parameters. If the component-wise comparisons still do not produce a seek, express the same box with point.withinBBox(), a predicate that point indexes support directly. This is the same PROFILE-driven loop documented in Cypher performance tuning.

The cost of a correctly index-bound join is dominated by output cardinality. For a radius join the expected work per driver scales with the candidate count inside the box, roughly:

$$ C_\text{driver} \approx \rho \cdot \pi r^2 \cdot \frac{4}{\pi} = \rho \cdot 4 r^2 $$

where $\rho$ is inner-node density (nodes per m²) and $r$ is the radius. The box has $\frac{4}{\pi}$ times the area of the inscribed circle, which is why the phase-two distance clip removes roughly 21% of box survivors ($1 - \frac{\pi}{4} \approx 0.215$). Total join cost is $C_\text{driver}$ summed over all drivers, so halving the radius quarters the work. The practical levers:

Right-size the batch. Start at 1,000 drivers per transaction and tune against transaction-log growth. Larger batches amortize round-trip latency but hold locks longer and raise peak heap.
Keep the point index resident. Spatial joins are page-cache hungry; size server.memory.pagecache.size to hold the inner label’s index, and watch server.memory.heap.max_size (the 5.x name for dbms.memory.heap.max_size) during bulk runs.
Partition drivers by geography. Grouping batches by region keeps each transaction’s index reads spatially local, reducing page-cache churn — the same partitioning that high-throughput ingestion uses.
Add a selective property predicate early. A status = 'active' or tenant_id filter on the inner label shrinks survivors before the distance call and, if backed by its own index, can intersect with the point seek.
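As a rough worked example of the per-driver cost estimate above, the sketch below plugs an assumed density into the two area formulas. The function names and the density of 200 points per square kilometer are illustrative assumptions, not measurements.

```python
from math import pi


def box_candidates(density_per_m2: float, radius_m: float) -> float:
    """Expected inner nodes inside the 2r x 2r bounding box: rho * 4 r^2."""
    return density_per_m2 * 4.0 * radius_m ** 2


def circle_survivors(density_per_m2: float, radius_m: float) -> float:
    """Expected inner nodes inside the radius itself: rho * pi r^2."""
    return density_per_m2 * pi * radius_m ** 2


rho = 200 / 1_000_000  # assumed: 200 delivery points per square kilometer
full = box_candidates(rho, 5_000.0)
half = box_candidates(rho, 2_500.0)
clipped = 1.0 - circle_survivors(rho, 5_000.0) / full

print(f"box candidates at 5 km:   {full:.0f}")
print(f"box candidates at 2.5 km: {half:.0f}")
print(f"fraction clipped in phase two: {clipped:.3f}")
```

With those assumed numbers a single hub sees about 20,000 box candidates at 5 km and about 5,000 at 2.5 km, which is the quartering the formula predicts, and phase two discards a little over a fifth of them regardless of radius.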

Edge Cases & Gotchas

Antimeridian and pole wrap. A bounding box that straddles ±180° longitude ends up with a bound outside [−180°, +180°] or, once normalized, with min_lon > max_lon. Either way the naive range predicate silently drops every point on the far side of the line, and in the normalized case it returns nothing at all. Detect the wrap in Python and split into two boxes (>= min_lon OR <= max_lon). Near the poles, the cos(latitude) longitude widening blows up, so clamp the longitude delta to 180° rather than dividing by a near-zero cosine.
CRS drift. Mixing coordinate reference systems is the classic silent corruptor. If some nodes were ingested as Cartesian point({x, y}) and others as WGS 84 point({latitude, longitude}), point.distance() returns null for the mixed pairs, so they silently fall out of the join. Normalize everything to EPSG:4326 at ingestion, conforming to the OGC Simple Features specification.
Coordinate precision traps. Storing coordinates as truncated floats (5 decimal places ≈ 1.1 m) is usually fine, but down-casting to float32 somewhere in the Python path introduces meter-scale jitter that flips edges in and out near the radius boundary. Keep the full float64 precision the driver gives you.
Cartesian explosion from a missing anchor. If the driving MATCH fails to bind a single hub (typo’d label, wrong id), the inner MATCH runs unanchored against the whole label — the exact O(L × R) blow-up the pattern exists to prevent. Always anchor the driver by a unique, constrained id.
GDS projection staleness. A projected graph is a snapshot. Edges written by a Cypher join after projection are invisible to a subsequent gds.knn run, and points added after projection are missing entirely. Re-project before each GDS pass, and drop the named graph afterward to free heap.
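The antimeridian split described above can be sketched as a small client-side helper. This is one possible shape, not a prescribed implementation: `LonRange` and `split_longitude` are hypothetical names, and the input is assumed to be the raw interval (center minus delta, center plus delta), which may run past ±180°.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LonRange:
    min_lon: float
    max_lon: float


def split_longitude(min_lon: float, max_lon: float) -> list[LonRange]:
    """Split a possibly out-of-range longitude interval into valid ranges.

    Returns one or two ranges, each inside [-180, 180].
    """
    if max_lon - min_lon >= 360.0:
        return [LonRange(-180.0, 180.0)]  # the box covers every longitude
    if min_lon < -180.0:
        # Overshoots the western edge: wrap the excess to the eastern side.
        return [LonRange(min_lon + 360.0, 180.0), LonRange(-180.0, max_lon)]
    if max_lon > 180.0:
        # Overshoots the eastern edge: wrap the excess to the western side.
        return [LonRange(min_lon, 180.0), LonRange(-180.0, max_lon - 360.0)]
    return [LonRange(min_lon, max_lon)]


east_wrap = split_longitude(179.5, 180.5)
west_wrap = split_longitude(-180.25, -179.75)
no_wrap = split_longitude(10.0, 11.0)
print(east_wrap, west_wrap, no_wrap)
```

Each returned range can become its own driver row with the same latitude bounds, which keeps every predicate a plain range comparison instead of introducing an OR into the query.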

Verification & Testing

A spatial join is only correct if every written edge genuinely satisfies the predicate and no qualifying pair was missed. Assert both directions: recompute distance for a sample of written edges in Python and confirm it is within radius, and spot-check that a known close pair actually produced an edge.

import math
import pytest
import neo4j

def haversine_m(a: tuple[float, float], b: tuple[float, float]) -> float:
    R = 6_371_000.0
    (lat1, lon1), (lat2, lon2) = a, b
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    h = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return R * 2 * math.asin(math.sqrt(h))

@pytest.mark.asyncio
async def test_every_served_edge_within_radius(driver: neo4j.AsyncDriver):
    max_m = 5_000.0
    query = """
    MATCH (h:LogisticsHub)-[s:SERVES]->(c:DeliveryPoint)
    RETURN h.location.latitude AS hlat, h.location.longitude AS hlon,
           c.location.latitude AS clat, c.location.longitude AS clon,
           s.distance_m AS stored
    LIMIT 5000
    """
    async with driver.session() as session:
        result = await session.run(query)
        async for r in result:
            recomputed = haversine_m((r["hlat"], r["hlon"]), (r["clat"], r["clon"]))
            # No edge should exceed the radius...
            assert recomputed <= max_m + 1.0
            # ...and the stored distance must match the geometry within 0.5%.
            assert abs(recomputed - r["stored"]) <= 0.005 * recomputed

For completeness, count expected versus actual edges on a small fixture where you know the answer by hand, and assert no duplicate SERVES edges exist between any pair (MATCH (h)-[s:SERVES]->(c) WITH h, c, count(s) AS n WHERE n > 1 RETURN count(*) must return 0). The duplicate check is what catches a CREATE that should have been a MERGE.
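The expected-versus-actual comparison on a hand-built fixture can be sketched as a small diff over (hub_id, point_id) pairs. The helper name `diff_edges` and the fixture ids are hypothetical; the actual pairs are assumed to come from a query that returns one row per SERVES edge.

```python
from collections import Counter


def diff_edges(expected, actual):
    """Compare expected (hub_id, point_id) pairs with the edges actually written.

    expected: set of pairs known by hand for the fixture.
    actual:   list of pairs, one entry per SERVES edge in the database.
    """
    counts = Counter(actual)
    written = set(counts)
    return {
        "missing": sorted(expected - written),     # qualifying pairs with no edge
        "unexpected": sorted(written - expected),  # edges that should not exist
        "duplicates": sorted(pair for pair, n in counts.items() if n > 1),
    }


expected = {("hubA", "p1"), ("hubA", "p2"), ("hubB", "p3")}
actual = [("hubA", "p1"), ("hubA", "p1"), ("hubB", "p3"), ("hubB", "p9")]
report = diff_edges(expected, actual)
print(report)
```

A passing fixture yields three empty lists; a non-empty duplicates list is the signature of a CREATE that should have been a MERGE.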

FAQ

Why does my spatial join blow up to O(n²) even with an index?

Because the inner MATCH is not being resolved by an index seek. Two common causes: the driving node never binds (wrong id or label), so the inner label runs unanchored against every node; or the proximity predicate is only point.distance(...) <= r with no bounding-box range comparison ahead of it. Add the four-corner box on location.latitude/location.longitude, anchor the driver by a unique constrained id, and confirm a POINT INDEX exists on the inner label. PROFILE should show PointIndexSeekByRange, not NodeByLabelScan feeding a Filter.

Should I use CREATE or MERGE for the join edges?

Use MERGE in any pipeline that can retry. CREATE writes a fresh SERVES edge every run, so a transient failure that triggers a re-run leaves duplicate edges that corrupt downstream counts and routing weights. MERGE is idempotent — it updates distance_m on the existing edge instead. The small cost is a uniqueness check per pair, which is negligible next to the distance math.

Plain Cypher join or GDS KNN — which should I reach for?

Use the two-phase Cypher join for incremental, driver-supplied, exact-radius work where you control batching and want great-circle meters. Use gds.knn for periodic full-graph rebuilds where you want the k-nearest across every node in one parallel pass and can tolerate a similarity approximation. GDS operates on a projected snapshot and on a similarity metric, not live meters, so validate its output against point.distance() whenever exact radii matter.

How big should each join batch be?

Start at roughly 1,000 driving nodes per transaction and tune from the transaction-log growth and peak heap you observe. Larger batches amortize network round trips but hold index locks longer and raise memory pressure; smaller batches release locks sooner and keep memory flat at the cost of more round trips. Because the join is a write, the batch size that works depends on output cardinality, not just input count — a radius join in a dense center may need far smaller batches than the same query in a rural region.

My join misses points near the date line. What is wrong?

Your bounding box straddles the ±180° antimeridian, so it has min_lon > max_lon and the simple >=/<= range matches nothing. Detect the wrap when you compute the box in Python and split it into two predicates joined by OR (one running up to +180°, one from −180°). The same care applies near the poles, where the cos(latitude) longitude widening must be clamped to ±180° rather than dividing by a near-zero cosine.

Index-Probe Spatial Joins in Cypher — the nested-index-loop join that replaces a Cartesian product with per-row index seeks.
Snapping GPS Telemetry to Road Segments — map-matching noisy fixes to the nearest segment by perpendicular distance.
Distance Filter Query Patterns — the bounding-box-then-distance predicate that every join phase one depends on.
K-Nearest-Neighbor Routing — when nearest-k assignment outgrows a join and needs a dedicated technique.
Cypher Performance Tuning — the PROFILE-driven loop for keeping the join’s inner access index-bound.
Spatial Indexing Strategies — choosing the point index that makes the join seekable on both sides.
OSM Data Ingestion Pipelines — where the join step attaches freshly ingested geometry to the existing graph.

This guide is part of Cypher Spatial Queries & Pathfinding Patterns.

For authoritative reference on native spatial functions and geometry standards, consult the Neo4j Cypher Spatial Functions Documentation, the Neo4j GDS KNN documentation, and the OGC Simple Features Specification.
