Filtering Graph Paths by Haversine Distance in Cypher

A variable-length MATCH over a dense road or transit network expands combinatorially, and if you compute spherical distance after the paths are materialized, the JVM heap fills with millions of permutations before a single one is rejected — the symptom is a routing endpoint that passes staging and then throws OutOfMemoryError or times out the moment real traffic clusters in one city. The root cause is post-query evaluation: the planner buffers every candidate path, then the application filters. This page resolves it by pushing a cumulative point.distance() accumulator into the WHERE pipeline of the path match itself, so geometrically implausible branches are discarded as the expansion runs rather than after it finishes. This is the segment-level case of the broader distance filter query patterns — where each hop is checked, not just the endpoint.

Prerequisites & Versions

The accumulator below relies on native point support and the reduce() list function, both stable on Neo4j 5.x. The Python side uses the official async driver; no APOC or GDS dependency is required for the filter itself.

Requirement	Minimum version	Install / note
Python	3.10+	`tuple[str, str]` and union syntax used below
Neo4j	5.13+	Native `point`, `CREATE POINT INDEX`, `point.distance()`
neo4j (driver)	5.x	`AsyncGraphDatabase`, native point serialization

pip install "neo4j>=5.18"

This pattern assumes your graph already follows sound node and edge spatial mapping conventions — coordinates stored as native point values on the nodes, edge lengths kept distinct from routing weights — and that a spatial indexing strategy backs the anchor property so the initial node lookup seeks rather than scans.

Implementation

The query computes cumulative great-circle distance across every relationship in a bounded variable-length path, enforces a hard meter threshold, and returns only surviving paths sorted shortest-first. The reduce() accumulator walks the relationship stream, summing point.distance() between each edge’s start and end node.

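// One-time: native point index on location, used by spatial predicates such as a bounding-box pre-filter.
// The {id: ...} anchor lookup below needs its own index or uniqueness constraint on Location.id.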
CREATE POINT INDEX location_spatial_idx IF NOT EXISTS
FOR (l:Location) ON (l.location);

MATCH path = (start:Location {id: $start_id})-[:CONNECTS*1..8]->(end:Location {id: $end_id})
WITH path,
     reduce(cumulative_m = 0.0, r IN relationships(path) |
         cumulative_m + point.distance(startNode(r).location, endNode(r).location)) AS dist_m
WHERE dist_m <= $max_meters
RETURN path, dist_m
ORDER BY dist_m ASC
LIMIT 50

Driven from a pooled async service, the whole thing is a single self-contained coroutine. Thresholds and node ids are passed as parameters so the driver serializes them into the binary protocol and the plan stays cacheable:

import asyncio
from neo4j import AsyncGraphDatabase

QUERY = """
MATCH path = (start:Location {id: $start_id})-[:CONNECTS*1..8]->(end:Location {id: $end_id})
WITH path,
     reduce(cumulative_m = 0.0, r IN relationships(path) |
         cumulative_m + point.distance(startNode(r).location, endNode(r).location)) AS dist_m
WHERE dist_m <= $max_meters
RETURN path, dist_m
ORDER BY dist_m ASC
LIMIT 50
"""


async def filter_paths_by_haversine(
    uri: str,
    auth: tuple[str, str],
    start_id: str,
    end_id: str,
    max_km: float,
) -> list:
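    # Built per call only to keep the example self-contained; a long-lived
    # service should create one driver at startup and reuse its pool.
    driver = AsyncGraphDatabase.driver(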
        uri,
        auth=auth,
        max_connection_pool_size=50,
        connection_acquisition_timeout=5.0,
        max_connection_lifetime=3600,
    )
    try:
        async with driver.session(database="neo4j") as session:
            result = await session.run(
                QUERY,
                start_id=start_id,
                end_id=end_id,
                max_meters=max_km * 1000,  # point.distance() returns meters
            )
            return [record["path"] async for record in result]
    finally:
        await driver.close()


if __name__ == "__main__":
    paths = asyncio.run(
        filter_paths_by_haversine(
            "neo4j+s://your-cluster.databases.neo4j.io",
            ("neo4j", "secure-password"),
            start_id="N-1001",
            end_id="N-2087",
            max_km=12.0,
        )
    )
    print(f"Resolved {len(paths)} paths within the distance envelope.")

How It Works

The performance comes from where the predicate runs, not from any exotic function. Three mechanics carry it:

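Inline pruning. The WHERE dist_m <= $max_meters clause sits immediately after the reduce() projection, so each candidate path is scored and, if over budget, rejected before it is buffered for ORDER BY and RETURN. Paths that blow the envelope never reach the heap as result rows. The check applies to each complete candidate as it streams out of the expansion; it does not cut a partial branch mid-hop, which is why the hop cap still matters.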
Native spherical arithmetic. point.distance() operates directly on WGS 84 (SRID 4326) coordinates and applies the great-circle (Haversine) formula server-side. No custom trigonometric UDFs, no client round-trips, no external geospatial library.
Bounded recursion. *1..8 caps the expansion horizon. An unbounded [:CONNECTS*] will materialize the entire connected component regardless of the distance threshold — the cap is what keeps the combinatorics finite so the filter has anything to prune against.
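
To get a feel for why the hop cap matters, the sketch below estimates a rough ceiling on candidate paths and per-edge distance evaluations. It assumes a uniform out-degree (the `branching` parameter is hypothetical) and ignores the end-node constraint and relationship uniqueness, so real counts will be lower; treat it as an order-of-magnitude aid, not a model of the planner.

```python
def expansion_upper_bound(branching: int, max_hops: int) -> tuple[int, int]:
    """Return (candidate_paths, distance_evaluations) for hop counts 1..max_hops.

    Assumes every node has exactly `branching` outgoing CONNECTS edges and
    ignores the end-node constraint, so this is a ceiling, not a forecast.
    """
    paths = 0
    evaluations = 0
    for hops in range(1, max_hops + 1):
        paths_at_depth = branching ** hops
        paths += paths_at_depth
        # The accumulator calls point.distance() once per edge of each path.
        evaluations += paths_at_depth * hops
    return paths, evaluations


for cap in (4, 8, 12):
    paths, evals = expansion_upper_bound(branching=3, max_hops=cap)
    print(f"*1..{cap}: up to {paths:,} paths, {evals:,} distance evaluations")
```

Even with a modest out-degree, each extra hop multiplies the work, which is why lowering the cap is usually a stronger lever than tightening the distance threshold.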

A coordinate-ordering caveat threads through all of it: WGS 84 points built from x and y keys with a WGS 84 CRS (or from the Python driver's WGS84Point tuple) follow the (longitude, latitude) order, so the unambiguous named form, point({latitude: $lat, longitude: $lon}), is preferred at ingestion. Misordered coordinates introduce silent spatial drift that compounds across every hop of a multi-segment path. Validate coordinate ingestion before indexing, not at query time.
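
A minimal ingestion-side check might look like the following. The function name and the returned dictionary shape are assumptions made for illustration; the keys are chosen so they can be passed as parameters to point({latitude: $coords.latitude, longitude: $coords.longitude}).

```python
def validate_wgs84(latitude: float, longitude: float) -> dict[str, float]:
    """Return named coordinates for use as a query parameter, or raise ValueError."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(
            f"latitude {latitude} outside [-90, 90]; were latitude and longitude swapped?"
        )
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude {longitude} outside [-180, 180]")
    return {"latitude": latitude, "longitude": longitude}


coords = validate_wgs84(48.8566, 2.3522)  # Paris, named explicitly
print(coords)
```

A range check only catches a swap when the longitude magnitude exceeds 90 (Sydney at longitude 151.2 is caught, Paris at 2.35 is not), so it complements the named form rather than replacing it.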

Why the cumulative accumulator rather than a single endpoint check: bounding only the destination tells you where a route may finish, but a path can wander far outside the envelope and still land near the target. Summing per-segment distance bounds the route’s total length as it expands, which is the semantics routing actually needs. Bounding the endpoint as well (a cheap index-seekable range predicate) is a useful first-stage complement covered in the parent distance filter query patterns.
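
The difference between the two checks can be reproduced client-side. The sketch below mirrors the reduce() accumulator in plain Python; the Earth radius constant, the helper names, and the sample coordinates are assumptions for illustration, and the server's own constant may differ slightly.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean radius, not taken from the server


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def cumulative_m(waypoints):
    """Sum per-segment distances, the same shape as the reduce() accumulator."""
    return sum(haversine_m(p, q) for p, q in zip(waypoints, waypoints[1:]))


start, end = (52.5200, 13.4050), (52.5300, 13.4050)
direct = [start, end]
detour = [start, (52.5200, 13.6000), (52.5300, 13.6000), end]  # wanders east first

for name, route in (("direct", direct), ("detour", detour)):
    endpoint = haversine_m(route[0], route[-1])
    print(f"{name}: endpoint {endpoint:.0f} m, cumulative {cumulative_m(route):.0f} m")
```

Both routes share the same endpoints, so an endpoint-only check cannot tell them apart; only the cumulative sum exposes the detour.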

Common Failure Patterns

1. Full label scan instead of an index seek. Variable-length matches with reduce() still scan the whole label if no index backs the anchor lookup. Run PROFILE and read the plan bottom-up: a healthy query shows an index seek feeding the expansion. For the query above, which anchors on id, that means a NodeIndexSeek or NodeUniqueIndexSeek on Location.id; the point index on location comes into play (typically as NodeIndexSeekByRange) only once a spatial predicate such as a bounding-box pre-filter is added. If you see NodeByLabelScan, the seek failed, usually because the index is missing, the indexed property holds mixed types (strings alongside points), or the index is in a FAILED state.

SHOW INDEXES YIELD name, type, state, properties
WHERE 'location' IN properties;  // state must read ONLINE

2. Unit and threshold mismatch. point.distance() returns meters, always. Passing a kilometer value straight into $max_meters silently filters at 1/1000th the intended radius and returns an empty set — or, inverted, returns everything. Normalize at the boundary (the Python helper multiplies max_km * 1000) and never mix units inside the accumulator.
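
One way to normalize at the boundary is a small conversion helper that every caller goes through before the value reaches $max_meters. The function name and the supported unit table are illustrative assumptions, not part of the driver.

```python
_METERS_PER_UNIT = {"m": 1.0, "km": 1000.0, "mi": 1609.344}


def to_meters(value: float, unit: str) -> float:
    """Convert a caller-supplied threshold to meters, the unit point.distance() returns."""
    if unit not in _METERS_PER_UNIT:
        raise ValueError(f"unsupported unit {unit!r}; expected one of {sorted(_METERS_PER_UNIT)}")
    if value <= 0:
        raise ValueError("distance threshold must be positive")
    return value * _METERS_PER_UNIT[unit]


print(to_meters(12, "km"))
```

Requiring an explicit unit at the call site makes a bare number impossible to pass by accident.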

3. Null distances from mixed CRS. A geographic point({latitude, longitude}) (SRID 4326) and a Cartesian point({x, y}) (SRID 7203) are not comparable; point.distance() across SRIDs returns null, and a null term poisons the reduce() sum so the whole path silently drops. Assert SRID consistency at ingestion, and guard defensively if your graph mixes frames:

WITH path,
     reduce(c = 0.0, r IN relationships(path) |
         c + coalesce(point.distance(startNode(r).location, endNode(r).location), 0.0)) AS dist_m

Use coalesce only as a diagnostic crutch — a path that needs it has a data-quality problem upstream, not a query problem.

Performance Notes

point.distance() computes the great-circle distance between two coordinates on a sphere of radius $R$:

$$d = 2R \arcsin\!\left(\sqrt{\sin^{2}\!\left(\tfrac{\Delta\phi}{2}\right) + \cos\phi_1\,\cos\phi_2\,\sin^{2}\!\left(\tfrac{\Delta\lambda}{2}\right)}\,\right)$$

where $\phi_1,\phi_2$ are the endpoint latitudes and $\Delta\phi,\Delta\lambda$ the latitude and longitude deltas. The cost is a handful of trig calls per relationship per candidate path — cheap individually, but the accumulator pays it on every edge of every surviving branch, so total CPU scales with (paths × average depth), not with node count.
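
As a sanity check on the formula, consider two points on the same meridian exactly one degree of latitude apart, so $\Delta\lambda = 0$ and $\Delta\phi = \pi/180$. The radius $R \approx 6371\ \text{km}$ used here is an assumed mean Earth radius; the server's constant may differ slightly.

$$d = 2R \arcsin\!\left(\sqrt{\sin^{2}\!\left(\tfrac{\Delta\phi}{2}\right)}\,\right) = 2R \cdot \tfrac{\Delta\phi}{2} = R\,\Delta\phi \approx 6371 \times 0.0174533 \approx 111.2\ \text{km}$$

A one-degree latitude step should therefore accumulate roughly 111 km; a result far from that for such a pair usually points at swapped coordinates or a unit error rather than at the formula.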

That product is where the strategy boundaries live:

Heap pressure on dense grids. Even a tight threshold can leave thousands of valid permutations within 4–5 hops in a highly connected urban graph. The reduce() accumulator carries state per path, so heap grows roughly linearly with the surviving-path count. Tighten $max_meters, lower the *1..8 cap, or pre-filter both endpoints with a bounding box before expanding.
Not a shortest-path guarantee. This filters by cumulative distance and returns every path under the threshold sorted post-filter — it does not find the optimal route. For true shortest paths, delegate to the Neo4j GDS library’s gds.shortestPath.dijkstra or gds.shortestPath.astar; the trade-offs are laid out under k-nearest-neighbor routing.
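Plan-cache thrashing. Parameter values such as $max_meters do not by themselves force a recompile; what defeats the cache is a change in the query text, for example a hop cap or literal threshold interpolated into the string. Keep the query text fixed, pass thresholds as parameters, and pin them to discrete tiers (5 km, 10 km, 25 km) so the set of query shapes stays small and the plan cache stays warm. The full PROFILE-driven tightening loop is documented in Cypher performance tuning.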
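
Pinning thresholds to tiers can be done with a small rounding helper on the client. The sketch below rounds a requested radius up to the next allowed tier; the tier values come from the text above, while the function name and the choice to reject oversized requests are assumptions.

```python
import bisect

TIERS_KM = (5.0, 10.0, 25.0)  # must stay sorted ascending


def snap_to_tier(requested_km: float, tiers: tuple[float, ...] = TIERS_KM) -> float:
    """Round a requested radius up to the nearest allowed tier."""
    if requested_km <= 0:
        raise ValueError("radius must be positive")
    idx = bisect.bisect_left(tiers, requested_km)
    if idx == len(tiers):
        raise ValueError(f"{requested_km} km exceeds the largest tier ({tiers[-1]} km)")
    return tiers[idx]


print(snap_to_tier(7.2))
```

Rounding up widens the filter slightly, so a caller that needs the exact radius can still compare the returned dist_m against the original request.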

When traversals routinely exceed 8 hops, or you need multi-modal edge weighting (road distance combined with transit schedules), the inline reduce() stops being the right tool: switch to a bounding-box pre-filter feeding a GDS shortest-path pass. Keep connection_acquisition_timeout low (5 s in the driver configuration above) so a query that accidentally falls back to a scan fails fast instead of draining the pool. A timeout storm at peak is almost always a missing-seek symptom, not a pool-size one.
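
The bounding-box corners for such a pre-filter can be computed client-side. This is a minimal sketch under stated assumptions: a spherical Earth with an assumed mean radius, no handling of the poles or the antimeridian, and hypothetical function and key names.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean radius


def bounding_box(latitude: float, longitude: float, radius_m: float) -> dict[str, float]:
    """Return the corners of a box that encloses a circle of radius_m around a point."""
    # Angular radius in degrees of latitude.
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    # Meridians converge toward the poles, so the longitude span widens with latitude.
    dlon = dlat / math.cos(math.radians(latitude))
    return {
        "min_lat": latitude - dlat,
        "max_lat": latitude + dlat,
        "min_lon": longitude - dlon,
        "max_lon": longitude + dlon,
    }


box = bounding_box(52.5200, 13.4050, 10_000)
print(box)
```

The four values can then be passed as parameters to a range predicate on the location property, for example through point.withinBBox() with a lower-left and an upper-right point, so the point index narrows candidates before any expansion starts.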

Related

Distance Filter Query Patterns: the endpoint-and-bounding-box predicates this segment-level filter complements.
K-Nearest-Neighbor Routing: ranking bounded candidates and handing them to GDS shortest-path.
Spatial Join Techniques: index-probe joins when the distance filter must correlate against external datasets.
Cypher Performance Tuning: the PROFILE loop for keeping the anchor lookup index-backed.
Graph Query Planner Optimization: why predicate placement decides seek versus scan.

This guide is part of Distance Filter Query Patterns, within the Cypher Spatial Queries & Pathfinding Patterns pillar.

For authoritative reference on native spatial functions, consult the Neo4j Cypher Spatial Functions Documentation.
