Distance Filter Query Patterns for Spatial Graph Routing

A pathfinding query that does not bound its search radius will read the entire graph before it returns a single route. On a continental road network that means tens of millions of nodes scanned to answer a question that, geometrically, only touches a few hundred. The cost is not abstract: p99 latency climbs into seconds, the page cache thrashes, the connection pool drains under concurrency, and a routing endpoint that worked in staging falls over the first time real traffic clusters in a city. Distance filter query patterns fix this at the source — they apply a coordinate-anchored predicate that the spatial index can seek, so the engine enters the graph through a bounded window and the expensive distance math only ever runs on survivors. This guide shows how to build those predicates correctly for a Python-driven Neo4j workload, drive them from the async driver, profile them, and harden them against the precision and topology traps that silently corrupt results. It is one of the core techniques in Cypher Spatial Queries & Pathfinding Patterns.

Prerequisites

These examples assume an async Python service talking to a Neo4j instance with native point support. The point.distance() semantics and index-backed range predicates used below are stable on Neo4j 5.x; the bounding-box math is pure client-side Python and version-independent.

Requirement	Minimum version	Notes
Python	3.10+	Union types and structural `match` used in examples
Neo4j	5.13+	Native `point` type, `CREATE POINT INDEX`, index-backed range predicates
neo4j (driver)	5.18+	Async driver (`AsyncGraphDatabase`), native point serialization; matches the pin in the install command
pytest / pytest-asyncio	8.0+ / 0.23+	For the correctness assertions in the testing section

pip install "neo4j>=5.18" "pytest>=8.0" "pytest-asyncio>=0.23"

This pattern assumes your graph already follows sound node and edge spatial mapping conventions — coordinates stored as native point values on the primitives you actually filter, not as detached lat/lon strings that no index can seek — and that the right spatial indexing strategy backs the location property. Without an index, every pattern below degrades to a full label scan no matter how tight the predicate reads.

Core Concept & Mechanism

Neo4j represents geography with the native point() type, which defaults to the WGS 84 ellipsoid (SRID 4326) for latitude/longitude coordinates. The point.distance() function returns the great-circle distance in meters between two such points. The trap is that point.distance() is a computed function, not an indexable property: when it appears alone in a WHERE clause, the planner has no seekable range to descend into, so it falls back to a label scan and evaluates the function once per node. Complexity becomes O(n) in the label size, independent of how small the search radius is.

The fix is a two-stage predicate. Stage one constrains candidates with a coordinate-aligned bounding box: four simple range comparisons on location.latitude and location.longitude that the native point index (built on a space-filling curve rather than an R-tree) can seek directly. Stage two applies exact point.distance() only to the bounded survivors, clipping the box corners back to a true circle. In dense urban graphs this can collapse the candidate set by 90–99% before a single distance call runs, which is the difference between an index seek and a full scan.

The mechanism that makes stage one work is predicate push-down: the planner recognizes the range comparison as index-seekable and enters the graph through a PointIndexSeekByRange. The deeper cost-model and plan-selection details belong to graph query planner optimization; here it is enough to know that the bounding box is the predicate shape the planner can actually push down.

The bounding box itself is derived from the search radius. For a radius $r$ meters at latitude $\phi$ on a sphere of radius $R$, the half-extents in degrees are:

$$\Delta\phi = \frac{r}{R} \cdot \frac{180}{\pi}, \qquad \Delta\lambda = \frac{r}{R \cos\phi} \cdot \frac{180}{\pi}$$

The $\cos\phi$ term widens the longitude band toward the poles, where meridians converge. Computing this client-side keeps the box as plain query parameters the planner can seek, rather than forcing the engine to derive it per row.
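To make the widening concrete, the sketch below evaluates both half-extents for a 5 km radius at a few latitudes. It assumes the same mean spherical radius of 6,371,000 m used elsewhere in this guide; the function name half_extents_deg is illustrative and not part of any library.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean spherical radius, in meters


def half_extents_deg(lat_deg: float, radius_m: float) -> tuple[float, float]:
    """Return (delta_phi, delta_lambda) in degrees for a radius at a latitude."""
    d_lat = math.degrees(radius_m / EARTH_RADIUS_M)
    d_lon = math.degrees(radius_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return d_lat, d_lon


if __name__ == "__main__":
    # Latitude half-extent is constant; longitude half-extent grows with 1/cos(lat).
    for lat in (0.0, 40.7128, 60.0, 80.0):
        d_lat, d_lon = half_extents_deg(lat, 5_000.0)
        print(f"lat={lat:8.4f}  d_lat={d_lat:.5f}  d_lon={d_lon:.5f}")
```

The latitude half-extent stays near 0.045° everywhere, while the longitude half-extent doubles at 60° because cos 60° = 0.5.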

Schema & Data Model

The planner can only seek an index that exists, and only when the predicate shape matches it. Store coordinates as a native point on the node so the range comparison is index-backed, and keep edge weights separate so distance filtering and cost-weighted traversal stay independent.

// Native point index — backs the bounding-box range predicate and point.distance()
CREATE POINT INDEX road_node_location IF NOT EXISTS
FOR (n:RoadNode) ON (n.location);

// Lookup index on the stable id used to anchor route queries
CREATE INDEX road_node_id IF NOT EXISTS
FOR (n:RoadNode) ON (n.id);

// Representative shape of the indexed spatial graph
// (:RoadNode {id, location: point({srid:4326, latitude, longitude})})
//   -[:CONNECTED_TO {length_m, travel_s, weight}]->
// (:RoadNode)

Anchor the index on the property the predicate actually filters. A point index on a :RoadNode does nothing for a distance predicate evaluated over edge geometry, so for segment-level filtering keep a precomputed bounding box on the relationship instead. Edge weight stays distinct from raw length_m so that a distance filter (a spatial constraint) and a shortest-path cost (a traversal constraint) never get conflated — a mistake that produces routes that are short in kilometers but wrong in travel time.
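Because every pattern in this guide depends on the index actually existing and being populated, it can help to check that at service start-up. The following is a minimal sketch, assuming a Neo4j 5.x server where SHOW INDEXES yields the columns name, type, state, labelsOrTypes, and properties; the helper names has_online_point_index and point_index_ready are hypothetical.

```python
def has_online_point_index(rows: list[dict], label: str, prop: str) -> bool:
    """True if any SHOW INDEXES row is an ONLINE POINT index on (label, prop)."""
    for row in rows:
        if (
            row.get("type") == "POINT"
            and row.get("state") == "ONLINE"
            # Token lookup indexes report null here, hence the `or []`.
            and label in (row.get("labelsOrTypes") or [])
            and prop in (row.get("properties") or [])
        ):
            return True
    return False


async def point_index_ready(driver, label: str = "RoadNode", prop: str = "location") -> bool:
    """Fetch index metadata and check it; needs a reachable Neo4j server."""
    query = "SHOW INDEXES YIELD name, type, state, labelsOrTypes, properties"
    async with driver.session() as session:
        result = await session.run(query)
        rows = [record.data() async for record in result]
    return has_online_point_index(rows, label, prop)
```

Keeping the row check separate from the database call means the logic can be unit-tested against plain dictionaries without a running server.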

Step-by-Step Implementation

The workflow is: compute the bounding box client-side, pass it as parameters, let the index seek the box, then clip to the exact radius. We build it as runnable async code.

1. Compute the bounding box client-side

Deriving the box in Python keeps the corners as stable parameters the planner can seek. Never compute the box inside Cypher — a per-row trig expression cannot be pushed down to the index.

import asyncio
import math
from neo4j import AsyncGraphDatabase

EARTH_RADIUS_M = 6_371_000.0  # mean spherical radius


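def compute_bounding_box(lat: float, lon: float, radius_m: float) -> dict:
    """WGS84 degree-space bounding box for spatial-index pre-filtering.

    Spherical approximation; the longitude band widens with latitude via cos(phi).
    Longitudes are not wrapped, so a box near the antimeridian can exceed +/-180.
    """
    d_lat = math.degrees(radius_m / EARTH_RADIUS_M)
    # Floor the cosine so the division stays finite at the poles, and cap the
    # band at 180 degrees, beyond which it already covers every meridian.
    cos_phi = max(math.cos(math.radians(lat)), 1e-12)
    d_lon = min(math.degrees(radius_m / (EARTH_RADIUS_M * cos_phi)), 180.0)
    return {
        "min_lat": lat - d_lat, "max_lat": lat + d_lat,
        "min_lon": lon - d_lon, "max_lon": lon + d_lon,
    }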

2. Run the two-stage radius query through the async driver

Parameterized execution lets the driver serialize coordinates into the binary protocol, keeps the plan cacheable, and closes the door on injection. The bounding-box comparison seeks the index; the point.distance() guard clips the box corners back to a circle.

async def query_spatial_radius(driver, lat: float, lon: float, radius_m: float):
    bbox = compute_bounding_box(lat, lon, radius_m)
    query = """
    WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
    MATCH (n:RoadNode)
    WHERE n.location.latitude  >= $min_lat AND n.location.latitude  <= $max_lat
      AND n.location.longitude >= $min_lon AND n.location.longitude <= $max_lon
    WITH n, target, point.distance(n.location, target) AS dist_m
    WHERE dist_m <= $radius
    RETURN n.id AS node_id, dist_m
    ORDER BY dist_m ASC
    LIMIT 200
    """
    async with driver.session() as session:
        result = await session.run(
            query, lat=lat, lon=lon, radius=radius_m, **bbox
        )
        return [record.data() async for record in result]

3. Wire it into a pooled async service

Tune max_connection_pool_size to the concurrency of your request handlers, and set connection_acquisition_timeout so that requests queued behind a query that accidentally falls back to a scan fail fast instead of piling up on a starved pool.

async def main():
    pool_config = {
        "max_connection_pool_size": 40,
        "connection_acquisition_timeout": 5.0,
        "max_transaction_retry_time": 10.0,
    }
    driver = AsyncGraphDatabase.driver(
        "neo4j+s://your-cluster.databases.neo4j.io",
        auth=("neo4j", "secure-password"),
        **pool_config,
    )
    try:
        nodes = await query_spatial_radius(driver, 40.7128, -74.0060, 5000)
        print(f"Resolved {len(nodes)} nodes within 5 km radius.")
    finally:
        await driver.close()


if __name__ == "__main__":
    asyncio.run(main())

Query Patterns & Variants

The same “within distance” intent takes several shapes. Pick the one whose anchor matches how the index is structured and how the result is consumed.

Variant A — bounded radius (the default). Box-then-distance, returning everything inside a fixed radius, sorted nearest-first. This is the shape from the implementation above and the one to reach for first.

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
MATCH (n:RoadNode)
WHERE n.location.latitude  >= $min_lat AND n.location.latitude  <= $max_lat
  AND n.location.longitude >= $min_lon AND n.location.longitude <= $max_lon
WITH n, point.distance(n.location, target) AS dist_m
WHERE dist_m <= $radius
RETURN n.id, dist_m ORDER BY dist_m LIMIT 200
// $min_*/$max_* always come from compute_bounding_box(); never derive the box in Cypher.

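Variant B — nearest-K without a fixed radius. When the question is “the closest k nodes” rather than “everything within r”, drop the distance guard but keep the bounding box so the index still bounds the scan. The result is only guaranteed correct when the k-th row lies within the radius the box was built from: a row in a box corner can be farther away than a node sitting just outside a box edge. Widen the box and re-run if fewer than k rows return or if the k-th distance exceeds that radius. This is the entry point shared with k-nearest-neighbor routing, where the bounded candidate set feeds a graph projection.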

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
MATCH (n:RoadNode)
WHERE n.location.latitude  >= $min_lat AND n.location.latitude  <= $max_lat
  AND n.location.longitude >= $min_lon AND n.location.longitude <= $max_lon
RETURN n.id, point.distance(n.location, target) AS dist_m
ORDER BY dist_m ASC LIMIT $k
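The widen-and-retry loop for Variant B can be sketched client-side as follows. This is an illustration under stated assumptions, not a driver feature: fetch is a hypothetical coroutine that runs the Variant B query with a box built from radius_m and returns rows sorted by ascending dist_m, and the row keys node_id and dist_m are assumed.

```python
import asyncio
from typing import Awaitable, Callable

Row = dict  # assumed keys: "node_id" and "dist_m"
Fetch = Callable[[float, int], Awaitable[list[Row]]]


async def nearest_k(fetch: Fetch, k: int, start_radius_m: float = 500.0,
                    max_radius_m: float = 50_000.0) -> list[Row]:
    """Double the box radius until the k nearest rows all lie inside the radius."""
    radius = start_radius_m
    while True:
        rows = await fetch(radius, k)
        # Rows beyond the radius sit in the box corners; a closer node could
        # lie just outside a box edge, so only in-radius rows are trusted.
        trusted = [r for r in rows if r["dist_m"] <= radius]
        if len(trusted) >= k or radius >= max_radius_m:
            return trusted[:k]
        radius = min(radius * 2.0, max_radius_m)


if __name__ == "__main__":
    async def fake_fetch(radius_m: float, k: int) -> list[Row]:
        # Stand-in for the database call: three nodes at fixed distances.
        rows = [{"node_id": i, "dist_m": d} for i, d in enumerate((120.0, 900.0, 2600.0))]
        return [r for r in rows if r["dist_m"] <= radius_m * 2 ** 0.5][:k]

    print(asyncio.run(nearest_k(fake_fetch, k=2)))
```

Doubling keeps the number of round trips logarithmic in the final radius; the max_radius_m cap stops a sparse region from expanding the box without limit, in which case fewer than k rows come back.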

Variant C — distance-pruned path expansion. Routing queries often need a distance filter on the traversal so the engine does not return geometrically implausible detours. Anchoring the start node by id and bounding the destination of a capped variable-length path keeps the result inside a corridor instead of letting it explode combinatorially. Note that the box predicate on dest is evaluated as a filter on expanded paths, not as an index seek, so the hop cap is what bounds the work. Pair this with weighted edges to prefer realistic transit, and cross-reference k-nearest-neighbor routing for ranking the resulting candidates.

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS origin
MATCH (start:RoadNode {id: $start_id})
MATCH path = (start)-[:CONNECTED_TO*1..8]->(dest:RoadNode)
WHERE dest.location.latitude  >= $min_lat AND dest.location.latitude  <= $max_lat
  AND dest.location.longitude >= $min_lon AND dest.location.longitude <= $max_lon
  AND point.distance(dest.location, origin) <= $radius
RETURN dest.id, reduce(c = 0.0, r IN relationships(path) | c + r.weight) AS cost
ORDER BY cost ASC LIMIT 25
// Cap the variable-length bound (*1..8); an unbounded star will materialize the whole component.

For segment-by-segment cumulative-distance accumulation along a path — the harder case where each hop is checked, not just the endpoint — see Filtering Graph Paths by Haversine Distance in Cypher.

Performance Tuning

Profiling is the whole game. EXPLAIN returns the plan without running it — use it in CI to assert plan shape; PROFILE runs the query and annotates each operator with real db hits and rows. Read the plan bottom-up and find the first operator whose rows count dwarfs the final result; that is where the predicate is failing to bound the scan.

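Confirm the seek, not the scan. A healthy radius query shows a PointIndexSeekByRange (or NodeIndexSeekByRange) at the base of the plan. If you see a NodeByLabelScan feeding a Filter on point.distance, push-down failed: the box predicate is missing, malformed, or sitting after an expansion. If the component-wise comparisons still do not seek on your server version, try expressing the same box with point.withinBBox(n.location, $lower_left, $upper_right), a predicate the Neo4j documentation lists as point-index-backed, and re-profile.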
Keep the box on the anchor. The four range comparisons must reference the node whose index you want to seek. Moving them downstream of a MATCH expansion defeats the index.
Parameterize everything. Literal coordinates baked into the query string force recompilation and thrash the plan cache. Pass $min_lat, $radius, etc. as parameters with stable numeric types.
Size the page cache for the hot region. Seeks only stay fast if the working set’s nodes and the point index live in memory. Mirror that client-side with a bounded max_connection_pool_size so you do not over-subscribe the server.
Trade accuracy for speed deliberately. Exact great-circle math via point.distance() carries measurable CPU cost. For micro-mobility or indoor routing, an approximate Euclidean distance over a projected CRS (such as EPSG:3857) can be acceptable, but the distortion grows with latitude and span — benchmark against your latency SLO before adopting it, and keep WGS 84 for anything continental.

This profiling loop — capture PROFILE, find the widest operator, tighten the predicate or add the index, re-profile — is the same one detailed in Cypher performance tuning. When distance filters need to correlate against external datasets (telemetry, POI catalogs), the join itself becomes the bottleneck; spatial join techniques cover index-probe joins that avoid the cross-product blowup.

Edge Cases & Gotchas

Mixed CRS coordinates. A geographic point({latitude, longitude}) (SRID 4326) and a Cartesian point({x, y}) (SRID 7203) are not comparable; point.distance() across SRIDs returns null, and a null predicate silently drops the row instead of erroring. Normalize CRS at ingestion and assert the SRID before querying.
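Antimeridian and polar wrap. A bounding box straddling ±180° longitude comes out of the client-side math with max_lon above 180° or min_lon below −180°; once wrapped back into range it has min_lon > max_lon. Either way, a naive range comparison misses every node on the far side of the seam. Split the box into two queries across the seam, and special-case high-latitude searches where the longitude band balloons past 180°.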
Coordinate precision traps. Float rounding on dense grids can make two endpoints “almost equal”, creating phantom dead-ends or duplicate nodes that distort distance results. Snap to a fixed tolerance during ingestion, not at query time.
The box is a square, the radius is a circle. Skipping the point.distance() guard returns the box corners — up to ~27% more area than the inscribed circle. Always keep the second stage if you need true radius semantics.
Unbounded variable-length paths. A [:CONNECTED_TO*] with no upper bound will materialize the entire connected component before any distance filter applies. Always cap the hop count and bound the endpoints.
Driver timeout masquerading as pool exhaustion. A query that falls back to a full scan holds its connection for seconds, so under load the requests waiting behind it exceed connection_acquisition_timeout and the pool looks drained. A timeout storm at peak traffic is usually a missing-seek symptom, not a pool-size problem.
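The antimeridian split mentioned above can be sketched in a few lines of client-side Python. The function name split_at_antimeridian is hypothetical, and the sketch assumes the search center's longitude already lies within [-180, 180] and that the box has not been wrapped.

```python
def split_at_antimeridian(min_lon: float, max_lon: float) -> list[tuple[float, float]]:
    """Split a longitude interval that overflows [-180, 180] into seekable ranges."""
    if max_lon - min_lon >= 360.0:
        return [(-180.0, 180.0)]  # band already covers every meridian
    if min_lon < -180.0:
        # Overflow to the west: the excess reappears at the eastern edge.
        return [(min_lon + 360.0, 180.0), (-180.0, max_lon)]
    if max_lon > 180.0:
        # Overflow to the east: the excess reappears at the western edge.
        return [(min_lon, 180.0), (-180.0, max_lon - 360.0)]
    return [(min_lon, max_lon)]
```

Each returned range would be run as its own bounded query with the same latitude limits, and the rows merged before the point.distance() guard or the final sort is applied.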

Verification & Testing

A distance filter is only safe if the bounded query returns the same rows as a brute-force scan, just faster. Assert both correctness (the right nodes, in the right order) and plan shape (a seek, not a scan) — a regression that turns the seek back into a scan changes only latency, so a correctness test alone will not catch it.

import pytest
from neo4j import AsyncGraphDatabase

SEED = """
CREATE (a:RoadNode {id: 1, location: point({srid:4326, latitude: 40.7128, longitude: -74.0060})})
CREATE (b:RoadNode {id: 2, location: point({srid:4326, latitude: 40.7300, longitude: -74.0100})})
CREATE (c:RoadNode {id: 3, location: point({srid:4326, latitude: 41.5000, longitude: -74.9000})})
"""


@pytest.mark.asyncio
async def test_bounded_radius_matches_bruteforce():
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))
    async with driver.session(database="neo4j") as s:
        await s.run("MATCH (n) DETACH DELETE n")
        await s.run(SEED)
        await s.run(
            "CREATE POINT INDEX road_node_location IF NOT EXISTS "
            "FOR (n:RoadNode) ON (n.location)"
        )

        # Ground truth: brute-force distance over every node, no bounding box.
        truth = await (await s.run(
            """
            WITH point({srid:4326, latitude: 40.7128, longitude: -74.0060}) AS t
            MATCH (n:RoadNode)
            WITH n, point.distance(n.location, t) AS d WHERE d <= 5000
            RETURN n.id AS id ORDER BY d
            """
        )).values()

        # Bounded two-stage query under test (box from compute_bounding_box).
        got = await (await s.run(
            """
            WITH point({srid:4326, latitude: 40.7128, longitude: -74.0060}) AS t
            MATCH (n:RoadNode)
            WHERE n.location.latitude  >= 40.6679 AND n.location.latitude  <= 40.7577
              AND n.location.longitude >= -74.0653 AND n.location.longitude <= -73.9467
            WITH n, point.distance(n.location, t) AS d WHERE d <= 5000
            RETURN n.id AS id ORDER BY d
            """
        )).values()

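    # Close the driver before asserting so a failing assertion cannot leak it.
    await driver.close()
    assert truth, "seed data should place at least one node inside the radius"
    assert got == truth, "bounded query must match brute-force result set"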

Pair this with a plan-shape check: run EXPLAIN on the bounded query, read the plan from result.consume(), and assert it contains a point index seek rather than a label scan. Run both assertions in CI so a refactor that drops the box predicate is caught before it ships.
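A plan-shape assertion along those lines might look like the sketch below. It assumes the driver exposes the EXPLAIN plan on the result summary as a nested dict with operatorType and children keys; the helper names operator_types, uses_index_seek, and explain_plan are hypothetical, and operator names are matched by substring because they can carry a suffix such as @neo4j.

```python
def operator_types(plan: dict) -> list[str]:
    """Flatten a plan tree into a list of operator names, root first."""
    names = [plan.get("operatorType", "")]
    for child in plan.get("children", []):
        names.extend(operator_types(child))
    return names


def uses_index_seek(plan: dict) -> bool:
    """True if the plan contains an index seek and no label or all-nodes scan."""
    names = operator_types(plan)
    has_seek = any("IndexSeek" in n for n in names)
    has_scan = any("NodeByLabelScan" in n or "AllNodesScan" in n for n in names)
    return has_seek and not has_scan


async def explain_plan(session, query: str, **params) -> dict:
    """Run EXPLAIN (the query is planned, not executed) and return the plan dict."""
    result = await session.run("EXPLAIN " + query, **params)
    summary = await result.consume()
    return summary.plan
```

In a test, `assert uses_index_seek(await explain_plan(s, query, **params))` would sit next to the correctness assertion, so a refactor that loses the seek fails CI even though the rows are unchanged.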

FAQ

Why does my point.distance query still scan the whole label?

Because point.distance() is a computed function, not an indexable property, so it cannot push down on its own. Add a bounding-box range predicate on location.latitude/location.longitude ahead of it, confirm a POINT INDEX exists on location, and keep the box predicate on the anchor node before any expansion. Run PROFILE and look for a PointIndexSeekByRange at the base of the plan; a NodeByLabelScan feeding a Filter means push-down failed.

Do I really need both the bounding box and the distance check?

Yes, for true radius semantics. The bounding box is what the index seeks, but it is a square that covers roughly 27% more area than its inscribed circle. The point.distance() guard clips those corners back to an exact radius. Drop it only when you genuinely want box semantics or are doing nearest-K, where ORDER BY ... LIMIT k replaces the radius.

Should I compute the bounding box in Python or in Cypher?

In Python. A box derived inside Cypher with per-row trig cannot be pushed down to the index, so the planner reverts to a scan. Compute the four corners client-side, pass them as parameters, and the range comparison becomes index-seekable while the plan stays cacheable.

Is approximate (Euclidean) distance ever safe to use?

For small, low-latitude extents — micro-mobility, indoor, single-campus routing — projecting to a Cartesian CRS such as EPSG:3857 and using straight-line distance can shave CPU at acceptable error. The distortion grows with latitude and span, so it is wrong for continental logistics. Keep WGS 84 and point.distance() as the default and benchmark any approximation against your accuracy and latency budget before adopting it.
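One way to benchmark the accuracy side of that trade is to compare projected straight-line distance against great-circle distance for a short step at different latitudes. The sketch below is illustrative only: it uses the spherical Web Mercator formulas with a radius of 6,378,137 m for both measurements so that only the projection distortion shows up, and the function names are hypothetical.

```python
import math

R_M = 6_378_137.0  # sphere radius used by EPSG:3857; reused for haversine here


def to_web_mercator(lat_deg: float, lon_deg: float) -> tuple[float, float]:
    """Project WGS 84 degrees to spherical Web Mercator meters."""
    x = R_M * math.radians(lon_deg)
    y = R_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters on a sphere of radius R_M."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R_M * math.asin(math.sqrt(a))


def mercator_overestimate(lat_deg: float, step_deg: float = 0.01) -> float:
    """Ratio of projected to great-circle distance for a short eastward step."""
    x1, y1 = to_web_mercator(lat_deg, 0.0)
    x2, y2 = to_web_mercator(lat_deg, step_deg)
    projected = math.hypot(x2 - x1, y2 - y1)
    return projected / haversine_m(lat_deg, 0.0, lat_deg, step_deg)
```

The ratio tracks 1/cos(latitude): close to 1.0 at the equator, roughly 1.3 at 40.7°, and 2.0 at 60°. Raw EPSG:3857 distances therefore need a latitude correction or a local projection before they are compared against a radius in meters.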

How do I filter distance along a multi-hop path, not just to an endpoint?

Endpoint filtering bounds where a route may finish; cumulative filtering bounds the route’s total length as it expands. For the latter, accumulate per-segment Haversine distance across relationships(path) and prune when the running sum exceeds tolerance. That segment-level technique is covered in detail in Filtering Graph Paths by Haversine Distance in Cypher.

Filtering Graph Paths by Haversine Distance in Cypher — segment-level cumulative distance pruning along variable-length paths.
Bounding-Box Search Across the Antimeridian — splitting a radius box that straddles the ±180° meridian so the index still seeks it.
K-Nearest-Neighbor Routing — feeding a bounded candidate set into a graph projection and shortest-path pass.
Spatial Join Techniques — index-probe joins for correlating spatial nodes with external datasets.
Cypher Performance Tuning — the PROFILE-driven loop for keeping these predicates index-backed.
Spatial Indexing Strategies — choosing the index that makes the bounding box seekable.

This guide is part of Cypher Spatial Queries & Pathfinding Patterns.

For authoritative reference on native spatial functions and geometry standards, consult the Neo4j Cypher Spatial Functions Documentation, the OGC Simple Features Specification, and ISO 19111.
