Cypher Spatial Queries & Pathfinding Patterns

A spatial routing API fails in three ways that all trace back to the query layer: a distance predicate skips the index and the p99 latency jumps from 20 ms to 4 seconds, a pathfinding call expands the whole graph because the cost function was never index-anchored, or a concurrent burst exhausts the connection pool and every request times out at once. This guide is for the backend and data engineers who own those failure modes — the people writing the Cypher and the async Python that turns a graph of coordinates into a route under a latency budget. It covers how the Cypher planner resolves spatial predicates, how to keep distance filters and nearest-neighbor searches index-backed, which traversal algorithm to reach for (shortestPath, Dijkstra, A*, or contraction hierarchies), and how to harden the whole path against the fragmentation and contention that only surface under production load.

The mental model for every query on this page: a request never touches the full graph. The planner seeds from the spatial index, hands a small candidate set to the traversal engine, and evaluates exact cost last.

Concept & Architecture

Cypher treats geography as a first-class type rather than a bolted-on extension. A coordinate is stored as a native point({latitude, longitude}) value with the WGS84 CRS, and the engine maintains a dedicated point index over that property. This is the structural reason graph routing outperforms a relational equivalent: in a tabular model a k-hop shortest path requires k self-joins on an edge table and the optimizer re-estimates join cardinality at every hop, whereas in a graph the same traversal is a sequence of constant-time pointer chases and the spatial index is consulted only at the endpoints, to anchor the origin and destination, never at the intermediate hops. The storage model that makes this possible is covered in depth under Spatial Graph Database Fundamentals for Python; this guide assumes that foundation and focuses on the query language on top of it.

That endpoint-only indexing is also the single most important invariant to protect. The moment a distance function appears in a WHERE clause in a form the planner cannot match to the point index (wrapped in arithmetic or applied to an unindexed property, for example), the query degrades to a full label scan: linear in node count, and quadratic the instant a second MATCH introduces a Cartesian product. Every pattern below exists to keep the index in the loop: pre-filter to a corridor, anchor the traversal inside it, then compute exact great-circle cost on the survivors.

Spatial primitives in Cypher are deliberately minimal. point.distance(a, b) returns the great-circle distance in meters between two WGS84 points; point.withinBBox(p, lowerLeft, upperRight) tests bounding-box containment using the index; and arithmetic on .latitude / .longitude accessors lets you build explicit corridors. There is no native polygon-contains in core Cypher, so polygon membership is approximated with a bounding box plus an exact test in Python — a split that mirrors the two-stage strategy throughout this guide.
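The bounding-box-plus-exact-test split described above can be sketched in plain Python. This is a minimal illustration, not the guide's own implementation: the helper names (bounding_box, point_in_polygon), the sample zone, and the candidate hubs are all hypothetical, and the ray-casting test treats longitude/latitude as planar, which is only reasonable for small polygons away from the poles and the antimeridian.

```python
from typing import Sequence

Coord = tuple[float, float]  # (longitude, latitude) in degrees


def bounding_box(polygon: Sequence[Coord]) -> tuple[Coord, Coord]:
    """Return (lower_left, upper_right) corners, the two arguments
    point.withinBBox needs for the index-backed first stage."""
    lons = [lon for lon, _ in polygon]
    lats = [lat for _, lat in polygon]
    return (min(lons), min(lats)), (max(lons), max(lats))


def point_in_polygon(point: Coord, polygon: Sequence[Coord]) -> bool:
    """Exact second stage: ray casting, counting how many polygon edges
    a ray heading east from the point crosses (odd means inside)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical triangular service zone and two candidates that both fall
# inside its bounding box (so both would survive the Cypher pre-filter).
zone = [(-74.02, 40.70), (-73.97, 40.70), (-74.02, 40.76)]
candidates = {"hub_a": (-74.01, 40.71), "hub_b": (-73.98, 40.75)}
members = [name for name, pt in candidates.items() if point_in_polygon(pt, zone)]
```

Only hub_a lies inside the triangle; hub_b passes the bounding-box stage but is rejected by the exact test, which is the case the second stage exists to catch.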

Schema Design

Routing queries are only as fast as the schema they run against. Three decisions determine whether the planner can stay index-backed.

Node property model and point type. Anchor every routable vertex on a single native point property, not separate lat/lon floats. Bare floats force the planner onto two independent range indexes it cannot combine for a distance predicate; a point property gives it one point-index seek. Keep a stable, application-assigned id (distinct from the internal element id) so ingestion and external systems can upsert idempotently.

CREATE CONSTRAINT location_id_unique IF NOT EXISTS
FOR (n:Location) REQUIRE n.id IS UNIQUE;

CREATE POINT INDEX location_coord IF NOT EXISTS
FOR (n:Location) ON (n.coord);

Relationship direction and cost. A CONNECTED_TO relationship carries the routing impedance and the access semantics. Store cost as a precomputed scalar (distance in meters, or travel_seconds for time-based routing) so traversal never recomputes geometry mid-query. Direction is load-bearing: a one-way segment is a single (:Location)-[:CONNECTED_TO]->(:Location), while a two-way segment is either two relationships or one queried without a direction arrow. Encode the travel mode (travel_mode) and temporal windows (valid_from, valid_to) as edge properties. The mechanics of deriving these directional edges from raw road geometry belong to Node and Edge Spatial Mapping.

CREATE INDEX rel_mode IF NOT EXISTS
FOR ()-[r:CONNECTED_TO]-() ON (r.travel_mode);

Tenant isolation. On multi-tenant routing platforms the isolation boundary must map to physical structure or it will be bypassed. A tenant_id property on every node and edge is the cheap option, but it is only safe when paired with a composite index so the filter is resolved at the storage tier, not applied post-scan — and when it is enforced in a query-builder layer rather than left to each caller. Database-per-tenant is the stronger, heavier alternative. The trade-offs and the access-control patterns that stop a route crossing a tenant boundary are detailed in Spatial Security Boundaries.
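One way to enforce the tenant filter in a query-builder layer is a small wrapper that every caller must go through. The sketch below is an assumption about how such a layer could look, not a prescribed design: tenant_scoped is a hypothetical helper, and the substring check is a deliberately crude guard (it confirms the parameter is referenced, not that it is used correctly), so it complements code review rather than replacing it.

```python
def tenant_scoped(cypher: str, params: dict, tenant_id: str) -> tuple[str, dict]:
    """Reject queries that never reference $tenant_id, and bind the value
    here so individual callers cannot supply or override it."""
    if "$tenant_id" not in cypher:
        raise ValueError("query must filter on $tenant_id")
    if "tenant_id" in params:
        raise ValueError("tenant_id is bound by the query builder, not the caller")
    # Return a new dict so the caller's params are left untouched.
    return cypher, {**params, "tenant_id": tenant_id}


# Usage: the tenant comes from the authenticated request context,
# never from user-supplied query parameters.
query, bound = tenant_scoped(
    "MATCH (n:Location {tenant_id: $tenant_id}) WHERE n.id = $id RETURN n.id AS id",
    {"id": "hub_01"},
    tenant_id="acme",
)
```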

Core Python Integration

The driver layer, not Cypher, is where most production incidents originate — in how Python acquires, scopes, and releases sessions. Use the official neo4j async driver, create exactly one driver per process (it is a connection-pool manager, not a connection), and scope each unit of work to its own session. The class below sets a bounded pool, an acquisition timeout so a saturated pool fails fast instead of hanging, and a connection lifetime that recycles sockets ahead of load-balancer idle limits. It also demonstrates idempotent spatial ingestion and a parameterized query helper that every later pattern reuses.

import asyncio
from neo4j import AsyncDriver, AsyncGraphDatabase


def init_router_driver(uri: str, user: str, password: str, pool_size: int = 25) -> AsyncDriver:
    """One driver per process: a bounded, fast-failing connection pool."""
    return AsyncGraphDatabase.driver(
        uri,
        auth=(user, password),
        max_connection_pool_size=pool_size,
        connection_acquisition_timeout=10.0,
        max_connection_lifetime=300,
        max_transaction_retry_time=30,
    )


class SpatialRouter:
    def __init__(self, driver: AsyncDriver):
        self.driver = driver

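    async def ingest_locations(self, batch: list[dict]) -> None:
        """Idempotent bulk upsert. UNWIND keeps the transaction flat;
        MERGE makes re-runs safe; point() registers the spatial index entry."""
        query = (
            "UNWIND $batch AS row "
            "MERGE (n:Location {id: row.id}) "
            "SET n.coord = point({latitude: row.lat, longitude: row.lon})"
        )

        async def upsert(tx):
            result = await tx.run(query, batch=batch)
            await result.consume()  # drain so server-side errors surface here

        async with self.driver.session(database="neo4j") as session:
            # Managed transaction: retried on transient errors for up to
            # max_transaction_retry_time; MERGE keeps each retry idempotent.
            await session.execute_write(upsert)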

    async def query(self, cypher: str, params: dict) -> list[dict]:
        """Run a parameterized read and materialize records inside the session
        so the connection is released the instant results are drained."""
        async with self.driver.session(database="neo4j") as session:
            result = await session.run(cypher, params)
            return await result.data()

    async def close(self) -> None:
        await self.driver.close()


async def main():
    driver = init_router_driver("neo4j://localhost:7687", "neo4j", "secure_password")
    router = SpatialRouter(driver)
    try:
        await router.ingest_locations([
            {"id": "hub_01", "lat": 40.7128, "lon": -74.0060},
            {"id": "hub_02", "lat": 40.7580, "lon": -73.9855},
        ])
        rows = await router.query(
            "MATCH (n:Location) RETURN count(n) AS total", {}
        )
        print(f"Indexed locations: {rows[0]['total']}")
    finally:
        await router.close()


if __name__ == "__main__":
    asyncio.run(main())

Four patterns in this script recur on every page of this guide: one driver built once and closed in a finally block; sessions opened with async with so they are always released; MERGE for idempotent writes so a retried batch never duplicates a node; and connection_acquisition_timeout converting pool exhaustion from an unbounded hang into a catchable error you can shed load on. The streaming and back-pressured variants for loading continental datasets through this same driver are covered under async batch processing for graphs, and the end-to-end loaders under OSM data ingestion pipelines.

Indexing & Query Planning

A well-tuned routing query follows a strict three-act sequence: the planner consults the spatial index, hands a small candidate set to the traversal engine, and only then evaluates exact distance and cost. Diverging from this ordering is the single most common cause of latency spikes.

Neo4j’s native point index is the correct default for road and logistics graphs (internally it maps points onto a space-filling curve stored in a B+tree rather than building a true R-tree); the full decision framework against geohash and quadtree alternatives, and the write-amplification trade-offs of each, lives in Spatial Indexing Strategies. What matters at the query layer is predicate push-down: the spatial filter must execute at the storage tier so the planner shrinks the candidate set before graph expansion. The query below is written to allow that: the origin is anchored by its unique id, the distance predicate compares the indexed coord property against a fixed radius, and LIMIT bounds the result. The expected plan is a point-index seek feeding the distance computation rather than a label scan followed by a filter.

// Index-backed radius search: seed from the point index, compute exact distance last
MATCH (origin:Location {id: $origin_id})
MATCH (target:Location)
WHERE point.distance(origin.coord, target.coord) < $radius_m
RETURN target.id AS id,
       point.distance(origin.coord, target.coord) AS meters
ORDER BY meters
LIMIT 20

Verify the plan, never assume it: run PROFILE and confirm that an index seek on location_coord (typically NodeIndexSeekByRange, not NodeByLabelScan + Filter) feeds the distance evaluation. When the planner picks the wrong starting point, an index hint or a reordered MATCH forces it back; the systematic approach to reading EXPLAIN/PROFILE and reshaping predicates so they stay sargable is the subject of Graph Query Planner Optimization. For dense urban grids where even a tight radius returns thousands of candidates, a coordinate-aligned bounding box applied before the distance math prunes harder still; that two-stage refinement is the focus of Distance Filter Query Patterns. One hard rule: never mix Cartesian and WGS84 points in the same predicate. Cypher does not coerce between coordinate systems; point.distance returns null when its two arguments have different CRSs, so the filter silently matches nothing. For WGS84 (EPSG:4326) points, Neo4j computes the great-circle distance natively.
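The coordinate-aligned bounding box mentioned above has to be computed on the client before the query runs. A minimal sketch follows, assuming a spherical Earth and a search circle that does not touch a pole or cross the antimeridian; radius_bbox, EARTH_RADIUS_M, and BBOX_RADIUS_QUERY are illustrative names, not part of any library.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (assumption: spherical model)


def radius_bbox(lat: float, lon: float, radius_m: float) -> dict:
    """Parameters for a box that encloses the circle of radius_m around
    (lat, lon). Not valid near the poles or across the antimeridian."""
    angular = radius_m / EARTH_RADIUS_M  # radius as an angle, in radians
    dlat = math.degrees(angular)
    # Meridians converge toward the poles, so the same ground distance
    # spans more degrees of longitude at higher latitudes.
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return {
        "lat": lat, "lon": lon, "radius_m": radius_m,
        "min_lat": lat - dlat, "max_lat": lat + dlat,
        "min_lon": lon - dlon, "max_lon": lon + dlon,
    }


# Two-stage predicate: the box is the coarse filter, the exact
# great-circle distance runs only on the points inside it.
BBOX_RADIUS_QUERY = """
WITH point({latitude: $lat, longitude: $lon}) AS center
MATCH (t:Location)
WHERE point.withinBBox(
        t.coord,
        point({latitude: $min_lat, longitude: $min_lon}),
        point({latitude: $max_lat, longitude: $max_lon}))
  AND point.distance(center, t.coord) < $radius_m
RETURN t.id AS id, point.distance(center, t.coord) AS meters
ORDER BY meters
LIMIT 20
"""

params = radius_bbox(40.7128, -74.0060, 1_000.0)
```

The returned dict can be passed as the parameter map for BBOX_RADIUS_QUERY; as with any rewritten predicate, the plan should be re-checked with PROFILE rather than assumed.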

The cost model behind every distance predicate is the Haversine great-circle formula, which also seeds the A* heuristic discussed next:

$$d = 2R \cdot \arcsin\sqrt{\sin^2\frac{\Delta\varphi}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\frac{\Delta\lambda}{2}}$$

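where $R$ is the mean Earth radius, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points in radians, and $\Delta\varphi$ and $\Delta\lambda$ are the differences in latitude and longitude, also in radians. Because no road between two points can be shorter than the great circle joining them, this value never overestimates the remaining road distance. That makes it an admissible heuristic whenever edge cost is distance in meters, which is the property that lets A* explore a fraction of the graph while still returning the optimal path. For time-based costs such as travel_seconds, divide it by the fastest speed in the network so it stays a lower bound.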
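For checks that run outside the database, the same formula is easy to reproduce client-side. The sketch below is a direct transcription of the equation above; haversine_m, travel_time_lower_bound_s, and the 36 m/s speed cap are illustrative assumptions, and the two sample coordinates are the hubs from the ingestion example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius R


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1                    # difference in latitude
    dlam = math.radians(lon2 - lon1)      # difference in longitude
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def travel_time_lower_bound_s(distance_m: float, max_speed_mps: float) -> float:
    """Admissible heuristic for time-based costs: nothing travels the
    straight line faster than the fastest edge in the network."""
    return distance_m / max_speed_mps


# hub_01 -> hub_02 from the ingestion example
straight_line_m = haversine_m(40.7128, -74.0060, 40.7580, -73.9855)
min_seconds = travel_time_lower_bound_s(straight_line_m, max_speed_mps=36.0)
```

For these two hubs the straight-line distance comes out at a little over 5 km, so any route between them costed below that figure indicates bad edge weights rather than a clever shortcut.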

Routing & Traversal Patterns

Once the corridor is pruned and the endpoints are anchored, the choice of traversal algorithm determines both correctness and latency. Four families cover almost every production case; the right pick depends on graph size, query volume, and how often the topology changes.

Breadth-first shortestPath is the built-in default and is correct only when edges are unweighted or you need hop count alone. It minimizes hops, not cost, so it is the wrong tool the instant impedance varies.

Dijkstra is the baseline for weighted shortest paths. It explores outward by cumulative cost and guarantees the optimal route, but it expands uniformly in all directions, so on a continental graph it touches far more nodes than necessary. Reach for it when you have no admissible heuristic, when edge weights have no geometric meaning, or when you need one-to-many cost surfaces. In Neo4j it runs through the Graph Data Science (GDS) library over an in-memory projection:

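// Project the routing graph into the GDS catalog.
// GDS projections carry numeric properties only, so the string travel_mode is
// left out and the point-valued coord cannot be projected. Numeric lat/lon node
// properties are assumed here to be written next to coord at ingestion, purely
// for this projection. Restricting to one travel mode needs a filtered
// (Cypher or subgraph) projection instead.
CALL gds.graph.project(
  'routing_graph',
  'Location',
  {
    CONNECTED_TO: { properties: ['distance'] }
  },
  { nodeProperties: ['lat', 'lon'] }
);

// Weighted shortest path by cumulative distance.
// sourceNode/targetNode take nodes (or internal node ids), not the
// application-level id, so resolve the endpoints first.
MATCH (source:Location {id: $origin_id}), (target:Location {id: $target_id})
CALL gds.shortestPath.dijkstra.stream('routing_graph', {
  sourceNode: source,
  targetNode: target,
  relationshipWeightProperty: 'distance'
})
YIELD totalCost, nodeIds, costs
UNWIND range(0, size(nodeIds) - 1) AS i
RETURN gds.util.asNode(nodeIds[i]).id AS location_id,
       costs[i] AS cost_so_far,
       totalCost;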

A* adds the Haversine straight-line distance to the destination as a heuristic that biases expansion toward the goal. With that admissible heuristic it returns the same optimal path as Dijkstra while exploring a fraction of the nodes, which makes it the default for interactive point-to-point routing on geographic graphs, where coordinates hand you the heuristic for free. GDS exposes it as gds.shortestPath.astar.stream; its latitudeProperty and longitudeProperty parameters name numeric node properties, because GDS projections carry numeric values only and cannot read a point such as coord directly. For multi-modal networks with mode-specific cost matrices, project a separate in-memory graph per travel mode rather than filtering relationships post-projection; filtering after the projection still pays to load the unused edges.
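To make the "same path, fewer expansions" claim concrete, here is a small in-process A* over a toy graph. It is an illustration of the algorithm only, not how GDS implements it: the five nodes, the detour factors, and every function name are made up for the example, and edge costs are built as Haversine distance times a detour factor of at least 1 so the heuristic stays admissible and consistent.

```python
import heapq
import math

EARTH_RADIUS_M = 6_371_008.8


def haversine_m(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in meters between two (lat, lon) pairs."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi, dlam = phi2 - phi1, math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def a_star(graph, coords, source, target, use_heuristic=True):
    """Return (path, cost, expanded). With use_heuristic=False the heuristic
    is zero everywhere and the search degenerates to plain Dijkstra."""
    def h(node):
        return haversine_m(coords[node], coords[target]) if use_heuristic else 0.0

    best = {source: 0.0}          # cheapest known cost from source
    parent = {}                   # back-pointers for path reconstruction
    frontier = [(h(source), source)]
    closed = set()                # safe because the heuristic is consistent
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        if node == target:
            path = [node]
            while path[-1] in parent:
                path.append(parent[path[-1]])
            return path[::-1], best[target], len(closed)
        for nxt, cost in graph.get(node, []):
            g = best[node] + cost
            if g < best.get(nxt, math.inf):
                best[nxt], parent[nxt] = g, node
                heapq.heappush(frontier, (g + h(nxt), nxt))
    return None, math.inf, len(closed)


# Toy network: A-B-C is the direct corridor, A-D-C a detour, A-E a dead end.
coords = {"A": (40.70, -74.00), "B": (40.71, -74.00), "C": (40.72, -74.00),
          "D": (40.71, -73.97), "E": (40.69, -74.00)}
graph: dict[str, list[tuple[str, float]]] = {n: [] for n in coords}
for u, v, detour in [("A", "B", 1.0), ("B", "C", 1.0), ("A", "D", 1.2),
                     ("D", "C", 1.2), ("A", "E", 1.0)]:
    w = haversine_m(coords[u], coords[v]) * detour  # cost >= straight line
    graph[u].append((v, w))
    graph[v].append((u, w))

astar_path, astar_cost, astar_expanded = a_star(graph, coords, "A", "C")
dij_path, dij_cost, dij_expanded = a_star(graph, coords, "A", "C", use_heuristic=False)
```

Both runs return the A, B, C corridor at the same cost; the heuristic-guided run simply never settles the dead-end node that the uniform expansion visits on the way.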

Contraction hierarchies and related preprocessing schemes trade build time for query speed. By precomputing shortcut edges over a node ordering they answer point-to-point queries on country-scale road networks in microseconds, but the preprocessing must be rebuilt when the topology changes — so they fit static or slowly-changing graphs, not networks under constant live edits.

The practical rule: start with A* for interactive point-to-point routing, fall back to Dijkstra when no heuristic applies or you need cost-to-all-targets, and invest in contraction hierarchies only once query volume on a stable graph justifies the preprocessing. Proximity-first workloads — “the five nearest depots that are actually reachable” — combine an index seed with a bounded expansion rather than a full shortest-path call:

async def find_knn_locations(router, lat: float, lon: float, k: int, max_dist_m: float):
    """KNN by spatial pre-filter then exact distance sort, memory-bounded by LIMIT."""
    query = (
        "WITH point({latitude: $lat, longitude: $lon}) AS query_pt "
        "MATCH (loc:Location) "
        "WHERE point.distance(query_pt, loc.coord) <= $max_dist_m "
        "RETURN loc.id AS id, point.distance(query_pt, loc.coord) AS dist "
        "ORDER BY dist ASC LIMIT $k"
    )
    return await router.query(
        query, {"lat": lat, "lon": lon, "max_dist_m": max_dist_m, "k": k}
    )

The max_dist_m bound is a hard spatial filter the index resolves, and LIMIT caps the result-set memory. The streaming priority-queue and traffic-weighted variants — where the distance metric folds in live impedance — are detailed under K-Nearest Neighbor Routing. When a query has to correlate two spatial sets — matching deliveries to the nearest available vehicle, or snapping events to road segments — the index-aware join strategies in Spatial Join Techniques avoid the Cartesian product that a naive double-MATCH produces.

Performance & Scale

Spatial query performance is a budget across three resources: heap, page cache, and the connection pool.

Memory budgets. Coordinate precision drives index depth. High-precision WGS84 coordinates deepen the point-index tree and lower cache-hit ratios; truncating to five decimal places (~1.1 m at the equator) is accurate enough for road routing and measurably shrinks the index. Size the page cache to hold the hot index pages and the most-traversed regions of the graph — if the routing working set spills to disk, p99 collapses. Keep the JVM heap separate and bounded; oversized heaps lengthen GC pauses that surface as periodic latency spikes.

Write amplification. Every node insert or coordinate update touches the point index, and under dense urban grids the resulting index page splits dominate write cost. Batch writes in bounded transactions (a few thousand operations each) so the index amortizes splits, and prefer append-then-reindex over interleaved single-row upserts during bulk loads.

Batch versus streaming ingestion. Materializing a whole network in memory before loading is the most common out-of-memory failure. Stream features through generators with back-pressure so the importer footprint stays flat regardless of dataset size — the async mechanics live under async batch processing for graphs.
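A flat importer footprint mostly comes down to never holding more than one batch at a time. The sketch below shows one way to do that with a generator; batched and load_in_batches are hypothetical helpers, the 5,000-row default is only a starting point in line with the "few thousand operations" guidance above, and router stands for any object with an async ingest_locations(batch) method such as the SpatialRouter class earlier.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` rows, pulling from the source lazily
    so only one batch is ever materialized."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


async def load_in_batches(router, rows: Iterable[dict], size: int = 5_000) -> int:
    """Write rows one bounded transaction at a time and return the row count.
    Awaiting each batch before reading the next is the back-pressure:
    the source is never consumed faster than the database accepts writes."""
    total = 0
    for chunk in batched(rows, size):
        await router.ingest_locations(chunk)
        total += len(chunk)
    return total
```

Passing a generator (for example one that parses features from a file as it is read) keeps memory proportional to the batch size rather than to the dataset.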

Connection-pool sizing and GC pressure. Size max_connection_pool_size to match the server’s effective query concurrency, not the number of application coroutines — an oversized pool just moves contention from client to the server’s lock manager. Watch for GC pauses correlated with large intermediate result sets; the fix is almost always pushing filters down so fewer rows are materialized, which ties directly into Cypher Performance Tuning — execution-plan analysis, relationship directionality, and memory-constrained aggregation.

Failure Modes & Hardening

Most spatial query outages take one of four shapes. Knowing the symptom-to-cause mapping turns a 2 a.m. page into a checklist.

Topology corruption. Self-intersecting geometry, duplicate coordinates, and misaligned directional edges create phantom paths that return wrong-but-plausible routes. The tripwire is a geodesic plausibility check: when the planned cost wildly exceeds the straight-line Haversine distance, suspect a topology defect. Harden against it by enforcing snapping tolerance and directional consistency at ingestion and running periodic degree-and-connectivity audits that flag orphaned nodes and one-way traps.
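The geodesic plausibility check can be a few lines of Python run against every returned route, or against a sample in a nightly audit. This is a sketch under stated assumptions: route cost is a distance in meters, the 3.0 ceiling on the detour ratio is an arbitrary placeholder to tune per network, and detour_ratio and is_plausible are illustrative names.

```python
import math

EARTH_RADIUS_M = 6_371_008.8


def haversine_m(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in meters between two (lat, lon) pairs."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi, dlam = phi2 - phi1, math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def detour_ratio(route_cost_m: float, origin, destination) -> float:
    """Planned distance divided by the straight-line distance."""
    straight = haversine_m(origin, destination)
    return math.inf if straight == 0 else route_cost_m / straight


def is_plausible(route_cost_m: float, origin, destination,
                 max_ratio: float = 3.0, tolerance: float = 0.01) -> bool:
    """Flag routes that are shorter than the great circle (bad edge weights)
    or far longer than it (phantom detours, one-way traps)."""
    ratio = detour_ratio(route_cost_m, origin, destination)
    return (1.0 - tolerance) <= ratio <= max_ratio


hub_01, hub_02 = (40.7128, -74.0060), (40.7580, -73.9855)
checks = {
    "normal": is_plausible(7_000, hub_01, hub_02),
    "wild_detour": is_plausible(60_000, hub_01, hub_02),
    "too_short": is_plausible(3_000, hub_01, hub_02),
}
```

A ratio below 1 is worth alarming on separately: it cannot come from road layout and almost always means a precomputed cost was written in the wrong unit or against the wrong pair of nodes.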

Index fragmentation. Frequent edge mutations leave the spatial index unbalanced and range-query latency creeps up until background compaction catches up. The recovery playbook: schedule online index rebuilds in low-traffic windows, monitor index page-fault rates, and prefer deferred or batched index updates on write-heavy partitions.

Connection-pool exhaustion. A leaked session, a slow query holding a connection, or a pool sized below real concurrency all present identically — requests hang, then fail at the acquisition timeout. The connection_acquisition_timeout from the integration code converts this from a hang into a fast, sheddable error. Recovery is to cap query time with transaction timeouts, open every session in an async with block, and alarm on pool-utilization percentage rather than on errors after the fact.
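Shedding load before requests reach the pool is one way to keep exhaustion from cascading. The sketch below is an application-level gate built only on asyncio, placed in front of the driver rather than replacing its pool settings; QueryGate, Overloaded, and all three limits are hypothetical names and values chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class Overloaded(Exception):
    """Raised when a request could not get a slot in time (shed it)."""


class QueryGate:
    def __init__(self, max_in_flight: int, queue_timeout_s: float, query_timeout_s: float):
        self._slots = asyncio.Semaphore(max_in_flight)
        self._queue_timeout_s = queue_timeout_s
        self._query_timeout_s = query_timeout_s

    async def run(self, make_query: Callable[[], Awaitable[T]]) -> T:
        # Wait briefly for a slot; failing fast here is the load shedding.
        try:
            await asyncio.wait_for(self._slots.acquire(), self._queue_timeout_s)
        except asyncio.TimeoutError:
            raise Overloaded("no query slot available") from None
        try:
            # Cap how long one query may hold its slot.
            return await asyncio.wait_for(make_query(), self._query_timeout_s)
        finally:
            self._slots.release()


async def demo() -> list:
    gate = QueryGate(max_in_flight=1, queue_timeout_s=0.05, query_timeout_s=1.0)

    async def slow_query() -> str:
        await asyncio.sleep(0.3)  # stands in for a long-running Cypher call
        return "ok"

    # Two concurrent requests, one slot: the second is shed, not queued forever.
    return await asyncio.gather(gate.run(slow_query), gate.run(slow_query),
                                return_exceptions=True)


results = asyncio.run(demo())
```

In a real service make_query would wrap a call such as router.query(...), and the Overloaded error would map to an HTTP 503 with a retry hint. Note that cancelling the client-side await does not by itself stop work on the server, so a server-side transaction timeout is still needed.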

Planner regression. A predicate that was index-backed yesterday can fall back to a full scan after a statistics refresh, a query rewrite, or a data-distribution shift. Guard against it by pinning critical queries with PROFILE-verified plans in CI, asserting the expected operator (point-index seek) appears, and re-checking after any schema or version change. The diagnostic workflow sits in Graph Query Planner Optimization.
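A CI guard for plan regressions can walk the profiled plan tree and assert on operator names. The sketch below assumes the plan is available as a nested dict with "operatorType" and "children" keys, which is how the Python driver is understood to expose ResultSummary.profile; confirm the key names against the driver version in use. The function names and the two sample plans are illustrative only.

```python
def plan_operators(plan: dict) -> list[str]:
    """Flatten a plan tree into operator names, dropping any
    '@database' suffix the server appends."""
    names = [plan["operatorType"].split("@")[0]]
    for child in plan.get("children", []):
        names.extend(plan_operators(child))
    return names


def is_index_backed(plan: dict) -> bool:
    """True when the plan contains an index seek and no label scan."""
    ops = plan_operators(plan)
    has_seek = any(op.startswith("NodeIndexSeek") for op in ops)
    has_scan = "NodeByLabelScan" in ops
    return has_seek and not has_scan


# Hand-written stand-ins for what a profiled radius query might return.
good_plan = {"operatorType": "ProduceResults@neo4j", "children": [
    {"operatorType": "Top@neo4j", "children": [
        {"operatorType": "NodeIndexSeekByRange@neo4j", "children": []}]}]}
regressed_plan = {"operatorType": "ProduceResults@neo4j", "children": [
    {"operatorType": "Filter@neo4j", "children": [
        {"operatorType": "NodeByLabelScan@neo4j", "children": []}]}]}
```

In a test suite, the dict would come from running the critical query prefixed with PROFILE against a seeded database and reading the profile off the consumed result summary, then failing the build when is_index_backed returns False.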

Operational Checklist

Use this as a pre-production gate and a recurring health review:

Schema validation — uniqueness constraint on Location.id; point index on coord; travel_mode and any tenant_id index present and confirmed used via PROFILE.
Index warm-up — hot index and graph regions resident in page cache before serving traffic; cold-start latency measured, not assumed.
Predicate push-down — every spatial and tenant filter verified as an index seek in PROFILE, never label-scan-then-filter.
Pool sizing — max_connection_pool_size matched to server query concurrency; connection_acquisition_timeout set; every session opened in async with.
Query bounding — every distance query carries a radius/bbox pre-filter and a LIMIT; no unbounded MATCH that risks a Cartesian product.
CRS hygiene — coordinates normalized to WGS84 at ingestion; no Cartesian/WGS84 mixing in a single predicate; precision truncated to routing tolerance.
Algorithm fit — A* for point-to-point, Dijkstra for cost-to-all-targets, contraction hierarchies only on stable graphs; GDS projections rebuilt on topology change.
Routing correctness — geodesic plausibility check on returned paths; degree/connectivity audit flagging orphans and one-way traps.
Monitoring hooks — alarms on pool utilization, index page-fault rate, GC pause duration, and p99 query latency.

Distance Filter Query Patterns — two-stage bounding-box-plus-distance pruning that keeps radius searches index-backed.
K-Nearest Neighbor Routing — streaming priority-queue expansion and traffic-weighted nearest-node search.
Spatial Join Techniques — correlating two spatial sets without a Cartesian product.
Cypher Performance Tuning — EXPLAIN/PROFILE analysis, index hints, and memory-constrained aggregation.
Isochrone and Service-Area Analysis — cost-bounded traversal that returns everywhere reachable within a time or distance budget.
Spatial Indexing Strategies — choosing R-tree, geohash, or quadtree indexes behind these queries.

This guide is a companion track in the Python for Spatial Graph Databases & Network Routing knowledge base; it builds on Spatial Graph Database Fundamentals for Python, feeds the loaders documented under Spatial Graph Construction & OSM Ingestion, and supplies the query layer for Network Routing Algorithms in Python.

For official implementation details, consult the Neo4j Cypher Manual on Spatial Indexes and Python’s asyncio documentation for event-loop scheduling.
