K-Nearest Neighbor Routing in Production Spatial Graphs

Asking “which depots are nearest?” with a straight-line distance returns the wrong answer the moment a river, a motorway with no on-ramp, or a one-way grid sits between the query point and the candidate. The crow-flies nearest hub can be a 40-minute detour while the second-nearest is five minutes down a through-road. K-nearest neighbor routing fixes this by ranking candidates on network travel cost instead of coordinate proximity: it first uses a spatial index to pull a small, bounded candidate set, then runs a real shortest-path pass over the road graph to re-rank those candidates by the cost a vehicle actually pays. Get the first phase wrong and the database scans the whole graph for every dispatch; get the second phase wrong and you assign the geometrically-close-but-unreachable hub, which surfaces as missed SLAs, idle vehicles, and angry operations dashboards. This guide builds that two-phase pattern as runnable async Python over Neo4j, profiles it, and hardens it against the precision and projection traps that quietly corrupt results. It is one of the core techniques in Cypher Spatial Queries & Pathfinding Patterns.

Prerequisites

These examples assume an async Python service talking to Neo4j with the Graph Data Science (GDS) library installed, since the second phase projects a subgraph and runs a shortest-path algorithm. The bounding-box math is pure client-side Python and version-independent; the point.distance() and index-backed range semantics are stable on Neo4j 5.x.

Requirement	Minimum version	Notes
Python	3.10+	Baseline assumed by this guide; the examples rely only on `typing` generics and `async` comprehensions
Neo4j	5.13+	Native `point` type, `CREATE POINT INDEX`, index-backed range predicates
neo4j (driver)	5.x	Async driver (`AsyncGraphDatabase`), native point serialization
Graph Data Science	2.5+	`gds.graph.project` Cypher aggregation, `gds.shortestPath.dijkstra` / `.astar`
pytest / pytest-asyncio	8.0+ / 0.23+	For the correctness assertions in the testing section

pip install "neo4j>=5.18" "pytest>=8.0" "pytest-asyncio>=0.23"

The graph this pattern runs against must already follow sound node and edge spatial mapping conventions — coordinates stored as native point values on the nodes you filter, and traversable segments stored as weighted directed relationships — backed by the right spatial indexing strategy on the location property. Without a point index, the candidate phase degrades to a full label scan and the latency win disappears.

Core Concept & Mechanism

K-nearest neighbor routing separates two questions that beginners conflate: who is geometrically close and who is cheapest to reach. Solving them in one pass is intractable — a true cost-ranked nearest-K over a continental graph would expand shortest paths to every node before sorting. The pattern instead runs in two phases, each using the data structure suited to its question.

Phase one (spatial pre-filter). A coordinate-aligned bounding box on location, expressed as an index-supported predicate such as point.withinBBox(), lets the native point index (built on a space-filling curve over a B+ tree, not an R-tree) seek a small candidate window directly, and point.distance() clips that box to a true radius. This is exactly the technique covered in distance filter query patterns; KNN routing consumes its output. Because straight-line distance is a lower bound on network distance, every node within the road-cost answer is guaranteed to sit inside a sufficiently generous straight-line radius, so over-fetching candidates here is safe, and under-fetching is the only correctness risk.

Phase two — network re-rank. The bounded candidate set is projected into an in-memory GDS graph along with the relationships connecting them to the origin, and a weighted shortest-path algorithm (Dijkstra, or A* when a geographic heuristic helps) computes the true travel cost from origin to each candidate. The candidates are then sorted by that cost and the top k returned. The straight-line ranking from phase one is discarded — it was only ever a filter, never the answer.
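
To make the two phases concrete, here is a minimal pure-Python sketch on a toy graph. It stands in for the index seek with a plain distance filter and for the GDS pass with a hand-written Dijkstra; the node names, planar coordinates, and weights are invented for illustration and are not part of the service built later in this guide.

```python
import heapq
import math

# Hypothetical toy data: planar coordinates in metres and directed, weighted edges.
COORDS = {"origin": (0, 0), "hub_a": (100, 0), "hub_b": (300, 0), "junction": (150, 200)}
EDGES = {
    "origin": [("junction", 250.0), ("hub_b", 320.0)],
    "junction": [("hub_a", 900.0)],  # hub_a sits behind a long detour
    "hub_a": [],
    "hub_b": [],
}


def straight_line(a, b):
    (ax, ay), (bx, by) = COORDS[a], COORDS[b]
    return math.hypot(ax - bx, ay - by)


def dijkstra(source):
    """Cheapest cost from source to every reachable node."""
    cost = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        c, node = heapq.heappop(heap)
        if c > cost.get(node, math.inf):
            continue  # stale heap entry
        for nxt, w in EDGES[node]:
            if c + w < cost.get(nxt, math.inf):
                cost[nxt] = c + w
                heapq.heappush(heap, (c + w, nxt))
    return cost


def two_phase_knn(origin, hubs, radius_m, k):
    # Phase one: straight-line pre-filter (the index seek in the real system).
    candidates = [h for h in hubs if straight_line(origin, h) <= radius_m]
    # Phase two: re-rank by network cost; unreachable candidates drop out.
    cost = dijkstra(origin)
    reachable = [(cost[h], h) for h in candidates if h in cost]
    return [h for _, h in sorted(reachable)[:k]]


geo_order = sorted(["hub_a", "hub_b"], key=lambda h: straight_line("origin", h))
net_order = two_phase_knn("origin", ["hub_a", "hub_b"], radius_m=500, k=2)
print(geo_order)  # ['hub_a', 'hub_b']
print(net_order)  # ['hub_b', 'hub_a']
```

The straight-line order and the network order disagree, which is the whole reason the second phase exists.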

The safety of over-fetching follows from the metric inequality. For any candidate node, its network distance $d_{net}$ is bounded below by its great-circle distance $d_{geo}$:

$$d_{geo}(\text{origin}, n) \le d_{net}(\text{origin}, n)$$

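So if you need the k cheapest-to-reach hubs, fetching the m straight-line nearest with $m > k$ (typically $m = 3k$ to $5k$) makes it very likely that the true answer is in the candidate set, and the inequality tells you when that is certain: once the k-th best network cost found is no larger than the straight-line radius you searched, no hub outside that radius can beat it. This assumes the weight is a distance in the same units as $d_{geo}$; for a travel-time weight, divide the radius by the network's top speed before comparing. The radius must also be wide enough to admit the detour factor of your network. A grid city has a detour factor near 1.3; mountain or coastal road networks can exceed 2.5, and the over-fetch must widen to match.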
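
The inequality can be turned into a cheap completeness check. The sketch below is one possible formulation, not part of the original service: the function name, its parameters, and the optional top-speed conversion for time-based weights are assumptions made for this example.

```python
from typing import Optional, Sequence


def is_provably_complete(
    ranked_costs: Sequence[float],      # network costs of the hubs found
    k: int,
    searched_radius_m: float,           # straight-line radius actually covered
    max_speed_mps: Optional[float] = None,  # set when costs are travel seconds
) -> bool:
    """True when no hub outside the searched radius can enter the top k."""
    if len(ranked_costs) < k:
        return False  # fewer than k reachable hubs found: widen and retry
    kth_cost = sorted(ranked_costs)[k - 1]
    # Cheapest conceivable cost for any hub beyond the radius.
    if max_speed_mps is None:
        floor_outside = searched_radius_m                  # weight is metres
    else:
        floor_outside = searched_radius_m / max_speed_mps  # weight is seconds
    return kth_cost <= floor_outside


print(is_provably_complete([320.0, 1150.0], k=2, searched_radius_m=500.0))    # False
print(is_provably_complete([320.0, 1150.0], k=2, searched_radius_m=1_200.0))  # True
print(is_provably_complete([90.0, 600.0], k=2, searched_radius_m=15_000.0,
                           max_speed_mps=25.0))                               # True
```

If the candidate query hit its LIMIT, the radius actually covered is the largest straight-line distance among the returned candidates, not the requested maximum, so pass that smaller value.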

Schema & Data Model

The candidate phase can only seek an index that exists, and the re-rank phase can only project relationships that carry a weight. Store coordinates as a native point on each node, keep a stable node_id for anchoring, and store the traversal cost as a dedicated relationship property distinct from raw length.

// Native point index — backs the bounding-box range predicate and point.distance()
CREATE POINT INDEX network_node_location IF NOT EXISTS
FOR (n:NetworkNode) ON (n.location);

// Lookup index on the stable id used to anchor route queries
CREATE INDEX network_node_id IF NOT EXISTS
FOR (n:NetworkNode) ON (n.node_id);

// Representative shape of the indexed routing graph
// (:NetworkNode {node_id, location: point({srid:4326, latitude, longitude})})
//   -[:CONNECTED_TO {length_m, travel_s, weight}]->
// (:NetworkNode)

Keep weight (the value the shortest-path algorithm minimizes) separate from length_m. If the business question is “nearest by drive time”, weight should be travel_s; conflating it with raw distance produces hubs that are short in kilometers but slow in minutes — the exact failure KNN routing exists to prevent. Hub or facility nodes can carry a second label (e.g. :Hub) so the candidate query filters to assignable targets only, rather than every intersection in the graph.
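
As a rough illustration of why the two properties diverge, the sketch below derives a travel-time weight from length and an assumed constant speed. The helper name, the segments, and the speeds are invented for this example; in practice travel_s would come from your own speed model or telemetry.

```python
def travel_seconds(length_m: float, speed_kmh: float) -> float:
    """Drive time for one segment at a constant assumed speed."""
    return length_m / (speed_kmh / 3.6)  # km/h -> m/s


# Hypothetical segments: (name, length_m, assumed speed in km/h)
segments = [("residential", 2_000.0, 30.0), ("motorway", 5_000.0, 100.0)]

shortest = min(segments, key=lambda s: s[1])[0]
fastest = min(segments, key=lambda s: travel_seconds(s[1], s[2]))[0]
print(shortest, fastest)  # residential motorway
```

The residential segment wins on length_m (2 km against 5 km) but loses on drive time (about 240 s against about 180 s), so the property you pass as the weight decides which hub is "nearest".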

Step-by-Step Implementation

The flow is: compute the bounding box client-side, seek an over-fetched candidate set through the index, project just those candidates and the origin’s neighborhood into GDS, run a weighted shortest path, and re-rank. We build it as runnable async code.

1. Compute the bounding box client-side

Deriving the box in Python keeps the four corners as stable parameters the planner can seek. Never compute the box inside Cypher — a per-row trig expression cannot be pushed down to the index.

import asyncio
import math
from typing import Dict, List, Tuple
from neo4j import AsyncGraphDatabase

EARTH_RADIUS_M = 6_371_000.0  # mean spherical radius


def compute_bounding_box(lat: float, lon: float, radius_m: float) -> Dict[str, float]:
    """WGS84 degree-space bounding box for spatial-index pre-filtering.

    Spherical approximation; the longitude band widens with latitude via cos(phi).
    """
    d_lat = math.degrees(radius_m / EARTH_RADIUS_M)
    d_lon = math.degrees(radius_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return {
        "min_lat": lat - d_lat, "max_lat": lat + d_lat,
        "min_lon": lon - d_lon, "max_lon": lon + d_lon,
    }
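
compute_bounding_box divides by cos(lat), which approaches zero near the poles. If your coverage area reaches high latitudes, a guarded variant along the following lines may help. This is a sketch with a hypothetical helper name, not a drop-in replacement: it clamps latitude and widens the longitude band conservatively, but it still does not handle boxes that wrap across the antimeridian.

```python
import math
from typing import Dict

EARTH_RADIUS_M = 6_371_000.0  # same mean spherical radius as above


def bounding_box_guarded(lat: float, lon: float, radius_m: float) -> Dict[str, float]:
    """Spherical bounding box with pole guards (hypothetical helper)."""
    d_lat = math.degrees(radius_m / EARTH_RADIUS_M)
    min_lat = max(lat - d_lat, -90.0)
    max_lat = min(lat + d_lat, 90.0)
    # Size the longitude band at the box edge closest to a pole, where a
    # degree of longitude is shortest, so the box never under-covers.
    cos_phi = math.cos(math.radians(max(abs(min_lat), abs(max_lat))))
    if cos_phi < 1e-9:
        d_lon = 180.0  # box touches a pole: every longitude is in range
    else:
        d_lon = min(math.degrees(radius_m / (EARTH_RADIUS_M * cos_phi)), 180.0)
    # Note: lon +/- d_lon may leave [-180, 180]; wrapping is left out here.
    return {
        "min_lat": min_lat, "max_lat": max_lat,
        "min_lon": lon - d_lon, "max_lon": lon + d_lon,
    }


equator = bounding_box_guarded(0.0, 0.0, 111_195.0)   # roughly one degree of arc
north = bounding_box_guarded(60.0, 0.0, 111_195.0)
polar = bounding_box_guarded(89.9, 0.0, 111_195.0)
print(equator["max_lon"] < north["max_lon"] < polar["max_lon"])  # True
```

The same radius needs about one degree of longitude at the equator, about two at 60 degrees north, and the full range once the box touches the pole.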

2. Seek an over-fetched candidate set through the point index

Fetch more candidates than you ultimately need (fetch_k = k * overfetch) so the straight-line filter cannot drop a hub that is geometrically farther but cheaper to reach. The bounding-box comparison seeks the index; point.distance() clips the corners to a circle and orders the survivors nearest-first.

async def fetch_candidates(
    driver,
    origin: Tuple[float, float],   # (lat, lon)
    fetch_k: int,
    max_radius_m: float,
) -> List[Dict[str, float]]:
    lat, lon = origin
    bbox = compute_bounding_box(lat, lon, max_radius_m)
    query = """
    WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
    MATCH (n:NetworkNode:Hub)  // both labels: the point index is defined on :NetworkNode
    WHERE point.withinBBox(
            n.location,
            point({srid: 4326, latitude: $min_lat, longitude: $min_lon}),
            point({srid: 4326, latitude: $max_lat, longitude: $max_lon}))
    WITH n, point.distance(n.location, target) AS geo_m
    WHERE geo_m <= $radius
    RETURN n.node_id AS node_id, geo_m
    ORDER BY geo_m ASC
    LIMIT $fetch_k
    """
    async with driver.session(database="neo4j") as session:
        result = await session.run(
            query, lat=lat, lon=lon, radius=max_radius_m,
            fetch_k=fetch_k, **bbox,
        )
        return [record.data() async for record in result]

3. Re-rank candidates by true network cost with GDS

Project a named graph (or use a transient projection) and run Dijkstra from the origin’s nearest network node to each candidate, then sort by totalCost. Project only the labels and relationship type you need so the in-memory graph stays small. The gds.graph.project Cypher aggregation is the current API; the legacy gds.graph.project.cypher procedure is deprecated.

async def rank_by_network_cost(
    driver,
    source_node_id: int,
    candidate_ids: List[int],
    k: int,
) -> List[Dict[str, float]]:
    query = """
    // Source node and the candidate targets resolved by stable id
    MATCH (src:NetworkNode {node_id: $source_node_id})
    MATCH (dst:NetworkNode) WHERE dst.node_id IN $candidate_ids
    WITH src, collect(dst) AS targets
    UNWIND targets AS dst
    CALL gds.shortestPath.dijkstra.stream('routing_graph', {
        sourceNode: src,
        targetNode: dst,
        relationshipWeightProperty: 'weight'
    })
    YIELD targetNode, totalCost
    RETURN gds.util.asNode(targetNode).node_id AS node_id,
           totalCost AS travel_cost
    ORDER BY travel_cost ASC
    LIMIT $k
    """
    async with driver.session(database="neo4j") as session:
        result = await session.run(
            query, source_node_id=source_node_id,
            candidate_ids=candidate_ids, k=k,
        )
        return [record.data() async for record in result]

4. Wire the two phases into a pooled async service

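The service caps max_connection_pool_size to request-handler concurrency and sets an acquisition timeout, so that when slow queries (for example one that accidentally fell back to a scan) hold every connection, waiting requests fail fast instead of queueing until the pool starves. The projection is created once and reused across requests: ensure_projection only runs a cheap existence check on each call, and you re-project only when the graph topology changes.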

class KNNRoutingService:
    def __init__(self, uri: str, auth: Tuple[str, str], pool_size: int = 40) -> None:
        self.driver = AsyncGraphDatabase.driver(
            uri, auth=auth,
            max_connection_pool_size=pool_size,
            connection_acquisition_timeout=5.0,
            max_transaction_retry_time=10.0,
        )

    async def ensure_projection(self) -> None:
        async with self.driver.session(database="neo4j") as session:
            await session.run("""
            CALL gds.graph.exists('routing_graph') YIELD exists
            WITH exists WHERE NOT exists
            MATCH (s:NetworkNode)-[r:CONNECTED_TO]->(t:NetworkNode)
            WITH gds.graph.project('routing_graph', s, t,
                 {relationshipProperties: r {.weight}}) AS g
            RETURN g.graphName AS name
            """)

    async def nearest_hubs(
        self, origin: Tuple[float, float], source_node_id: int,
        k: int = 3, overfetch: int = 4, max_radius_m: float = 15_000,
    ) -> List[Dict[str, float]]:
        await self.ensure_projection()
        candidates = await fetch_candidates(
            self.driver, origin, fetch_k=k * overfetch, max_radius_m=max_radius_m,
        )
        if not candidates:
            return []  # caller widens max_radius_m and retries
        ids = [c["node_id"] for c in candidates]
        return await rank_by_network_cost(self.driver, source_node_id, ids, k)

    async def close(self) -> None:
        await self.driver.close()


async def main():
    svc = KNNRoutingService(
        "neo4j+s://your-cluster.databases.neo4j.io",
        auth=("neo4j", "secure-password"),
    )
    try:
        hubs = await svc.nearest_hubs(origin=(40.7128, -74.0060), source_node_id=1001, k=3)
        for h in hubs:
            print(f"hub {h['node_id']}: {h['travel_cost']:.0f} cost units")
    finally:
        await svc.close()


if __name__ == "__main__":
    asyncio.run(main())

Query Patterns & Variants

The same “nearest by network cost” intent takes several shapes. Pick the one whose ranking metric and target shape match how dispatch consumes the result.

Variant A — pure spatial candidates (phase one only). When straight-line proximity is genuinely good enough (dense uniform grid, no barriers) skip the projection entirely and return the box-clipped nearest-K. This is the cheapest query and the fallback when GDS is unavailable.

WITH point({srid: 4326, latitude: $lat, longitude: $lon}) AS target
MATCH (n:NetworkNode:Hub)  // both labels: the point index is defined on :NetworkNode
WHERE point.withinBBox(
        n.location,
        point({srid: 4326, latitude: $min_lat, longitude: $min_lon}),
        point({srid: 4326, latitude: $max_lat, longitude: $max_lon}))
RETURN n.node_id, point.distance(n.location, target) AS geo_m
ORDER BY geo_m ASC LIMIT $k
// $min_*/$max_* always come from compute_bounding_box(); never derive the box in Cypher.

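Variant B (A* re-rank with a geographic heuristic). On large projections A* prunes far more of the search space than Dijkstra by using straight-line distance to the target as an admissible heuristic. Supply latitudeProperty/longitudeProperty so GDS can compute the heuristic. GDS projections hold numeric node properties, not point values, so every node must expose its latitude and longitude as separate numeric properties (named lat and lon here) and the projection must include them; the routing_graph projection built earlier carries only the relationship weight and would need those node properties added before this query can run.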

MATCH (src:NetworkNode {node_id: $source_node_id})
MATCH (dst:NetworkNode {node_id: $target_node_id})
CALL gds.shortestPath.astar.stream('routing_graph', {
    sourceNode: src,
    targetNode: dst,
    latitudeProperty: 'lat',
    longitudeProperty: 'lon',
    relationshipWeightProperty: 'weight'
})
YIELD totalCost
RETURN totalCost AS travel_cost
// A* wins when targets are far and the heuristic is tight; for tiny projections Dijkstra is simpler.

Variant C — multi-source assignment (which hub serves each request). Dispatch often inverts the question: given many open requests, assign each to its cheapest hub. Run a single-source shortest path from each hub over the candidate set and keep the minimum per request, which amortizes traversal across the batch instead of re-expanding per request.

UNWIND $hub_ids AS hub_id
MATCH (h:NetworkNode {node_id: hub_id})
CALL gds.shortestPath.dijkstra.stream('routing_graph', {
    sourceNode: h,
    relationshipWeightProperty: 'weight'
})
YIELD targetNode, totalCost
WITH gds.util.asNode(targetNode).node_id AS request_node, hub_id, totalCost
ORDER BY totalCost ASC
RETURN request_node, head(collect(hub_id)) AS assigned_hub, min(totalCost) AS cost
// Cap the candidate set; an all-pairs expansion over the full graph will exhaust heap.

When the candidates must be correlated against external datasets — live capacity feeds, demand telemetry — the join itself becomes the bottleneck; spatial join techniques cover the index-probe joins that avoid a cross-product blowup. For a full worked dispatch scenario, see Implementing KNN Search for Nearby Logistics Hubs.

Performance Tuning

Profiling is the whole game, and KNN routing has two cost centers to watch independently: the candidate seek and the projection traversal.

Confirm the candidate phase seeks, not scans. Run PROFILE on fetch_candidates; a healthy plan shows an index seek on the point index at the base, typically reported as NodeIndexSeekByRange against a POINT INDEX in Neo4j 5. A NodeByLabelScan feeding a Filter on point.distance means push-down failed: the box predicate is missing, malformed, not in an index-supported form, or sitting after an expansion. This profiling loop is the same one detailed in cypher performance tuning, and the cost-model reasoning behind plan selection belongs to graph query planner optimization.
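Tune the over-fetch to your detour factor, not a guess. Too low and you drop a reachable-but-distant hub; too high and the re-rank spends shortest-path runs on hubs that cannot win. Measure the ratio of network to straight-line distance on a sample of real routes, scale the straight-line radius by that ratio, and set overfetch high enough that fetch_k covers the hubs usually found inside the scaled radius.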
Reuse the projection. Re-projecting the graph per request is the most common latency killer. Project once, keep the named graph resident, and re-project only on topology change. For volatile graphs, project a regional sub-area sized to the candidate radius rather than the whole network.
Bound the candidate radius and k. An unbounded radius or a large k forces GDS to expand paths to far targets, allocating heap for relationships and triggering stop-the-world GC. Use a sliding window: fetch k * overfetch, rank, and widen the radius only if the best path cost exceeds a service threshold.
Prefer A* for far, sparse targets. When candidates sit far from the origin in a large projection, the A* heuristic prunes dramatically more than Dijkstra. For tight clusters the heuristic overhead is not worth it — benchmark both against your latency SLO.
Parameterize everything. Literal coordinates baked into the query string force recompilation and thrash the plan cache. Pass $min_lat, $radius, $k as parameters with stable numeric types.
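
Measuring the detour factor takes only a sample of routes with known network distance. The sketch below is one way to do it in plain Python; the sample coordinates and network distances are made up for illustration, and the nearest-rank percentile helper is a simplification rather than a prescribed method.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Route = Tuple[float, float, float, float, float]  # lat1, lon1, lat2, lon2, network_m


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float,
                radius_m: float = 6_371_000.0) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def detour_factors(routes: Iterable[Route]) -> List[float]:
    """Ratio of network distance to straight-line distance per route."""
    return [net / haversine_m(a, b, c, d) for a, b, c, d, net in routes]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for sizing a search radius."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: origin, destination, and measured road distance in metres.
sample = [
    (40.7128, -74.0060, 40.7306, -73.9866, 3_100.0),
    (40.7128, -74.0060, 40.7580, -73.9855, 6_900.0),
    (40.7128, -74.0060, 40.6892, -74.0445, 9_800.0),
]
factors = detour_factors(sample)
planning_factor = percentile(factors, 95)
print([round(f, 2) for f in factors], round(planning_factor, 2))
```

Planning against a high percentile rather than the mean keeps the occasional long detour, such as a water crossing, inside the candidate radius.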

Edge Cases & Gotchas

Empty candidate set at the edge of coverage. A radius tighter than the nearest hub returns nothing, and a naive caller reports “no hubs”. Detect the empty result, widen max_radius_m (e.g. double it), and retry up to a cap before declaring genuine non-coverage.
Disconnected candidate (in the box, unreachable on the graph). A hub can sit inside the radius yet have no directed path from the origin — a topology gap, a one-way trap, or an unmerged import. gds.shortestPath.dijkstra simply omits unreachable targets, so a candidate silently vanishes from the ranking. Assert that the returned count meets your minimum and fall back to the next candidates if not.
Straight-line under-fetch drops the true winner. If overfetch is too small for the network’s detour factor, the cheapest-to-reach hub never enters the candidate set and the answer is wrong but plausible-looking. This is the single most dangerous KNN bug because it produces no error — only a worse assignment. Validate overfetch against measured detour ratios.
Mixed CRS coordinates. A geographic point({latitude, longitude}) (SRID 4326) and a Cartesian point({x, y}) (SRID 7203) are not comparable; point.distance() across SRIDs returns null, and a null predicate silently drops the row. Normalize CRS at ingestion and assert the SRID before querying.
Stale projection after a graph write. GDS projections are in-memory snapshots. New or rewired edges added after projection are invisible to the shortest-path pass, so routes follow the old topology. Re-project (or use a write-through projection strategy) whenever the underlying graph changes.
weight conflated with distance. Minimizing length_m when the question is drive time returns short-but-slow routes. Keep the cost property that matches the business metric and pass it explicitly as relationshipWeightProperty.
Slow queries masquerading as pool exhaustion. A candidate query that falls back to a scan, or a per-request re-projection, holds its connection far longer than planned; under load every pooled connection is busy and waiting requests fail with connection_acquisition_timeout. A timeout storm at peak is usually a missing-seek or re-projection symptom, not a pool-size problem.
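
The first two gotchas can be handled with a small amount of caller-side logic. The sketch below shows one possible shape: a radius-doubling retry around any async fetch function, and a count check that flags a suspected unreachable candidate. Both helper names and the fake fetch used for the demo are hypothetical; in the real service the fetch would wrap fetch_candidates.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Fetch = Callable[[float], Awaitable[List[Dict[str, float]]]]


async def fetch_with_widening(
    fetch: Fetch, start_radius_m: float, cap_radius_m: float, min_count: int = 1,
) -> Tuple[List[Dict[str, float]], float]:
    """Double the radius until enough candidates appear or the cap is reached."""
    radius = min(start_radius_m, cap_radius_m)
    while True:
        rows = await fetch(radius)
        if len(rows) >= min_count or radius >= cap_radius_m:
            return rows, radius
        radius = min(radius * 2, cap_radius_m)


def unreachable_suspected(candidate_count: int, ranked_count: int, k: int) -> bool:
    """True when the re-rank returned fewer hubs than it could have."""
    return ranked_count < min(k, candidate_count)


async def _demo() -> Tuple[List[Dict[str, float]], float]:
    # Fake data standing in for the database: two hubs at fixed distances.
    hubs = [{"node_id": 7, "geo_m": 12_000.0}, {"node_id": 9, "geo_m": 30_000.0}]

    async def fake_fetch(radius_m: float) -> List[Dict[str, float]]:
        return [h for h in hubs if h["geo_m"] <= radius_m]

    return await fetch_with_widening(fake_fetch, 5_000.0, 40_000.0)


rows, used_radius = asyncio.run(_demo())
print(used_radius, [r["node_id"] for r in rows])  # 20000.0 [7]
```

An empty list returned together with a radius equal to the cap is the signal for genuine non-coverage; anything earlier is just a search that has not yet been widened far enough.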

Verification & Testing

KNN routing is only safe if the ranked result matches a brute-force network-cost ranking over the same candidates — the bounding box and over-fetch are an optimization, not a change in answer. Assert both correctness (the right hubs, in the right cost order) and that no reachable winner was dropped by an under-sized over-fetch.

import pytest
from neo4j import AsyncGraphDatabase

SEED = """
CREATE (q:NetworkNode:Hub {node_id: 1, location: point({srid:4326, latitude: 40.7128, longitude: -74.0060})})
CREATE (a:NetworkNode:Hub {node_id: 2, location: point({srid:4326, latitude: 40.7135, longitude: -74.0065})})
CREATE (b:NetworkNode:Hub {node_id: 3, location: point({srid:4326, latitude: 40.7150, longitude: -74.0090})})
CREATE (q)-[:CONNECTED_TO {weight: 600.0}]->(a)   // nearest in a straight line, expensive by road
CREATE (q)-[:CONNECTED_TO {weight: 90.0}]->(b)    // farther in km, cheaper to reach
"""


@pytest.mark.asyncio
async def test_knn_ranks_by_network_cost_not_geometry():
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))
    try:
        async with driver.session(database="neo4j") as s:
            await s.run("MATCH (n) DETACH DELETE n")
            await s.run(SEED)
            await s.run(
                "CREATE POINT INDEX network_node_location IF NOT EXISTS "
                "FOR (n:NetworkNode) ON (n.location)"
            )

            # Straight-line ranking of the hubs other than the origin.
            geo = await (await s.run(
                """
                MATCH (q:NetworkNode {node_id: 1})
                MATCH (h:Hub) WHERE h <> q
                RETURN h.node_id AS id
                ORDER BY point.distance(h.location, q.location) ASC
                """
            )).values()

            # Network-cost truth. Every hub in this seed is one hop from the
            # origin, so the edge weight is the shortest-path cost.
            truth = await (await s.run(
                """
                MATCH (q:NetworkNode {node_id: 1})-[r:CONNECTED_TO]->(h:Hub)
                RETURN h.node_id AS id ORDER BY r.weight ASC
                """
            )).values()
    finally:
        await driver.close()

    # Hub 2 is nearest in a straight line, but hub 3 is cheapest to reach by
    # road: the ranking must follow cost, not straight-line distance.
    assert geo[0] == [2], "seed must make geometry and cost disagree"
    assert truth[0] == [3], "nearest hub must be ranked by network cost, not geometry"

Pair this with a plan-shape check on the candidate query: run EXPLAIN, read the plan from result.consume(), and assert it contains a point index seek rather than a label scan. Run both in CI so a refactor that drops the box predicate or shrinks the over-fetch is caught before it ships.
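
A plan-shape assertion can be kept independent of the database by walking the plan as a nested dictionary. The sketch below assumes the plan exposes "operatorType" and "children" keys, which is how the Python driver's summary.plan is commonly shaped; the helper names and the sample plans are hand-written for illustration rather than captured from a server, and operator names can differ between Neo4j versions, so the match is by substring.

```python
from typing import Any, Dict, List


def operator_types(plan: Dict[str, Any]) -> List[str]:
    """Flatten a nested plan dict into a list of operator names."""
    ops = [str(plan.get("operatorType", ""))]
    for child in plan.get("children", []):
        ops.extend(operator_types(child))
    return ops


def candidate_phase_seeks(plan: Dict[str, Any]) -> bool:
    """True when the plan uses an index seek and no label or all-nodes scan."""
    ops = operator_types(plan)
    has_seek = any("IndexSeek" in op for op in ops)
    has_scan = any("NodeByLabelScan" in op or "AllNodesScan" in op for op in ops)
    return has_seek and not has_scan


# Hand-written plan fragments (assumed shape, not real server output).
healthy = {"operatorType": "ProduceResults", "children": [
    {"operatorType": "Top", "children": [
        {"operatorType": "Filter", "children": [
            {"operatorType": "NodeIndexSeekByRange", "children": []}]}]}]}
degraded = {"operatorType": "ProduceResults", "children": [
    {"operatorType": "Top", "children": [
        {"operatorType": "Filter", "children": [
            {"operatorType": "NodeByLabelScan", "children": []}]}]}]}

print(candidate_phase_seeks(healthy), candidate_phase_seeks(degraded))  # True False
```

In a real test the dictionary would come from running the candidate query prefixed with EXPLAIN and reading the plan attribute of the summary returned by result.consume().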

FAQ

Why not just use point.distance() and skip the graph traversal?

Because straight-line distance ignores the network. A hub across an unbridged river or behind a one-way grid can be the crow-flies nearest yet a long detour by road. point.distance() is correct only as a filter — it is a lower bound on travel cost, so it safely bounds the candidate set, but the final ranking must come from a shortest-path pass over weighted edges. Skip the traversal only when the network has no meaningful barriers.

How many candidates should I over-fetch before the network re-rank?

Enough to cover your network’s detour factor. Measure the ratio of network distance to straight-line distance on a sample of real routes: grid cities sit near 1.3, coastal or mountain networks can exceed 2.5. Set the straight-line radius and fetch_k = k * overfetch just above that ratio. Too low silently drops the true winner; too high wastes projection heap. Typical starting points are overfetch of 3–5.

Should I use Dijkstra or A* for the re-rank?

Dijkstra for small, tightly clustered candidate sets — it is simpler and the overhead of an A* heuristic is not repaid. A* for large projections with far targets, where the straight-line-to-target heuristic prunes a large fraction of the search space. A* needs latitudeProperty/longitudeProperty in the projection so it can compute an admissible heuristic. Benchmark both against your latency SLO on representative data.

Why does a hub inside the radius sometimes not appear in the result?

It is unreachable on the directed graph — a topology gap, a one-way trap, or an unmerged import means there is no path from the origin. gds.shortestPath.dijkstra simply omits unreachable targets, so the candidate vanishes silently. Assert the returned count meets your minimum and fall back to the next candidates; fix the underlying gap during ingestion so reachable hubs are never dropped.

Do I need to re-project the GDS graph for every request?

No, and doing so is the most common latency killer. Project the named graph once, keep it resident, and reuse it across requests; re-project only when the topology changes. For volatile graphs, project a regional sub-area sized to the candidate radius rather than the whole network. Remember that a projection is an in-memory snapshot — writes after projection are invisible until you refresh it.

Distance Filter Query Patterns — the bounded candidate-retrieval technique that feeds phase one.
Implementing KNN Search for Nearby Logistics Hubs — a full worked dispatch scenario for this pattern.
GDS kNN vs Bounded-Radius kNN in Neo4j — when to precompute a similarity graph versus seek per query.
Spatial Join Techniques — index-probe joins for correlating candidates with external capacity and demand feeds.
Cypher Performance Tuning — the PROFILE-driven loop for keeping the candidate phase index-backed.
Spatial Indexing Strategies — choosing the index that makes the candidate bounding box seekable.

This guide is part of Cypher Spatial Queries & Pathfinding Patterns.

For authoritative reference, consult the Neo4j Graph Data Science pathfinding documentation, the Neo4j Cypher spatial functions documentation, and the Python asyncio documentation.
