Implementing KNN Search for Nearby Logistics Hubs

Dispatch services that answer “which depots are closest to this drop-off?” stall the moment the query touches every LogisticsHub node: with no spatial index, Neo4j falls back to a NodeByLabelScan, computes point.distance() for millions of rows, then sorts the entire result before applying LIMIT k. Under concurrent dispatch load this exhausts page cache, spikes garbage-collection pauses, and pushes p99 latency past timeout thresholds. The fix is a two-step k-nearest-neighbor (KNN) lookup: a client-computed bounding box drives an index seek that shrinks the candidate set to hundreds of nodes, and only those survivors get the exact great-circle distance sort (point.distance() on WGS-84 points uses a spherical model). This page gives a complete, runnable implementation and the failure modes that bite in production. It is a focused recipe within the broader K-Nearest Neighbor Routing workflow.

Prerequisites & Versions

Component	Minimum version	Install / setup
Python	3.10	`pyenv install 3.10` (for `tuple[str, str]` typing)
`neo4j` async driver	5.14	`pip install "neo4j>=5.14"`
Neo4j server	5.x	native `POINT INDEX` support
Hub coordinates	—	stored as native `point` (WGS-84 / EPSG:4326)

KNN search depends on the same node-and-edge layout described in node and edge spatial mapping: each LogisticsHub carries a single location property of the native point type. Stringified JSON, flat arrays, or split lat/lon numeric properties silently disqualify the node from the spatial planner and force a scan.

Implementation

The implementation has three parts: an index, a client-side bounding-box helper, and an async service class that runs the parameterized seek-and-sort query.

First, create the spatial index. A native point index is what turns the latitude/longitude range predicate into a seek:

CREATE POINT INDEX hub_location_idx IF NOT EXISTS
FOR (h:LogisticsHub) ON (h.location)

Verify it with SHOW INDEXES: state must read ONLINE and type must be POINT. During ingestion enforce strict construction with point({latitude: $lat, longitude: $lon}) so every node lands in the index. Choosing the right index for road-graph workloads is covered in spatial indexing strategies; for nearest-hub lookups the point index is the only correct choice.
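One way to automate that check at deploy time is a small guard over the rows that SHOW INDEXES returns. The sketch below assumes each row has already been converted to a dict (for example with record.data()) and reads only the name, state, and type columns; the function name index_is_usable and the sample rows are hypothetical.

```python
from typing import Iterable, Mapping


def index_is_usable(rows: Iterable[Mapping[str, object]], index_name: str) -> bool:
    """Return True only if the named index exists, is ONLINE, and is a POINT index.

    `rows` is assumed to be the output of SHOW INDEXES, one mapping per index.
    """
    for row in rows:
        if row.get("name") == index_name:
            return row.get("state") == "ONLINE" and row.get("type") == "POINT"
    return False  # index not found at all


# Hand-written rows standing in for a real SHOW INDEXES result.
sample_rows = [
    {"name": "hub_location_idx", "state": "POPULATING", "type": "POINT"},
    {"name": "hub_id_idx", "state": "ONLINE", "type": "RANGE"},
]
print(index_is_usable(sample_rows, "hub_location_idx"))
```

A check like this could gate service start-up, so dispatch traffic is not routed to an instance whose index is still populating.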

Next, the bounding-box helper. A circular radius search is approximated as a rectangle so the index can resolve it as two range predicates. The longitude offset widens with latitude because meridians converge toward the poles:

import math
from typing import Dict


def compute_bounding_box(lat: float, lon: float, radius_km: float) -> Dict[str, float]:
    """Approximate a circular search radius as a lat/lon bounding box.

    Uses ~111.32 km per degree of arc (one degree of longitude at the
    equator) and divides the longitude offset by cos(latitude), since
    parallels shorten toward the poles.
    """
    lat_offset = radius_km / 111.32
    lon_offset = radius_km / (111.32 * math.cos(math.radians(lat)))
    return {
        "min_lat": lat - lat_offset,
        "max_lat": lat + lat_offset,
        "min_lon": lon - lon_offset,
        "max_lon": lon + lon_offset,
    }
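To see how the latitude correction behaves, the sketch below recomputes just the two offsets at a few latitudes. It is an illustration only: box_offsets is a hypothetical helper that mirrors the arithmetic above, and the sample latitudes are arbitrary.

```python
import math

KM_PER_DEGREE = 111.32  # same constant as compute_bounding_box


def box_offsets(lat: float, radius_km: float) -> tuple[float, float]:
    """Return (lat_offset, lon_offset) in degrees for a search radius."""
    lat_offset = radius_km / KM_PER_DEGREE
    lon_offset = radius_km / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return lat_offset, lon_offset


for lat in (0.0, 52.52, 70.0):
    d_lat, d_lon = box_offsets(lat, radius_km=25)
    print(f"lat {lat:5.2f}: +/-{d_lat:.4f} deg latitude, +/-{d_lon:.4f} deg longitude")
```

The latitude offset stays constant while the longitude offset grows as 1/cos(latitude). Very near ±90° the cosine approaches zero and the longitude offset grows without bound, so polar inputs need separate handling, as do boxes that cross the ±180° meridian.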

Finally, the async service. It computes the box client-side, passes every value as a parameter (so the planner caches one reusable plan), seeks on the box, and sorts only the survivors:

import asyncio
from neo4j import AsyncGraphDatabase
from neo4j.exceptions import Neo4jError

KNN_QUERY = """
WITH point({latitude: $lat, longitude: $lon}) AS query_point
MATCH (h:LogisticsHub)
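// point.withinBBox() is a predicate the point index can serve;
// comparisons on h.location.latitude / .longitude are not.
WHERE point.withinBBox(
        h.location,
        point({latitude: $min_lat, longitude: $min_lon}),
        point({latitude: $max_lat, longitude: $max_lon}))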
  AND point.distance(h.location, query_point) <= $radius_m
RETURN h.id AS hub_id, h.name AS hub_name,
       point.distance(h.location, query_point) / 1000.0 AS dist_km
ORDER BY dist_km ASC
LIMIT $k
"""


class LogisticsKNNService:
    def __init__(self, uri: str, auth: tuple[str, str], pool_size: int = 50):
        self.driver = AsyncGraphDatabase.driver(
            uri,
            auth=auth,
            max_connection_pool_size=pool_size,
            connection_acquisition_timeout=5.0,
            max_connection_lifetime=3600,
        )

    async def find_nearest_hubs(
        self, lat: float, lon: float, radius_km: float, k: int = 5
    ) -> list[dict]:
        bounds = compute_bounding_box(lat, lon, radius_km)
        params = {
            "lat": lat,
            "lon": lon,
            "k": k,
            "radius_m": radius_km * 1000.0,
            **bounds,
        }
        async with self.driver.session(database="neo4j") as session:
            try:
                result = await session.run(KNN_QUERY, **params)
                return [record.data() async for record in result]
            except Neo4jError as exc:
                raise RuntimeError(f"KNN query failed: {exc}") from exc

    async def close(self) -> None:
        await self.driver.close()


async def main() -> None:
    service = LogisticsKNNService("neo4j://localhost:7687", ("neo4j", "password"))
    try:
        hubs = await service.find_nearest_hubs(52.5200, 13.4050, radius_km=25, k=5)
        for hub in hubs:
            print(f"{hub['hub_name']:<24} {hub['dist_km']:.2f} km")
    finally:
        await service.close()


if __name__ == "__main__":
    asyncio.run(main())

How It Works

The query reads top to bottom but the planner executes it as a tight pipeline:

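The box predicate is the seekable part. A point index serves only a small set of predicates, among them point.withinBBox(h.location, lowerLeft, upperRight) and point.distance(); component-wise comparisons on h.location.latitude and h.location.longitude are not in that set and leave the planner with a NodeByLabelScan. With the box expressed through point.withinBBox(), built from $min_lat, $min_lon, $max_lat, and $max_lon, the plan starts with an index seek (NodeIndexSeekByRange on the POINT index in Neo4j 5 plans). This is the single change that separates a 5 ms query from a 5 s one.
point.distance() clips the corners. A bounding box is a square whose corners lie √2 ≈ 1.41 times the radius from the center and whose area is about 27% larger than the inscribed circle. The point.distance(...) <= $radius_m guard restores true radius semantics by discarding those corner false positives. When the seek is driven by the box, the guard runs as a Filter on the already-seeked rows, which is cheap because there are now only hundreds of them. PROFILE may instead show the seek driven by point.distance(), which the point index also supports; either way the remaining predicate is evaluated only on seeked rows.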
The sort is bounded. ORDER BY dist_km operates on the filtered survivor set, and LIMIT $k caps output. Sorting cost drops from O(N log N) over the whole label to O(M log M) where M is the box hit count.
Parameters keep the plan cached. Passing lat, lon, the four bounds, radius_m, and k as parameters means the planner compiles the query once and reuses it across every dispatch call, eliminating recompilation latency. The deeper mechanics of why parameterization preserves plan reuse are covered in optimizing Cypher query plans for spatial data.

This box-then-distance shape is the same primitive used across distance filter query patterns; KNN simply adds ORDER BY ... LIMIT k on top of the radius filter to rank rather than just select.
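The stages described in this section can be imitated in plain Python to check the logic without a database. This is a sketch under stated assumptions, not what Neo4j executes internally: the hubs are made-up dicts, knn_in_memory and haversine_km are hypothetical helpers, and the Earth radius is chosen so that one degree of arc is about 111.32 km, matching the bounding-box constant.

```python
import math

EARTH_RADIUS_KM = 6378.137  # assumption: equatorial radius, ~111.32 km per degree


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def knn_in_memory(hubs, lat, lon, radius_km, k):
    """Box filter, exact distance filter, sort, then limit: the query's stages."""
    d_lat = radius_km / 111.32
    d_lon = radius_km / (111.32 * math.cos(math.radians(lat)))
    # Stage 1: cheap rectangular pre-filter (the part an index would serve).
    in_box = [
        h for h in hubs
        if abs(h["lat"] - lat) <= d_lat and abs(h["lon"] - lon) <= d_lon
    ]
    # Stage 2: exact distance on the survivors only, dropping box corners.
    scored = [(haversine_km(lat, lon, h["lat"], h["lon"]), h["id"]) for h in in_box]
    in_radius = [pair for pair in scored if pair[0] <= radius_km]
    # Stage 3: rank by distance and keep the k nearest.
    return sorted(in_radius)[:k]


# Made-up hubs around Berlin; coordinates are illustrative only.
hubs = [
    {"id": "near", "lat": 52.53, "lon": 13.41},
    {"id": "mid", "lat": 52.60, "lon": 13.60},
    {"id": "corner", "lat": 52.74, "lon": 13.77},  # inside the box, outside 25 km
    {"id": "far", "lat": 48.14, "lon": 11.58},
]
for dist, hub_id in knn_in_memory(hubs, 52.52, 13.405, radius_km=25, k=2):
    print(f"{hub_id}: {dist:.1f} km")
```

The "corner" hub passes the rectangular pre-filter but is rejected by the exact distance check, which is the role the point.distance() guard plays in the Cypher query.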

Common Failure Patterns

1. The longitude upper-bound typo. When the box is written as component-wise comparisons, the most common copy-paste bug is writing the longitude guard as two lower-bound checks:

// WRONG: never closes the eastern edge of the box
WHERE h.location.longitude >= $min_lon AND h.location.longitude >= $max_lon

Because $max_lon > $min_lon, the second clause subsumes the first and the box becomes an unbounded half-plane: the predicate admits every hub east of max_lon and excludes the intended box, so the candidate set silently balloons while the true neighbors go missing. Always pair >= $min_lon with <= $max_lon.

2. Plan falls back to a label scan. If PROFILE shows NodeByLabelScan feeding a Filter instead of an index seek on the point index (NodeIndexSeekByRange in Neo4j 5 plans), the index is not being used. The usual causes are a mixed-type location property (some nodes hold strings), an index stuck in FAILED or POPULATING state, a box written as comparisons on h.location.latitude and h.location.longitude rather than point.withinBBox(), or building the bounding box inside Cypher with per-row trigonometry, which is not seekable. Fix: enforce point-typed ingestion, confirm SHOW INDEXES reads ONLINE, express the box with point.withinBBox(), and always compute the box in Python.
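A regression test can catch this fallback automatically by walking the plan tree and looking for a label scan. The sketch below assumes the plan is available as a nested dict with operatorType and children keys, which is the shape the Python driver is expected to expose on a result summary's plan or profile; the helper names and the hand-built plan are hypothetical.

```python
from typing import Iterator, Mapping


def plan_operators(plan: Mapping) -> Iterator[str]:
    """Yield every operator name in a plan tree, depth-first."""
    yield str(plan.get("operatorType", ""))
    for child in plan.get("children", []):
        yield from plan_operators(child)


def uses_label_scan(plan: Mapping) -> bool:
    """True if any operator in the plan is a NodeByLabelScan."""
    # Substring match because operator names may carry a suffix such as "@neo4j".
    return any("NodeByLabelScan" in op for op in plan_operators(plan))


# Hand-built plan standing in for a real summary.profile value.
bad_plan = {
    "operatorType": "ProduceResults@neo4j",
    "children": [
        {"operatorType": "Top@neo4j", "children": [
            {"operatorType": "Filter@neo4j", "children": [
                {"operatorType": "NodeByLabelScan@neo4j", "children": []},
            ]},
        ]},
    ],
}
print(uses_label_scan(bad_plan))
```

In a real test the dict would come from running the KNN query prefixed with EXPLAIN or PROFILE and reading the plan from the consumed result summary.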

3. Empty results near a sparse radius. A box calibrated for dense urban depots returns nothing in rural regions. Rather than widening the radius globally (which re-inflates M everywhere), retry with an expanding radius until k results appear:

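async def find_with_backoff(service, lat, lon, k=5, start_km=10, max_km=160):
    """Double the search radius until k hubs are found or max_km is passed."""
    hubs = []  # defined up front so the final return is safe if start_km > max_km
    radius = start_km
    while radius <= max_km:
        hubs = await service.find_nearest_hubs(lat, lon, radius, k)
        if len(hubs) >= k:
            return hubs
        radius *= 2
    return hubs  # best effort: whatever the largest radius tried returned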

Performance Notes

The whole point of the bounding box is to shrink the row count the engine sorts. With hub density $\rho$ (hubs per km²) and search radius $r$ km, the expected number of candidates the seek returns is the box area times density:

$$ M \approx \rho \cdot (2\Delta_{lat})(2\Delta_{lon}) \cdot 111.32^2 \cos\phi \quad\text{where}\quad \Delta_{lat} = \frac{r}{111.32},\;\; \Delta_{lon} = \frac{r}{111.32\,\cos\phi} $$

which simplifies to $M \approx 4\rho r^2$. The exact-distance guard then discards the corner overshoot, leaving roughly $\pi r^2 \rho$ true hits, so the box costs about $\tfrac{4}{\pi} \approx 1.27\times$ as many distance evaluations as the ideal circle, a negligible overhead for the index-seek payoff.
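As a quick sanity check on those estimates, the arithmetic can be run for a concrete case. The density used here (0.05 hubs per km²) is an illustrative assumption rather than a measured figure, and expected_counts is a hypothetical helper that assumes hubs are spread uniformly.

```python
import math


def expected_counts(density_per_km2: float, radius_km: float) -> tuple[float, float]:
    """Return (box candidates M, hits inside the circle) for uniform density."""
    box_candidates = 4 * density_per_km2 * radius_km ** 2
    circle_hits = math.pi * density_per_km2 * radius_km ** 2
    return box_candidates, circle_hits


# Illustrative density only: 0.05 hubs per km^2, i.e. one hub per 20 km^2.
m, hits = expected_counts(density_per_km2=0.05, radius_km=25)
print(f"box candidates ~{m:.0f}, true hits ~{hits:.0f}, ratio {m / hits:.2f}")
```

Because both counts scale with $r^2$, doubling the radius quadruples the rows that reach the sort, which is why the radius is worth keeping as small as the use case allows.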

Budget guidance:

Latency. For typical dispatch radii (10–50 km) with $M$ in the hundreds, expect sub-50 ms p99 once the index is warm in page cache. Cold cache forces disk I/O on the range scan and spikes tail latency — warm the index after restart with a representative query.
Write amplification. Heavy ingest into LogisticsHub fragments the point index; rebuild it during a maintenance window (DROP INDEX hub_location_idx; CREATE POINT INDEX ...) rather than letting fragmentation degrade seek selectivity.
When to switch strategies. Beyond ~10M hubs, partition by geographic region so each tenant or zone seeks a smaller index. And when “nearest” must mean travel cost rather than straight-line distance, this query becomes only the candidate-selection phase: materialize the top-K hubs as a subgraph and hand them to a Dijkstra/A* traversal in a separate transaction, keeping spatial lookup isolated from pathfinding.

Related

K-Nearest Neighbor Routing — the two-phase pre-filter-then-traverse workflow this lookup feeds.
Distance Filter Query Patterns — the radius-filter primitive KNN ranks on top of.
Filtering Graph Paths by Haversine Distance in Cypher — applying spherical distance during traversal, not just selection.
Spatial Indexing Strategies — choosing and validating the index behind the seek.

This guide is part of K-Nearest Neighbor Routing, within the Cypher Spatial Queries & Pathfinding Patterns pillar.
