Enriching POI Data with Real-Time Demographics

The symptom that brings teams here is a routing graph whose pathfinding latency creeps from sub-100ms into the high hundreds the moment a live demographics feed is wired in. Foot-traffic vectors, mobility heatmaps, and census microdata arrive at sub-second intervals, and the obvious fix — a MATCH ... SET per payload — quietly blocks every concurrent Dijkstra or A* traversal on the hottest nodes in the graph. The root cause is lock contention: transit hubs, commercial intersections, and logistics waypoints carry the most edges and attract the most demographic updates, so write locks and read traversals collide on exactly the same vertices. This page resolves that by decoupling stream ingestion from graph mutation with one runnable async enricher that buffers writes by spatial partition, flushes them as version-guarded UNWIND batches, validates coordinates before any write, and bounds concurrency so the transaction manager never stalls active routing queries.

Prerequisites & Versions

Component	Min version	Install / notes
Python	3.11	`asyncio.TaskGroup`, `tuple[str, str]` typing
`neo4j` async driver	5.14	`pip install "neo4j>=5.14"`
`h3`	4.1	`pip install "h3>=4.1"`
Neo4j server	5.x	`docker run -p7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5`

This guide assumes the POI nodes being enriched already exist with a location point and a stable id. Those anchors are produced upstream by the OSM data ingestion pipelines stage and loaded with the throughput patterns in scaling async graph ingestion with Python asyncio. Demographic enrichment is a write-back layer on top of that topology: the upsert below uses MATCH rather than MERGE, so a payload whose id has no matching node is skipped instead of minting a phantom POI that corrupts routing.

Implementation

The architecture routes every demographic payload through an async consumer that buckets writes by H3 cell, so geographically co-located POIs flush together and adjacent updates serialize instead of deadlocking. Each partition fills an in-memory buffer until it crosses batch_size or a flush_interval timer fires, then commits one UNWIND sweep guarded by a monotonic version so out-of-order deliveries can never overwrite fresher data. A semaphore caps concurrent flushes to keep the transaction manager off its knees.

The upsert anchors on a uniqueness constraint over id so the MATCH is an index seek, not a label scan, and a POINT INDEX keeps the spatial property usable by downstream routing. Create both before running the worker:

CREATE CONSTRAINT poi_id_unique IF NOT EXISTS
FOR (p:POI) REQUIRE p.id IS UNIQUE;

CREATE POINT INDEX poi_location IF NOT EXISTS
FOR (p:POI) ON (p.location);

import asyncio
import json
import logging
from collections import defaultdict
from datetime import datetime, timezone

import h3
from neo4j import AsyncGraphDatabase
from neo4j.exceptions import TransientError, ServiceUnavailable

logging.basicConfig(level=logging.INFO)

# The WHERE clause is the whole concurrency story: a SET only fires when the
# incoming snapshot strictly beats the stored one, so out-of-order and
# at-least-once deliveries collapse to the newest demographic snapshot.
# rec.demographics arrives as a JSON string because Neo4j properties cannot
# hold maps; readers decode it with json.loads.
UPSERT_QUERY = """
UNWIND $batch AS rec
MATCH (p:POI {id: rec.poi_id})
WHERE coalesce(p.enrichment_version, 0) < rec.version
SET p.demographics       = rec.demographics,
    p.last_enriched      = rec.ts,
    p.enrichment_version = rec.version
RETURN count(p) AS applied
"""


def validate_wgs84(lat: float, lon: float) -> bool:
    """Reject malformed coordinates before H3 resolution: out-of-range inputs
    yield silent cell collisions or raise h3.H3FailedError."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


class DemographicEnricher:
    def __init__(
        self,
        uri: str,
        auth: tuple[str, str],
        *,
        database: str = "routing",
        h3_resolution: int = 7,
        batch_size: int = 1_000,
        flush_interval: float = 5.0,
        max_concurrency: int = 10,
        max_retries: int = 3,
    ) -> None:
        self.driver = AsyncGraphDatabase.driver(
            uri,
            auth=auth,
            max_connection_pool_size=50,
            connection_acquisition_timeout=3.0,
            # _flush runs auto-commit queries and owns the retry loop, so the
            # driver's managed-transaction retry budget is set to zero.
            max_transaction_retry_time=0,
        )
        self.database = database
        self.h3_resolution = h3_resolution
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.max_retries = max_retries
        # Pending rows keyed by H3 cell id; one list per spatial partition.
        self.buffer: dict[str, list[dict]] = defaultdict(list)
        self._semaphore = asyncio.Semaphore(max_concurrency)
        # At most one pending timer task per cell.
        self._flush_tasks: dict[str, asyncio.Task] = {}
        self.log = logging.getLogger("enricher")

    async def close(self) -> None:
        await self.driver.close()

    def _partition(self, lat: float, lon: float) -> str:
        """WGS84 -> H3 hexagon id used as a deterministic sharding key."""
        return h3.latlng_to_cell(lat, lon, self.h3_resolution)

    async def ingest(self, poi_id: str, lat: float, lon: float,
                     demographics: dict, version: int) -> None:
        if not validate_wgs84(lat, lon):
            self.log.warning("Dropping %s: invalid coordinates (%s, %s)", poi_id, lat, lon)
            return
        cell = self._partition(lat, lon)
        self.buffer[cell].append({
            "poi_id": poi_id,
            # Serialise the snapshot: a dict cannot be stored as a node property.
            "demographics": json.dumps(demographics, sort_keys=True),
            "version": version,
            "ts": datetime.now(timezone.utc).isoformat(),
        })
        if len(self.buffer[cell]) >= self.batch_size:
            await self._flush(cell)
        elif cell not in self._flush_tasks:
            self._flush_tasks[cell] = asyncio.create_task(self._delayed_flush(cell))

    async def _delayed_flush(self, cell: str) -> None:
        await asyncio.sleep(self.flush_interval)
        await self._flush(cell)

    async def _flush(self, cell: str) -> None:
        async with self._semaphore:
            batch = self.buffer.pop(cell, [])
            # Cancel a still-pending timer for this cell unless this call is that timer.
            timer = self._flush_tasks.pop(cell, None)
            if timer is not None and timer is not asyncio.current_task():
                timer.cancel()
            if not batch:
                return
            for attempt in range(1, self.max_retries + 1):
                try:
                    async with self.driver.session(database=self.database) as session:
                        result = await session.run(UPSERT_QUERY, batch=batch)
                        record = await result.single()
                    self.log.info("Cell %s: %d/%d rows applied",
                                  cell, record["applied"], len(batch))
                    return
                except (TransientError, ServiceUnavailable) as exc:
                    if attempt == self.max_retries:
                        break  # no point backing off after the final attempt
                    backoff = 0.25 * (2 ** (attempt - 1))
                    self.log.warning("Cell %s transient (%d/%d): %s; retry in %.2fs",
                                     cell, attempt, self.max_retries, exc, backoff)
                    await asyncio.sleep(backoff)
            # Requeue once exhausted and arm a timer so the rows are retried
            # even if no further payload arrives for this cell.
            self.log.error("Cell %s failed after %d attempts; requeueing", cell, self.max_retries)
            self.buffer[cell].extend(batch)
            if cell not in self._flush_tasks:
                self._flush_tasks[cell] = asyncio.create_task(self._delayed_flush(cell))


async def main() -> None:
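    # The Community image from Prerequisites serves a single user database,
    # "neo4j", so override the "routing" default when running against it.
    enricher = DemographicEnricher(
        "bolt://localhost:7687", ("neo4j", "password"), database="neo4j"
    )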
    feed = [
        ("osm:node/42", 52.5200, 13.4050, {"foot_traffic": 0.82, "median_age": 34}, 7),
        ("osm:node/91", 52.5210, 13.4061, {"foot_traffic": 0.41, "median_age": 51}, 3),
    ]
    try:
        for poi_id, lat, lon, demo, ver in feed:
            await enricher.ingest(poi_id, lat, lon, demo, ver)
        # Drain any buffers that never reached batch_size.
        await asyncio.gather(*(enricher._flush(c) for c in list(enricher.buffer)))
    finally:
        await enricher.close()


if __name__ == "__main__":
    asyncio.run(main())

How It Works

Three mechanisms carry the guarantees, and each maps to a specific line:

UNWIND batch upsert. The Cypher turns a Python list into a relational row stream, so the planner runs one transactional sweep instead of N discrete writes. That collapses N separate commits, each with its own lock-acquire/release cycle, into one, which is what keeps demographic writes from monopolising the high-degree nodes that pathfinding also needs.
Monotonic version guard (WHERE coalesce(p.enrichment_version, 0) < rec.version). This is optimistic concurrency control without explicit locks. Mobility streams are unordered and at-least-once; the guard means a stale snapshot simply does not match and becomes a no-op, while the RETURN count(p) tells you how many rows actually mutated. Routing services can read enrichment_version to detect a stale snapshot without taking a read lock.
H3 partitioning (_partition + per-cell buffers). Bucketing by hexagon serialises co-located updates onto the same flush while distinct cells run concurrently under the semaphore. This is the cheapest defence against the deadlocks that plague unpartitioned concurrent SET workloads — the same index-and-locality discipline the spatial indexing strategies layer applies to point lookups, here applied to write fan-out.
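The version guard can be checked without a database. The sketch below is a minimal in-memory stand-in, assuming deliveries are applied one at a time; apply_deliveries and the store dict are illustrative names, not part of the enricher, and the dict only imitates the MATCH and WHERE behaviour of the query.

```python
def apply_deliveries(store: dict[str, dict], deliveries: list[dict]) -> int:
    """Apply each delivery only if its version strictly beats the stored one."""
    applied = 0
    for rec in deliveries:
        node = store.get(rec["poi_id"])
        if node is None:
            continue  # mirrors MATCH: unknown ids are skipped, never created
        if node.get("enrichment_version", 0) < rec["version"]:
            node["demographics"] = rec["demographics"]
            node["enrichment_version"] = rec["version"]
            applied += 1
    return applied


store = {"osm:node/42": {}}
deliveries = [
    {"poi_id": "osm:node/42", "version": 7, "demographics": {"foot_traffic": 0.82}},
    {"poi_id": "osm:node/42", "version": 5, "demographics": {"foot_traffic": 0.10}},  # stale
    {"poi_id": "osm:node/42", "version": 7, "demographics": {"foot_traffic": 0.82}},  # duplicate
    {"poi_id": "osm:node/99", "version": 1, "demographics": {"foot_traffic": 0.50}},  # no node
]
applied = apply_deliveries(store, deliveries)
print(applied, store["osm:node/42"]["enrichment_version"])  # prints: 1 7
```

Four deliveries produce one mutation: the stale and duplicate snapshots fail the strict comparison, and the unknown id never creates a node.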

Coordinate validation runs before the partition is computed: validate_wgs84 rejects out-of-range latitude/longitude so a malformed payload can never produce a colliding cell or raise inside h3.latlng_to_cell. The location point itself is owned upstream — see building automated OSM-to-graph ETL pipelines for how id and location are first established.
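To see what the range check does and does not catch, the snippet below repeats validate_wgs84 from the implementation and runs it over a few hand-picked inputs. The sample names and values are made up for illustration.

```python
def validate_wgs84(lat: float, lon: float) -> bool:
    """Same range check as the enricher; chained comparisons are False for NaN."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


samples = {
    "valid": (52.5200, 13.4050),
    "lat_out_of_range": (152.5200, 13.4050),
    "nan_lat": (float("nan"), 13.4050),
    "inf_lon": (52.5200, float("inf")),
    # Swapped lat/lon often stays in range, so this check cannot detect it.
    "swapped_in_range": (13.4050, 52.5200),
}
results = {name: validate_wgs84(lat, lon) for name, (lat, lon) in samples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

NaN and infinite values are rejected because every comparison against them fails, but a payload with latitude and longitude swapped can still pass, so the check guards against malformed input rather than wrong input.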

Common Failure Patterns

1. Last-write-wins clobbering with a bare SET. Dropping the version guard lets an older mobility snapshot overwrite a newer one whenever it commits second — silent, because the flush still reports success. Keep the guard in the query, never in application code:

// WRONG — newest snapshot is not guaranteed to survive:
MATCH (p:POI {id: rec.poi_id}) SET p.demographics = rec.demographics
// RIGHT — only a strictly newer version mutates the node:
MATCH (p:POI {id: rec.poi_id})
WHERE coalesce(p.enrichment_version, 0) < rec.version
SET p.demographics = rec.demographics, p.enrichment_version = rec.version

2. NodeByLabelScan during flush windows. Without the uniqueness constraint the planner resolves MATCH (p:POI {id: ...}) with a full label scan that scales linearly with graph cardinality and spikes CPU exactly when a batch lands. Confirm the plan before trusting throughput — you want a NodeUniqueIndexSeek, not a scan:

EXPLAIN
UNWIND [{poi_id: "osm:node/42", demographics: {foot_traffic: 0.82}, version: 7, ts: "2026-06-26T10:00:00Z"}] AS rec
MATCH (p:POI {id: rec.poi_id})
WHERE coalesce(p.enrichment_version, 0) < rec.version
SET p.demographics = rec.demographics, p.enrichment_version = rec.version
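The plan check can also be automated. The sketch below walks a nested plan dictionary and flags a label scan; it assumes the plan is shaped like the dict the Python driver exposes on a result summary (operatorType and children keys, with operator names possibly carrying an "@database" suffix). The sample_plan here is hand-written for illustration, not captured server output, and operator_types and uses_label_scan are hypothetical helper names.

```python
def operator_types(plan: dict) -> list[str]:
    """Depth-first list of operator names found in a nested plan dict."""
    # Drop any "@database" suffix attached to the operator name.
    ops = [plan.get("operatorType", "").split("@", 1)[0]]
    for child in plan.get("children", []):
        ops.extend(operator_types(child))
    return ops


def uses_label_scan(plan: dict) -> bool:
    """True if any operator in the plan is a NodeByLabelScan."""
    return any(op.startswith("NodeByLabelScan") for op in operator_types(plan))


# Hand-written stand-in for a plan that resolves the MATCH with an index seek.
sample_plan = {
    "operatorType": "ProduceResults@neo4j",
    "children": [{
        "operatorType": "SetProperties@neo4j",
        "children": [{
            "operatorType": "Filter@neo4j",
            "children": [{"operatorType": "NodeUniqueIndexSeek@neo4j", "children": []}],
        }],
    }],
}
print(operator_types(sample_plan))
print(uses_label_scan(sample_plan))
```

Against a live server the dictionary would presumably come from the plan attribute of the summary returned by consuming the EXPLAIN result; a check like this could run in CI so a dropped constraint fails a test rather than a routing SLA.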

3. Partition skew at high H3 resolution. Resolution 8–9 cells are tiny, so rural feeds scatter into many near-empty buffers that flush on the timer instead of by size — multiplying transaction count and WAL pressure. Resolution 7 (~5 km²) keeps buffers dense enough to amortise commit overhead; only raise it where the feed is genuinely dense:

enricher = DemographicEnricher(uri, auth, h3_resolution=7, batch_size=1_000)
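A rough estimate shows when a cell flushes by size and when it falls back to the timer. The sketch assumes updates are spread uniformly over the area, which real feeds are not, and uses rounded average hexagon areas from the published H3 resolution table; the density figures are invented for illustration.

```python
# Average H3 hexagon area in km², rounded from the H3 resolution table.
H3_AVG_AREA_KM2 = {7: 5.161, 8: 0.737, 9: 0.105}


def expected_rows(density: float, resolution: int, flush_interval: float = 5.0) -> float:
    """Rows one cell buffers per timer window, for a density in updates/km²/s."""
    return density * H3_AVG_AREA_KM2[resolution] * flush_interval


def flush_trigger(density: float, resolution: int,
                  batch_size: int = 1_000, flush_interval: float = 5.0) -> str:
    """Which condition fires first for an average cell: 'size' or 'timer'."""
    rows = expected_rows(density, resolution, flush_interval)
    return "size" if rows >= batch_size else "timer"


for label, density in (("rural", 0.5), ("urban", 50.0)):
    for res in (7, 8, 9):
        rows = round(expected_rows(density, res), 1)
        print(f"{label} res={res}: ~{rows} rows/window -> {flush_trigger(density, res)}")
```

Under these assumptions only the dense feed at resolution 7 fills a 1,000-row batch before the timer; every other combination commits small timer-driven transactions, which is the skew described above.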

Performance Notes

Flush cost is dominated by how many rows actually mutate the store, not by how many payloads arrive. With at-least-once delivery the redundancy factor — duplicate or stale deliveries per logical update — sets the write-amplification budget. If $N$ logical updates arrive as $D$ deliveries, the version guard collapses committed writes toward $N$ while the planner still pays an index seek per delivery:

$$W_{\text{commit}} = N, \qquad C_{\text{seek}} = D \cdot c_{\text{idx}}, \qquad A = \frac{D}{N}$$

For a mobility feed with $A \approx 3$ (each POI re-reported three times per tick), a fraction $1 - 1/A$, or two-thirds, of seeks short-circuit at the WHERE filter without touching the property store or WAL. In-memory cost is bounded by batch_size × bytes_per_row per active cell, so with kilobyte-sized rows the 1,000-row default holds each cell's in-flight payload to a few megabytes; total buffer memory scales with the number of active cells rather than with feed volume. Watch three signals to hold routing SLAs: dbms.lock.wait.time (sustained waits >50ms mean flushes are colliding with active traversals), checkpoint latency (rising WAL volume forces ingestion throttling), and pool saturation (ConnectionAcquisitionTimeout silently drops payloads).
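The arithmetic above is small enough to script. This is a worked sketch of the same quantities; the function names and the 2,048-byte row size are assumptions chosen for illustration, not measurements.

```python
def amplification(deliveries: int, logical_updates: int) -> float:
    """A = D / N: deliveries received per logical update."""
    return deliveries / logical_updates


def short_circuit_fraction(a: float) -> float:
    """Share of seeks that stop at the version filter: (D - N) / D = 1 - 1/A."""
    return 1.0 - 1.0 / a


def buffer_bytes(batch_size: int, bytes_per_row: int, active_cells: int) -> int:
    """Upper bound on buffered payload if every active cell is full."""
    return batch_size * bytes_per_row * active_cells


a = amplification(deliveries=3_000, logical_updates=1_000)
print(f"A = {a}")
print(f"short-circuited seeks = {short_circuit_fraction(a):.3f}")
# Assumed row size of 2,048 bytes; measure the real payloads before relying on this.
print(f"one full cell = {buffer_bytes(1_000, 2_048, 1) / 1e6:.2f} MB")
print(f"200 full cells = {buffer_bytes(1_000, 2_048, 200) / 1e6:.1f} MB")
```

The per-cell bound stays near two megabytes under this row size, while the total grows linearly with the number of cells buffering at once.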

Switch strategies when $A$ climbs past ~10: debounce upstream by keeping only the highest-version payload per id before calling ingest, pushing $D$ back toward $N$. When committed writes — not seeks — are the bottleneck, profile the plan with the techniques in optimizing Cypher query plans for spatial data before adding workers; more concurrency only deepens lock queues if the write plan is already index-bound.
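One way to debounce upstream is to collapse each tick of the feed to its newest payload per id before anything reaches ingest. The sketch below assumes payloads shaped like the tuples in main(), that is (poi_id, lat, lon, demographics, version); debounce is a hypothetical helper, not part of the enricher.

```python
Payload = tuple[str, float, float, dict, int]


def debounce(payloads: list[Payload]) -> list[Payload]:
    """Keep only the highest-version payload per POI id, in first-seen id order."""
    newest: dict[str, Payload] = {}
    for payload in payloads:
        poi_id, version = payload[0], payload[4]
        kept = newest.get(poi_id)
        if kept is None or version > kept[4]:
            newest[poi_id] = payload  # replacing a value keeps the key's position
    return list(newest.values())


tick = [
    ("osm:node/42", 52.5200, 13.4050, {"foot_traffic": 0.80}, 6),
    ("osm:node/91", 52.5210, 13.4061, {"foot_traffic": 0.41}, 3),
    ("osm:node/42", 52.5200, 13.4050, {"foot_traffic": 0.82}, 7),
    ("osm:node/42", 52.5200, 13.4050, {"foot_traffic": 0.79}, 5),
    ("osm:node/91", 52.5210, 13.4061, {"foot_traffic": 0.41}, 3),
]
collapsed = debounce(tick)
print(len(tick), "->", len(collapsed))  # prints: 5 -> 2
```

Each surviving tuple can then be passed to ingest unchanged. The version guard in the query still applies, so debouncing only trims redundant seeks and never changes which snapshot wins.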

This guide is part of POI Enrichment Workflows, within the broader Spatial Graph Construction & OSM Ingestion guide.
