Syncing External Attribute Changes to Graph Nodes

The symptom that brings teams to this page is a routing graph that slowly goes wrong while the loader reports success: traffic-speed updates land out of order, a stale weight_factor overwrites a fresher one, and the next shortest-path query returns a route that no longer exists on the ground. The root cause is treating attribute sync as a blind SET — external telemetry (traffic feeds, sensor pollers, GTFS-RT vehicle positions) arrives unordered, at-least-once, and partially malformed, so the last write to commit wins regardless of which write is newest. This page resolves that with one runnable worker that applies monotonic version-guarded upserts, batches writes to bound transaction-log pressure, validates coordinate drift before mutating, and partitions work by geographic bucket so concurrent updates to adjacent road segments never deadlock.

Prerequisites & Versions

Component	Min version	Install / reason
Python	3.11	needed for `asyncio.TaskGroup` (added in 3.11) and built-in generic typing such as `tuple[str, str]`
`neo4j` async driver	5.14	`pip install "neo4j>=5.14"`
Neo4j server	5.x	`docker run -p7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5`

This guide assumes the graph already exists: the RoutingNode records being updated here are produced upstream by the OSM data ingestion pipelines stage, and each node already carries a location point and an external_id that the source feed references. If you are still loading the base topology, start with scaling async graph ingestion with Python asyncio first — attribute sync is a write-back layer on top of that graph, not a replacement for it.

Implementation

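The schema below anchors the update on a uniqueness constraint over external_id (so the MATCH is an index seek, not a label scan) and tracks an attr_version per node. The query only updates nodes that already exist; it never creates them. The AttributeSyncWorker class then validates incoming payloads, buckets them by a coarse rounded lat/lon grid cell (not a true geohash) to serialize neighbours, chunks each bucket, and applies a version-guarded SET in one auto-commit transaction per chunk (session.run), with retries handled by the worker itself. The whole module is self-contained and runnable against a local Neo4j instance.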

CREATE CONSTRAINT routing_node_extid IF NOT EXISTS
FOR (n:RoutingNode) REQUIRE n.external_id IS UNIQUE;

CREATE POINT INDEX routing_node_location IF NOT EXISTS
FOR (n:RoutingNode) ON (n.location);

import asyncio
import logging
import math
import time
from itertools import islice
from typing import Any, Iterable, Iterator

from neo4j import AsyncGraphDatabase
from neo4j.exceptions import TransientError, ServiceUnavailable

logging.basicConfig(level=logging.INFO)

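# Version-guarded update: the WHERE clause is the core of the concurrency story.
# A SET only fires when the incoming version strictly beats the stored one,
# so out-of-order and duplicate deliveries collapse to the newest write.
# coalesce() treats a node that has never been synced (attr_version is null)
# as version -1; this assumes feed versions are non-negative integers.
# Optional fields fall back to the stored value so a partial payload cannot
# null out (and thereby remove) an existing property.
SYNC_QUERY = """
UNWIND $batch AS u
MATCH (n:RoutingNode {external_id: u.external_id})
USING INDEX n:RoutingNode(external_id)
WHERE coalesce(n.attr_version, -1) < u.version
SET n.status        = coalesce(u.status, n.status),
    n.weight_factor = coalesce(u.weight_factor, n.weight_factor),
    n.attr_version  = u.version,
    n.last_synced   = u.timestamp
RETURN n.external_id AS synced_id, n.attr_version AS new_version
"""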


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres (WGS84 sphere)."""
    r = 6_371_000.0
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


def geo_bucket(lat: float, lon: float, precision: int = 2) -> str:
    """Coarse spatial key: payloads sharing a bucket touch adjacent nodes,
    so we serialise them onto the same worker to avoid lock contention."""
    return f"{round(lat, precision)}:{round(lon, precision)}"


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


class AttributeSyncWorker:
    REQUIRED = ("external_id", "version", "timestamp", "lat", "lon", "base_lat", "base_lon")

    def __init__(
        self,
        uri: str,
        auth: tuple[str, str],
        *,
        database: str = "routing",
        batch_size: int = 2_500,
        max_drift_m: float = 500.0,
        max_retries: int = 3,
    ) -> None:
        self.driver = AsyncGraphDatabase.driver(
            uri,
            auth=auth,
            max_connection_pool_size=40,
            connection_acquisition_timeout=4.0,
            # This setting only governs managed transactions (execute_read /
            # execute_write). The worker uses auto-commit session.run, which the
            # driver never retries, so _apply_chunk owns all retry logic.
            max_transaction_retry_time=0,
        )
        self.database = database
        self.batch_size = batch_size
        self.max_drift_m = max_drift_m
        self.max_retries = max_retries
        self.log = logging.getLogger("attr_sync")

    async def close(self) -> None:
        await self.driver.close()

    def _accept(self, p: dict[str, Any]) -> bool:
        """Reject at the boundary: missing keys or coordinate drift beyond
        budget means the payload is stale or misaligned — never let it write."""
        if any(k not in p or p[k] is None for k in self.REQUIRED):
            self.log.warning("Dropping payload missing required keys: %s", p.get("external_id"))
            return False
        drift = haversine_m(p["lat"], p["lon"], p["base_lat"], p["base_lon"])
        if drift > self.max_drift_m:
            self.log.warning("Dropping %s: %.0fm drift exceeds budget", p["external_id"], drift)
            return False
        return True

    async def _apply_chunk(self, session, chunk: list[dict]) -> int:
        for attempt in range(1, self.max_retries + 1):
            start = time.perf_counter()
            try:
                result = await session.run(SYNC_QUERY, batch=chunk)
                synced = [r["synced_id"] async for r in result]
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > 50:
                    self.log.warning(
                        "Slow chunk: %.0fms for %d rows — check lock waits or index health.",
                        elapsed_ms, len(chunk),
                    )
                return len(synced)
            except (TransientError, ServiceUnavailable) as exc:
                backoff = 0.25 * (2 ** (attempt - 1))
                self.log.warning("Transient (%d/%d): %s", attempt, self.max_retries, exc)
                # Replaying the same chunk is safe because the guard is idempotent.
                # Skip the sleep after the final attempt: nothing is left to retry.
                if attempt < self.max_retries:
                    await asyncio.sleep(backoff)
        self.log.error("Chunk permanently failed after %d attempts", self.max_retries)
        return 0

    async def sync(self, payloads: Iterable[dict[str, Any]]) -> int:
        # Bucket first so neighbouring nodes serialise; distinct buckets run
        # in parallel. Key on the node's stable base position, not the reported
        # one, so every payload for a given node lands in the same bucket.
        buckets: dict[str, list[dict]] = {}
        for p in payloads:
            if self._accept(p):
                buckets.setdefault(geo_bucket(p["base_lat"], p["base_lon"]), []).append(p)

        total = 0
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(self._sync_bucket(rows)) for rows in buckets.values()]
        for t in tasks:
            total += t.result()
        self.log.info("Sync complete: %d nodes updated across %d buckets", total, len(buckets))
        return total

    async def _sync_bucket(self, rows: list[dict]) -> int:
        applied = 0
        async with self.driver.session(database=self.database) as session:
            for chunk in chunked(rows, self.batch_size):
                applied += await self._apply_chunk(session, chunk)
        return applied


async def main() -> None:
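    # The stock neo4j:5 image from the prerequisites only ships the default
    # "neo4j" database; drop this override once a "routing" database exists.
    worker = AttributeSyncWorker(
        "bolt://localhost:7687", ("neo4j", "password"), database="neo4j"
    )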
    feed = [
        {"external_id": "osm:node/42", "version": 17, "timestamp": "2026-06-26T09:00:00Z",
         "status": "congested", "weight_factor": 2.4,
         "lat": 52.5200, "lon": 13.4050, "base_lat": 52.5200, "base_lon": 13.4050},
    ]
    try:
        await worker.sync(feed)
    finally:
        await worker.close()


if __name__ == "__main__":
    asyncio.run(main())

How It Works

Three mechanisms carry the correctness guarantees, and each maps to a specific line:

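Monotonic version guard (WHERE n.attr_version < u.version). This single predicate gives optimistic concurrency control with no application-level locking; Neo4j still takes a write lock on each node it actually mutates. When an older payload reaches a node that already holds a newer version, the predicate filters the row out and the SET is a no-op. The operation is also idempotent, so replaying an at-least-once delivery is harmless: re-applying version 17 over a stored version 17 changes nothing. One caveat: under Neo4j's default read-committed isolation, a read-then-write guard is not guaranteed to be atomic across concurrent transactions, so the guard should be paired with the bucketing described below, which keeps writes to any one node serial.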
Index-seek upsert (USING INDEX n:RoutingNode(external_id)). The hint pins the planner to the uniqueness-backed index even when skewed batch cardinality would otherwise tempt it into a full label scan. Run EXPLAIN on SYNC_QUERY and confirm a NodeUniqueIndexSeek precedes the Filter; if you see NodeByLabelScan, the constraint is missing or stale. This is the same index-discipline the spatial indexing strategies layer applies to point lookups, applied here to attribute lookups.
Geographic bucketing (geo_bucket + per-bucket sessions). Updates that touch adjacent road segments — a congested corridor, a closed junction — land in the same coarse bucket and run on one session, so they serialize naturally. Distinct buckets run concurrently under TaskGroup and never contend for the same node lock. This is the cheapest way to avoid the deadlocks that plague unpartitioned concurrent SET workloads.
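The EXPLAIN check in the second mechanism can be automated. The following is a minimal sketch that walks a plan dictionary and reports whether the unique-index seek is present. The dictionary shape (an "operatorType" string plus a "children" list) is an assumption about what the driver exposes as the plan on a result summary after an EXPLAIN query, so verify it against your driver version; plan_operators, uses_index_seek, and sample_plan are hypothetical names, and sample_plan is hand-written rather than real EXPLAIN output.

```python
from typing import Any, Iterator


def plan_operators(plan: dict[str, Any]) -> Iterator[str]:
    """Yield operator names depth-first from a nested plan dictionary.

    Assumed shape: each node has an "operatorType" string and an optional
    "children" list. Any "@database" suffix on the name is stripped.
    """
    yield str(plan.get("operatorType", "")).split("@")[0]
    for child in plan.get("children", []):
        yield from plan_operators(child)


def uses_index_seek(plan: dict[str, Any]) -> bool:
    """True when the plan seeks the unique index and never scans by label."""
    ops = list(plan_operators(plan))
    has_seek = any(op.startswith("NodeUniqueIndexSeek") for op in ops)
    has_scan = any(op in ("NodeByLabelScan", "AllNodesScan") for op in ops)
    return has_seek and not has_scan


# Hand-written stand-in for a plan, not real EXPLAIN output.
sample_plan = {
    "operatorType": "ProduceResults@neo4j",
    "children": [{
        "operatorType": "SetProperties@neo4j",
        "children": [{
            "operatorType": "Filter@neo4j",
            "children": [{"operatorType": "NodeUniqueIndexSeek@neo4j", "children": []}],
        }],
    }],
}

print(uses_index_seek(sample_plan))
```

A check like this could run once at worker start-up, so a dropped constraint is caught before the first batch rather than discovered as slow chunks.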

Coordinate validation runs before any write: _accept drops payloads whose reported position has drifted more than the budget from the node’s known location, which is how stale or misattributed telemetry is kept out of the routing graph entirely. The location itself is owned by the upstream mapping layer — see how to map road networks to graph nodes and edges for how external_id and location are first established.
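To make the drift budget concrete, here is a small standalone sketch that repeats the haversine formula from the module and applies a 500 m budget to two reports for the same node. The coordinates and the MAX_DRIFT_M name are illustrative assumptions, not values from a real feed.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a 6,371 km sphere."""
    r = 6_371_000.0
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


MAX_DRIFT_M = 500.0          # same default as the worker's max_drift_m
base = (52.5200, 13.4050)    # node's known location
reports = {
    "near": (52.5210, 13.4050),  # 0.001 degrees north, roughly 111 m
    "far": (52.5300, 13.4050),   # 0.010 degrees north, roughly 1.1 km
}

verdicts = {}
for label, (lat, lon) in reports.items():
    drift = haversine_m(lat, lon, *base)
    verdicts[label] = "accept" if drift <= MAX_DRIFT_M else "drop"
    print(f"{label}: {drift:.0f} m -> {verdicts[label]}")
```

One degree of latitude is about 111 km, so a 500 m budget corresponds to roughly 0.0045 degrees of north-south movement; east-west tolerance in degrees widens with latitude.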

Common Failure Patterns

1. Last-write-wins clobbering with a bare SET. Dropping the version guard lets an older payload overwrite a newer one whenever it happens to commit second — the defect is silent because the loader still reports success. The fix is to make the guard unconditional in the query, never in application code:

// WRONG — newest write is not guaranteed to survive:
MATCH (n:RoutingNode {external_id: u.external_id}) SET n.weight_factor = u.weight_factor
// RIGHT — only a strictly newer version mutates the node:
MATCH (n:RoutingNode {external_id: u.external_id})
WHERE coalesce(n.attr_version, -1) < u.version  // coalesce lets a never-synced node accept its first update
SET n.weight_factor = u.weight_factor, n.attr_version = u.version
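The difference between the two queries can be reproduced without a database. This is a sketch only: a plain dictionary stands in for the node store, and apply_bare and apply_guarded are hypothetical helpers that mimic the bare SET and the guarded SET respectively. The delivery sequence is invented to include a late arrival and duplicates.

```python
def apply_bare(store: dict, update: dict) -> None:
    """Blind SET: whatever arrives last wins."""
    store[update["external_id"]] = {
        "weight_factor": update["weight_factor"],
        "attr_version": update["version"],
    }


def apply_guarded(store: dict, update: dict) -> bool:
    """Version-guarded SET: mutate only on a strictly newer version."""
    node = store[update["external_id"]]
    if node["attr_version"] < update["version"]:
        node["weight_factor"] = update["weight_factor"]
        node["attr_version"] = update["version"]
        return True
    return False


deliveries = [
    {"external_id": "osm:node/42", "version": 18, "weight_factor": 1.1},
    {"external_id": "osm:node/42", "version": 17, "weight_factor": 2.4},  # late arrival
    {"external_id": "osm:node/42", "version": 18, "weight_factor": 1.1},  # duplicate
    {"external_id": "osm:node/42", "version": 17, "weight_factor": 2.4},  # late duplicate
]

bare = {"osm:node/42": {"weight_factor": 1.0, "attr_version": 16}}
guarded = {"osm:node/42": {"weight_factor": 1.0, "attr_version": 16}}

for d in deliveries:
    apply_bare(bare, d)
applied = [apply_guarded(guarded, d) for d in deliveries]

print("bare:   ", bare["osm:node/42"])
print("guarded:", guarded["osm:node/42"])
print("applied:", applied)
```

The bare store ends on the stale version 17 because that delivery happened to arrive last, while the guarded store keeps version 18 and records only one real write out of four deliveries.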

2. Lock contention from unpartitioned concurrent updates. Fanning every chunk out to its own task without bucketing makes two batches race for the same hot intersection node, surfacing as a TransientError with a DeadlockDetected status and lock waits that push chunk latency past the worker's 50 ms slow-chunk threshold. Bucketing by geo_bucket removes the contention; if a single bucket is still hot, shrink the chunk so each transaction holds its locks for less time:

worker = AttributeSyncWorker(uri, auth, batch_size=1_000)  # smaller chunks, shorter lock hold
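The no-contention argument relies on every payload for a given node mapping to the same bucket, so it is worth checking which coordinates feed geo_bucket. The sketch below, using invented coordinates, shows two reports for one node that sit well inside the drift budget yet straddle a grid-cell boundary; keying on the node's stable base position avoids the split.

```python
def geo_bucket(lat: float, lon: float, precision: int = 2) -> str:
    """Same coarse grid key as the worker: coordinates rounded to 2 decimals."""
    return f"{round(lat, precision)}:{round(lon, precision)}"


# One node, two telemetry reports a few tens of metres apart (illustrative values).
base = (52.5249, 13.4020)
report_a = {"lat": 52.5248, "lon": 13.4020, "base_lat": base[0], "base_lon": base[1]}
report_b = {"lat": 52.5252, "lon": 13.4020, "base_lat": base[0], "base_lon": base[1]}

by_reported = {geo_bucket(r["lat"], r["lon"]) for r in (report_a, report_b)}
by_base = {geo_bucket(r["base_lat"], r["base_lon"]) for r in (report_a, report_b)}

print("buckets when keyed on reported position:", len(by_reported))
print("buckets when keyed on base position:    ", len(by_base))
```

With the reported position as the key, the same node appears in two buckets that run concurrently, which reintroduces the race the bucketing was meant to remove.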

3. Transaction-log pressure from unbounded batches. A single multi-megabyte UNWIND forces WAL disk spills and checkpoint stalls that look like random latency cliffs. Keeping chunks near 2,500 rows holds each transaction’s log footprint well under the checkpoint threshold; the chunked generator enforces this regardless of feed size. Pair it with a strict boundary validator so a malformed batch never enters a transaction at all.
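As a quick illustration of how the generator bounds transaction size, this sketch runs the same chunked helper over a synthetic 6,000-row feed. The feed contents are made up; only the chunk sizes matter here.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# A lazy synthetic feed: nothing is materialised beyond the current chunk.
feed = ({"external_id": f"osm:node/{i}", "version": 1} for i in range(6_000))
sizes = [len(chunk) for chunk in chunked(feed, 2_500)]
print(sizes)
```

Because the helper consumes a plain iterator, no single UNWIND ever receives more than batch_size rows, however large the source feed is.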

Performance Notes

Sync cost is dominated by how many writes actually mutate the store, not by how many payloads arrive. With at-least-once delivery, the redundancy factor — duplicate or stale deliveries per logical update — sets the write-amplification budget. If $N$ logical updates arrive as $D$ deliveries, the version guard collapses them so that committed writes track $N$ while the planner still pays an index seek per delivery:

$$W_{\text{commit}} = N, \qquad C_{\text{seek}} = D \cdot c_{\text{idx}}, \qquad A = \frac{D}{N}$$

For a traffic feed with $A \approx 3$ (each segment re-reported three times before the next tick), two-thirds of seeks are no-ops. They are cheap because they stop at the WHERE filter after a single property read, without writing to the property store or the transaction log. That asymmetry is exactly why the guard belongs in Cypher: the database discards stale work before it becomes a write. Per-transaction memory stays flat at roughly batch_size × bytes_per_row per active bucket, so the 2,500-row default caps each in-flight chunk near a few megabytes regardless of total feed volume. Note that sync() itself materializes the accepted feed into buckets before writing, so client-side memory still grows with the size of the iterable you pass in.
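The cost model above is simple enough to evaluate directly. This is a sketch under the same simplifying assumption as the formula, namely that exactly $N$ deliveries commit and the remaining $D - N$ are filtered out; sync_cost is a hypothetical helper name.

```python
def sync_cost(logical_updates: int, deliveries: int) -> dict[str, float]:
    """Evaluate A = D / N and the share of deliveries that become no-op seeks."""
    if logical_updates <= 0 or deliveries < logical_updates:
        raise ValueError("need deliveries >= logical_updates > 0")
    noop_seeks = deliveries - logical_updates
    return {
        "amplification": deliveries / logical_updates,   # A = D / N
        "committed_writes": float(logical_updates),      # W_commit = N
        "noop_seeks": float(noop_seeks),                 # D - N
        "noop_fraction": noop_seeks / deliveries,        # 1 - 1/A
    }


cost = sync_cost(logical_updates=10_000, deliveries=30_000)
for name, value in cost.items():
    print(f"{name}: {value:.3f}")
```

The no-op fraction equals $1 - 1/A$, which is where the two-thirds figure for $A \approx 3$ comes from.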

Switch strategies when $A$ climbs past ~10 (a very chatty feed): debounce upstream by keeping only the highest-version payload per external_id before calling sync, turning $D$ back toward $N$. When committed writes themselves are the bottleneck rather than seeks, profile the plan with the techniques in optimizing Cypher query plans for spatial data before adding workers — more concurrency only deepens lock queues if the write plan is already index-bound.
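The upstream debounce can be sketched in a few lines. The debounce function below is a hypothetical helper, not part of the worker: it keeps only the highest-version payload per external_id, preserving the first-seen order of ids.

```python
from typing import Any, Iterable


def debounce(payloads: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the highest-version payload per external_id."""
    latest: dict[str, dict[str, Any]] = {}
    for p in payloads:
        key = p["external_id"]
        # Strict comparison: on a version tie the first delivery is kept.
        if key not in latest or p["version"] > latest[key]["version"]:
            latest[key] = p
    return list(latest.values())


raw = [
    {"external_id": "osm:node/42", "version": 17, "weight_factor": 2.4},
    {"external_id": "osm:node/7", "version": 3, "weight_factor": 1.0},
    {"external_id": "osm:node/42", "version": 19, "weight_factor": 1.3},
    {"external_id": "osm:node/42", "version": 18, "weight_factor": 1.9},
    {"external_id": "osm:node/7", "version": 3, "weight_factor": 1.0},
]

compact = debounce(raw)
print(len(raw), "deliveries ->", len(compact), "payloads")
```

One design point to weigh: if debouncing runs before validation, a newest payload that later fails the drift check also discards the older valid ones for that node, so it may be safer to debounce only payloads that have already passed the boundary checks.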

This guide is part of Attribute Synchronization Techniques, within the broader Spatial Graph Construction & OSM Ingestion guide.
