Scaling Async Graph Ingestion with Python Asyncio

A continental OpenStreetMap extract contains tens of millions of directed edges, and a single urban intersection can spawn dozens of relationships carrying turn restrictions, speed classes, and lane geometry. The symptom that brings teams to this page is usually the same: a synchronous loader that ran fine on a city extract throws Neo4jError: ConnectionAcquisitionTimeout (or simply pins one CPU core at 100% while the database idles) the moment it hits a country-sized file. The root cause is that the load is bound by network round-trip latency rather than CPU, and the code waits on each write serially. This page resolves that with a single async ingestor class that uses asyncio concurrency, semaphore backpressure, batched UNWIND, and chunk-level telemetry to push country-scale graphs into Neo4j without exhausting the connection pool or the heap.

Prerequisites & Versions

Component	Min version	Install / notes
Python	3.11	needed for `asyncio.TaskGroup`; the code also uses `tuple[str, str]` typing
`neo4j` async driver	5.14	`pip install "neo4j>=5.14"`
Neo4j server	5.x	`docker run -p7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5`
`osmium` (intermediate extract)	3.6	`pip install osmium`

This guide assumes you already have parsed OSM ways flattened into edge records — the upstream OSM data ingestion pipelines layer is responsible for turning raw .osm.pbf ways into the source/target/highway rows consumed below. If you have not built that stage yet, start with building automated OSM-to-graph ETL pipelines first.

Implementation

The complete ingestor below is self-contained and runnable. It creates the spatial schema, streams edge records from an async generator, dispatches bounded batches concurrently, retries transient failures, and reports chunk latency. Replace stream_edges_from_file with the output of your own parser; the newline-delimited JSON shape ({"source", "target", "source_lon", ...}) matches what a flattened OSM way export produces.

import asyncio
import json
import logging
import time
from pathlib import Path
from typing import Any, AsyncGenerator

from neo4j import AsyncGraphDatabase
from neo4j.exceptions import TransientError, ServiceUnavailable

logging.basicConfig(level=logging.INFO)


class AsyncGraphIngestor:
    def __init__(
        self,
        uri: str,
        auth: tuple[str, str],
        max_concurrency: int = 32,
        batch_size: int = 5_000,
        max_retries: int = 3,
    ) -> None:
        # Pool sized to 2x the semaphore absorbs connection lifecycle churn
        # (rollbacks, keep-alives) without ever starving an active worker.
        self.driver = AsyncGraphDatabase.driver(
            uri,
            auth=auth,
            max_connection_pool_size=max_concurrency * 2,
            connection_acquisition_timeout=30.0,
        )
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.log = logging.getLogger("spatial_ingest")

    async def close(self) -> None:
        await self.driver.close()

    async def ensure_schema(self) -> None:
        """Idempotent: a uniqueness constraint anchors MERGE, the point
        index turns later distance filters into bounding-box seeks."""
        async with self.driver.session() as session:
            await session.run(
                "CREATE CONSTRAINT node_id IF NOT EXISTS "
                "FOR (n:Node) REQUIRE n.id IS UNIQUE"
            )
            await session.run(
                "CREATE POINT INDEX node_location IF NOT EXISTS "
                "FOR (n:Node) ON (n.location)"
            )

    @staticmethod
    def _ingest_query() -> str:
        return """
        UNWIND $batch AS edge
        MERGE (u:Node {id: edge.source})
          ON CREATE SET u.location =
            point({longitude: edge.source_lon, latitude: edge.source_lat})
        MERGE (v:Node {id: edge.target})
          ON CREATE SET v.location =
            point({longitude: edge.target_lon, latitude: edge.target_lat})
        MERGE (u)-[r:CONNECTED_TO {type: edge.highway}]->(v)
          SET r.length_meters = edge.length,
              r.speed_kph = edge.speed,
              r.bearing = edge.bearing
        """

    async def _execute_chunk(self, batch: list[dict[str, Any]]) -> int:
        query = self._ingest_query()
        async with self.semaphore:  # backpressure: cap in-flight transactions
            for attempt in range(1, self.max_retries + 1):
                start = time.perf_counter()
                try:
                    async with self.driver.session() as session:
                        result = await session.run(query, batch=batch)
                        await result.consume()
                    elapsed_ms = (time.perf_counter() - start) * 1000
                    if elapsed_ms > 200:
                        self.log.warning(
                            "Slow chunk: %.0fms for %d edges — check index "
                            "health or hot-node lock contention.",
                            elapsed_ms, len(batch),
                        )
                    return len(batch)
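                except (TransientError, ServiceUnavailable) as exc:
                    if attempt == self.max_retries:
                        # Out of attempts: log the last error and skip the
                        # final sleep, since no retry would follow it.
                        self.log.error(
                            "Chunk permanently failed after %d attempts: %s",
                            self.max_retries, exc,
                        )
                        return 0
                    backoff = 0.25 * (2 ** (attempt - 1))
                    self.log.warning(
                        "Transient failure (attempt %d/%d): %s; retry in %.2fs",
                        attempt, self.max_retries, exc, backoff,
                    )
                    await asyncio.sleep(backoff)
            return 0  # reached only when max_retries < 1, so nothing was attempted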

    async def ingest(self, edges: AsyncGenerator[dict[str, Any], None]) -> int:
        await self.ensure_schema()
        chunk: list[dict[str, Any]] = []
        tasks: list[asyncio.Task[int]] = []
        async with asyncio.TaskGroup() as tg:
            async for edge in edges:
                chunk.append(edge)
                if len(chunk) >= self.batch_size:
                    batch, chunk = chunk, []
                    tasks.append(tg.create_task(self._execute_chunk(batch)))
            if chunk:
                tasks.append(tg.create_task(self._execute_chunk(chunk)))
        # TaskGroup has awaited every task by the time the block exits,
        # so each result() call below returns immediately.
        ingested = sum(task.result() for task in tasks)
        self.log.info("Ingestion complete: %d edges written.", ingested)
        return ingested


async def stream_edges_from_file(path: str) -> AsyncGenerator[dict[str, Any], None]:
    """Yields one parsed edge dict per line so resident memory stays flat
    regardless of total file size."""
    loop = asyncio.get_running_loop()
    with Path(path).open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                # Offload CPU-bound JSON parse from the event loop thread.
                yield await loop.run_in_executor(None, json.loads, line)


async def main() -> None:
    ingestor = AsyncGraphIngestor(
        uri="bolt://localhost:7687",
        auth=("neo4j", "password"),
        max_concurrency=32,
        batch_size=5_000,
    )
    try:
        await ingestor.ingest(stream_edges_from_file("edges.ndjson"))
    finally:
        await ingestor.close()


if __name__ == "__main__":
    asyncio.run(main())
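
To try the ingestor before a real parser is in place, a small input file is enough. The sketch below writes two records with the field names that the Cypher query reads from each edge. The helper name write_sample_edges, the node ids, and all coordinate and attribute values are made up for illustration; only the key names come from the query above.

```python
import json
from pathlib import Path

# Two hand-written edge records. Key names mirror what the ingest query
# reads from each edge; every value here is invented sample data.
SAMPLE_EDGES = [
    {
        "source": 1001, "target": 1002,
        "source_lon": 13.4050, "source_lat": 52.5200,
        "target_lon": 13.4061, "target_lat": 52.5203,
        "highway": "residential",
        "length": 81.0, "speed": 30, "bearing": 66.0,
    },
    {
        "source": 1002, "target": 1003,
        "source_lon": 13.4061, "source_lat": 52.5203,
        "target_lon": 13.4075, "target_lat": 52.5199,
        "highway": "residential",
        "length": 104.0, "speed": 30, "bearing": 115.0,
    },
]


def write_sample_edges(path: str = "edges.ndjson") -> int:
    """Write one JSON object per line and return the number of records."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for edge in SAMPLE_EDGES:
            fh.write(json.dumps(edge) + "\n")
    return len(SAMPLE_EDGES)
```

Calling write_sample_edges() in the working directory produces the edges.ndjson file that main() expects.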

How It Works

Three mechanisms in the code do the heavy lifting, and each maps to a specific line:

Semaphore backpressure (async with self.semaphore). TaskGroup still creates one task per batch as soon as the batch is read, so without the semaphore every one of those tasks would open a session at once: the thundering herd that triggers ConnectionAcquisitionTimeout. The semaphore caps concurrently executing transactions at max_concurrency, while tasks beyond that limit wait without holding a connection. This is why max_connection_pool_size is set to 2 * max_concurrency: the pool always has headroom for connection churn underneath the active workers.
Batched UNWIND (UNWIND $batch AS edge). Each network round trip carries 5,000 edges instead of one. The MERGE on a unique id (backed by the node_id constraint) is an index seek, not a scan, so write cost stays roughly linear in edge count rather than quadratic.
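Streaming source (stream_edges_from_file). The generator yields one dict at a time and ingest flushes the chunk buffer every batch_size rows, so the extract is never loaded whole. Peak resident memory is batch_size × the number of chunks that have been read but not yet written. That count includes tasks still waiting on the semaphore, each of which holds its batch, so memory stays flat only while the database keeps pace with the reader. The run_in_executor call moves json.loads to a worker thread so that parsing does not run on the event loop thread; the line-by-line file read itself still does.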

Coordinates follow Neo4j’s point({longitude, latitude}) convention (WGS84 / EPSG:4326). Populating location during ingest is what lets the spatial indexing strategies layer answer later distance queries with bounding-box seeks instead of full label scans. For how these Node/CONNECTED_TO records map back to real road geometry, see how to map road networks to graph nodes and edges.

Common Failure Patterns

1. Connection pool exhaustion under unbounded fan-out. Dropping the semaphore (or sizing the pool below the concurrency limit) is the most common cause of ConnectionAcquisitionTimeout. The invariant to preserve:

# The pool must be at least as large as the semaphore limit, never smaller.
max_concurrency = 32
max_connection_pool_size = max_concurrency * 2  # as set in AsyncGraphIngestor.__init__
assert max_connection_pool_size >= max_concurrency  # 64 >= 32

2. Deadlocks from concurrent MERGE on shared hot nodes. Two batches that both MERGE the same high-degree intersection node race for the same lock and surface as TransientError: DeadlockDetected. The fix is already wired in: catch TransientError, back off exponentially, and retry — the operation is idempotent because MERGE is. Do not retry on generic Exception, which would mask schema or syntax bugs:

except (TransientError, ServiceUnavailable) as exc:
    await asyncio.sleep(0.25 * (2 ** (attempt - 1)))

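3. Swallowing failures with return_exceptions=True. A bare asyncio.gather(*tasks, return_exceptions=True) reports “success” while half the batches silently failed. asyncio.TaskGroup instead propagates the first unhandled error and cancels the sibling tasks, so a corrupt batch surfaces immediately instead of after a four-hour run. Note that _execute_chunk deliberately returns 0 for a chunk that exhausts its retries, so transient failures that outlast the retries appear only in the error log. Validate edge records before enqueueing: a null source, target, or highway makes MERGE reject the whole chunk, and a null coordinate makes point() return null, which leaves the node without a location.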
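
Validation before enqueueing can be as small as one function applied to each record as it comes off the stream. The sketch below is an assumption about what such a check might look like: edge_problems and its helper are hypothetical names, the key names are taken from the ingest query, and the range limits are the usual WGS84 bounds.

```python
import math
from typing import Any

# Keys used inside MERGE patterns; a null here fails the whole chunk.
REQUIRED_KEYS = ("source", "target", "highway")
# (key, lower bound, upper bound) for each coordinate in WGS84 degrees.
COORDINATE_BOUNDS = (
    ("source_lon", -180.0, 180.0),
    ("source_lat", -90.0, 90.0),
    ("target_lon", -180.0, 180.0),
    ("target_lat", -90.0, 90.0),
)


def _is_finite_number(value: Any) -> bool:
    """True for finite ints and floats, excluding bools."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def edge_problems(edge: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the edge is usable."""
    problems: list[str] = []
    for key in REQUIRED_KEYS:
        if edge.get(key) is None:
            problems.append(f"missing {key}")
    for key, low, high in COORDINATE_BOUNDS:
        value = edge.get(key)
        if not _is_finite_number(value) or not low <= value <= high:
            problems.append(f"bad {key}: {value!r}")
    return problems
```

A caller could skip and log any record for which edge_problems returns a non-empty list, so one bad row costs a single edge rather than a batch of 5,000.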

Performance Notes

Throughput is governed by Little’s Law, not by raw loop speed. With mean per-batch latency $L$ seconds and concurrency $C$, sustained batch throughput is:

$$\text{batches/sec} = \frac{C}{L}, \qquad T_{\text{total}} \approx \frac{N}{B} \cdot \frac{L}{C}$$

where $N$ is total edges and $B$ is batch_size. For $N = 40{,}000{,}000$ edges, $B = 5{,}000$, $L = 0.12$ s, and $C = 32$, that predicts $T_{\text{total}} \approx 30$ s of database time — so if a run takes ten minutes, the limiter is $L$ (index fragmentation or lock contention), not insufficient concurrency. Raising $C$ past the point where $L$ starts climbing only deepens lock queues.

Memory budget: the in-flight payload is roughly $B \times C \times s$, where $s$ is the serialized size of one edge (~250 bytes), so the defaults put about 40 MB in flight at once regardless of extract size. Treat that as a lower bound on heap use: the parsed Python dicts are larger than their serialized form, and chunks queued behind the semaphore count as well. Switch from this batched strategy to a server-side apoc.periodic.iterate or LOAD CSV import only for cold bulk loads where no incremental attribute synchronization is needed; the async path here wins whenever you ingest continuously alongside live reads. When latency stays high despite a healthy pool, profile the write plan using the techniques in optimizing Cypher query plans for spatial data.
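
The throughput and payload estimates in this section are plain arithmetic and can be reproduced in a few lines when tuning batch_size or max_concurrency. The function names below are hypothetical; the formulas are the ones given above, with the batch count rounded up to cover a final partial batch.

```python
import math


def estimate_total_seconds(
    n_edges: int, batch_size: int, batch_latency_s: float, concurrency: int
) -> float:
    """Little's Law estimate: (N / B) batches, each taking L seconds, C at a time."""
    batches = math.ceil(n_edges / batch_size)
    return batches * batch_latency_s / concurrency


def in_flight_payload_bytes(
    batch_size: int, concurrency: int, bytes_per_edge: int
) -> int:
    """Serialized payload held by the batches that are executing at once."""
    return batch_size * concurrency * bytes_per_edge


if __name__ == "__main__":
    seconds = estimate_total_seconds(40_000_000, 5_000, 0.12, 32)
    payload = in_flight_payload_bytes(5_000, 32, 250)
    print(f"estimated database time: {seconds:.0f} s")
    print(f"in-flight payload: {payload / 1_000_000:.0f} MB")
```

Plugging in a measured mean chunk latency for batch_latency_s shows quickly whether a slow run is explained by latency or by too little concurrency.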

This guide is part of Async Batch Processing for Graphs, within the broader Spatial Graph Construction & OSM Ingestion guide.
