Building Automated OSM to Graph ETL Pipelines

Routing solvers collapse the moment raw OpenStreetMap data reaches production with unresolved topological fragmentation: shortest-path queries return null traversals, A* heuristics loop on phantom self-edges, and dispatch latency spikes as the planner scans disconnected components. The root cause is almost always geometric debt baked in at load time — overlapping ways, floating nodes, and micro-duplications that fracture the adjacency model. This guide resolves that with a deterministic, idempotent ETL layer that snaps coordinates in metric space, deduplicates edges by an order-independent hash, and upserts the result through the async Neo4j driver so a re-run never corrupts an existing graph.

Prerequisites & Versions

This pipeline targets Python 3.11+ and Neo4j 5.x. The transformation stays in columnar memory (Arrow/NumPy) and never touches pandas, so it scales to regional extracts on a single worker.

Library	Min version	Install command
`neo4j` (async driver)	5.14	`pip install "neo4j>=5.14"`
`pyarrow`	14.0	`pip install "pyarrow>=14.0"`
`numpy`	1.26	`pip install "numpy>=1.26"`
`scipy`	1.11	`pip install "scipy>=1.11"`
`pyproj`	3.6	`pip install "pyproj>=3.6"`

Before loading, create the supporting constraint and index so each MERGE resolves through an index lookup rather than scanning every Node:

CREATE CONSTRAINT node_id_unique IF NOT EXISTS
FOR (n:Node) REQUIRE n.id IS UNIQUE;

CREATE INDEX edge_hash_idx IF NOT EXISTS
FOR ()-[r:CONNECTS]-() ON (r.hash);

The uniqueness constraint also provisions a backing index, which is what makes the repeated MERGE (s:Node {id: ...}) calls O(log N) instead of O(N). Mapping raw ways onto this Node/CONNECTS shape follows the same conventions covered in node and edge spatial mapping.

Implementation

The pipeline has two halves: a pure transformation that produces a deduplicated Arrow table of edges, and an async loader that streams those edges into the graph through pooled connections. Both are self-contained and runnable.

import asyncio
import hashlib

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
from neo4j import AsyncGraphDatabase
from pyproj import Transformer
from scipy.spatial import cKDTree


class SpatialGraphETL:
    def __init__(self, uri: str, user: str, password: str, pool_size: int = 25):
        self.driver = AsyncGraphDatabase.driver(
            uri, auth=(user, password), max_connection_pool_size=pool_size
        )

    async def ingest_batch(self, batch_edges: list[dict]):
        """Transactional upsert with deterministic edge hashing."""
        query = """
        UNWIND $batch AS row
        MERGE (s:Node {id: row.source})
        MERGE (t:Node {id: row.target})
        MERGE (s)-[r:CONNECTS {hash: row.edge_hash}]->(t)
        ON CREATE SET r.weight = row.weight, r.surface = row.surface
        ON MATCH SET r.updated_at = timestamp()
        """

        async def _upsert(tx):
            result = await tx.run(query, batch=batch_edges)
            await result.consume()

        # A managed transaction lets the driver retry transient failures,
        # such as deadlocks between concurrent batches that touch the same nodes.
        async with self.driver.session() as session:
            await session.execute_write(_upsert)

    async def ingest_table(self, edges: pa.Table, batch_size: int = 25_000,
                           max_inflight: int = 8):
        """Stream an Arrow table into the graph with bounded concurrency."""
        sem = asyncio.Semaphore(max_inflight)

        async def _send(batch: pa.RecordBatch):
            async with sem:
                # Convert to Python dicts only once a slot is free, so at most
                # max_inflight batches are materialized at the same time.
                await self.ingest_batch(batch.to_pylist())

        await asyncio.gather(
            *(_send(b) for b in edges.to_batches(max_chunksize=batch_size))
        )

    async def close(self):
        await self.driver.close()


def normalize_topology_pyarrow(
    nodes_table: pa.Table,
    edges_table: pa.Table,
    tolerance_m: float = 1.5,
) -> pa.Table:
    """Snap proximate nodes via a metric k-d tree, then emit deduplicated edges
    that carry every attribute the Cypher upsert expects.

    ``nodes_table`` columns: ``node_id``, ``lat``, ``lon``.
    ``edges_table`` columns: ``source``, ``target``, ``weight``, ``surface``.
    Returns an Arrow table with columns: ``source``, ``target``, ``edge_hash``,
    ``weight``, ``surface``.
    """
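    # 1. Project to Web Mercator so distances are Euclidean. Its units equal
    #    ground meters only at the equator; elsewhere they are stretched by
    #    roughly 1 / cos(latitude).
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
    lat = nodes_table.column("lat").to_numpy()
    x, y = transformer.transform(nodes_table.column("lon").to_numpy(), lat)

    # 2. k-d tree spatial index & proximity query. The radius is scaled by the
    #    Mercator stretch at the mean latitude so tolerance_m stays close to
    #    true ground meters across a regional extract.
    scale = 1.0 / np.cos(np.radians(lat.mean()))
    tree = cKDTree(np.column_stack((x, y)))
    pairs = tree.query_pairs(r=tolerance_m * scale, output_type="ndarray")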

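    # 3. Union-find for connected component resolution. The root of every
    #    cluster is the member with the smallest node_id, so the canonical id
    #    does not depend on row order or on the order pairs are processed in.
    node_ids = nodes_table.column("node_id").to_numpy()
    parent = np.arange(len(node_ids))

    def find(i: int) -> int:
        path = []
        while parent[i] != i:
            path.append(i)
            i = parent[i]
        # Path compression: point every visited node straight at the root.
        for node in path:
            parent[node] = i
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            return
        # Keep the smaller node_id as the root of the merged cluster.
        if node_ids[root_j] < node_ids[root_i]:
            root_i, root_j = root_j, root_i
        parent[root_j] = root_i

    for i, j in pairs:
        union(int(i), int(j))

    # 4. Canonical id mapping (node row index -> canonical row index -> node_id)
    id_index = {nid: idx for idx, nid in enumerate(node_ids)}
    canonical_for = np.array([node_ids[find(i)] for i in range(len(node_ids))])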

    src_canonical = np.array([canonical_for[id_index[s]] for s in edges_table.column("source").to_numpy()])
    tgt_canonical = np.array([canonical_for[id_index[t]] for t in edges_table.column("target").to_numpy()])

    # 5. Drop self-loops introduced by snapping
    keep = src_canonical != tgt_canonical
    src_canonical = src_canonical[keep]
    tgt_canonical = tgt_canonical[keep]
    weights = edges_table.column("weight").to_numpy()[keep]
    surfaces = edges_table.column("surface").to_numpy(zero_copy_only=False)[keep]

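    # 6. Deterministic, order-independent edge hash. Endpoints are also stored
    #    in sorted order so the directed MERGE sees the same (source, target)
    #    pair on every import, whichever way the way was digitized.
    lo = np.minimum(src_canonical, tgt_canonical)
    hi = np.maximum(src_canonical, tgt_canonical)
    edge_hashes = np.array([
        hashlib.sha256(f"{s}|{t}".encode()).hexdigest()
        for s, t in zip(lo, hi)
    ])

    edges = pa.table({
        "source": lo,
        "target": hi,
        "edge_hash": edge_hashes,
        "weight": weights,
        "surface": surfaces,
    })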

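    # Arrow has no drop_duplicates(); group by the hash and keep the first row.
    # "first" is an ordered aggregation, so it has to run single-threaded.
    grouped = edges.group_by("edge_hash", use_threads=False).aggregate([
        ("source", "first"), ("target", "first"),
        ("weight", "first"), ("surface", "first"),
    ])
    # Aggregated columns come back named "<column>_first". Strip the suffix by
    # name instead of relying on the position of the key column in the output.
    grouped = grouped.rename_columns(
        [name.removesuffix("_first") for name in grouped.column_names]
    )
    return grouped.select(["source", "target", "edge_hash", "weight", "surface"])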


async def run_pipeline(uri, user, password, nodes_table, edges_table):
    edges = normalize_topology_pyarrow(nodes_table, edges_table, tolerance_m=1.5)
    etl = SpatialGraphETL(uri, user, password)
    try:
        await etl.ingest_table(edges)
    finally:
        await etl.close()

The Cypher executed per batch is the contract between transformation and storage. MERGE on the deterministic hash is what makes a re-run a no-op instead of a duplication event:

// Idempotent edge ingestion with schema enforcement
UNWIND $batch AS row
MERGE (s:Node {id: row.source})
MERGE (t:Node {id: row.target})
MERGE (s)-[r:CONNECTS {hash: row.edge_hash}]->(t)
ON CREATE SET r.weight = row.weight, r.surface = row.surface
ON MATCH SET r.updated_at = timestamp()

How It Works

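Spatial snapping on raw latitude/longitude is invalid for meter-based tolerances because the ground length of one degree of longitude shrinks toward the poles. Step 1 projects every node into Web Mercator (EPSG:3857) so distances can be measured with plain Euclidean geometry. Web Mercator units equal ground meters only at the equator: elsewhere projected distances are stretched by roughly 1/cos(latitude), so tolerance_m is exact only if the radius is scaled by that factor or the nodes are projected into the local UTM zone instead. Step 2 builds a cKDTree and calls query_pairs, which returns only the node pairs closer than the tolerance, replacing an O(N²) all-pairs comparison with a localized index probe.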
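To make the degree-versus-meter problem concrete, the sketch below measures how far one degree of longitude reaches on the ground at a few latitudes. It uses the haversine formula on a spherical Earth with an assumed mean radius, which is an approximation chosen for illustration and not part of the pipeline; `haversine_m` and `meters_per_degree_lon` are hypothetical helper names.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius (spherical model)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def meters_per_degree_lon(lat):
    """Ground length of one degree of longitude at the given latitude."""
    return haversine_m(lat, 0.0, lat, 1.0)


for lat in (0, 45, 60):
    span = meters_per_degree_lon(lat)
    print(f"lat {lat:>2}: one degree of longitude spans about {span:,.0f} m")
```

A tolerance expressed in degrees would therefore cover about half as much ground east to west at 60° latitude as at the equator, which is why the snap radius has to be defined in a metric frame.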

Steps 3 and 4 are the part most naive snappers get wrong. When three or more endpoints sit within tolerance of each other, pairwise merging produces inconsistent results depending on iteration order. The union-find structure (with path compression in find) collapses each proximity cluster into a single canonical node deterministically, so every edge endpoint resolves to a stable id regardless of input ordering.
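The three-endpoint case can be sketched in plain Python. The example assumes one particular tie-break, namely that the smallest id in a cluster becomes its canonical id; `cluster_canonical_ids` is a hypothetical helper, and the ids and pairs are made up for illustration.

```python
from itertools import permutations


def cluster_canonical_ids(node_ids, pairs):
    """Map each node id to the smallest id in its proximity cluster."""
    parent = {nid: nid for nid in node_ids}

    def find(n):
        root = n
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[n] != root:
            nxt = parent[n]
            parent[n] = root
            n = nxt
        return root

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = (ra, rb) if ra < rb else (rb, ra)
            parent[hi] = lo  # the smaller id wins, whatever the pair order

    return {nid: find(nid) for nid in node_ids}


# 10-11 and 11-12 are within tolerance; 10-12 need not be. Node 99 is isolated.
pairs = [(11, 12), (10, 11)]
for order in permutations(pairs):
    print(order, cluster_canonical_ids([12, 99, 10, 11], order))
```

Every ordering of the pairs yields the same mapping, because the root is decided by comparing ids and not by which pair happened to be merged first.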

Step 5 drops self-loops created when both endpoints of a short segment snap to the same canonical node, and step 6 computes the edge_hash from the sorted (min, max) id pair. Because the hash is order-independent, an undirected street loaded as A→B in one extract and B→A in the next produces the same hash and collapses to one row during deduplication. The MERGE in the loader matches on direction as well as hash, so it stays idempotent across repeated regional imports only when the endpoints are also written in a consistent order; a reversed pair would otherwise create a second relationship carrying the same hash. The bounded-concurrency ingest_table then keeps up to max_inflight batches in flight without exhausting the connection pool, the same async batching principle developed in depth under async batch processing for graphs.
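A minimal standard-library sketch of the order-independent hash follows. `edge_hash` and `dedupe_edges` are hypothetical names, the tuples are invented sample data, and writing the endpoints back in sorted order is an illustrative choice that keeps the stored direction stable between runs.

```python
import hashlib


def edge_hash(source, target):
    """SHA-256 over the sorted endpoint pair, so (a, b) and (b, a) agree."""
    lo, hi = sorted((source, target))
    return hashlib.sha256(f"{lo}|{hi}".encode()).hexdigest()


def dedupe_edges(edges):
    """Keep the first edge seen per hash; edges are (source, target, weight)."""
    seen = {}
    for source, target, weight in edges:
        key = edge_hash(source, target)
        if key not in seen:
            lo, hi = sorted((source, target))
            seen[key] = {
                "source": lo,
                "target": hi,
                "edge_hash": key,
                "weight": weight,
            }
    return list(seen.values())


sample = [(1, 2, 5.0), (2, 1, 5.0), (2, 3, 1.0)]
for row in dedupe_edges(sample):
    print(row["source"], row["target"], row["edge_hash"][:12])
```

The second tuple is the first one reversed, so it hashes to the same key and is dropped; the surviving row always carries the smaller id as its source.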

Common Failure Patterns

Self-loops survive snapping and poison routing weights. If step 5 is skipped, a snapped micro-segment becomes a zero-length CONNECTS relationship that A* will happily traverse forever. Guard it both in transform and at the query layer:

MATCH (n:Node)-[r:CONNECTS]->(n)
DELETE r;

KeyError during canonical id mapping. When edges_table references a source or target that is missing from nodes_table (a common artifact of clipping ways at a bounding-box boundary), the id_index[s] lookup raises. Filter dangling edges before remapping rather than letting the comprehension crash:

# Build the lookup set inside Arrow instead of round-tripping through Python.
valid = pc.unique(nodes_table.column("node_id"))
mask = pc.and_(
    pc.is_in(edges_table.column("source"), value_set=valid),
    pc.is_in(edges_table.column("target"), value_set=valid),
)
edges_table = edges_table.filter(mask)

Connection pool exhaustion under unbounded fan-out. Calling asyncio.gather over every batch without a semaphore opens more sessions than max_connection_pool_size allows; the driver then blocks or raises acquisition timeouts mid-load. The max_inflight semaphore in ingest_table caps concurrent sessions below the pool ceiling — keep max_inflight at roughly one-third of pool_size to leave headroom for retries.
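The effect of the semaphore can be observed without a database. In the sketch below, `fake_ingest` is a stand-in that sleeps instead of calling the driver, and `run_bounded` is a hypothetical helper that records the highest number of sends in flight at once.

```python
import asyncio


async def run_bounded(batches, max_inflight):
    """Send every batch while never exceeding max_inflight concurrent sends."""
    sem = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def fake_ingest(batch):
        # Stand-in for a database round trip; no driver is involved.
        nonlocal inflight, peak
        inflight += 1
        peak = max(peak, inflight)
        await asyncio.sleep(0.01)
        inflight -= 1
        return len(batch)

    async def send(batch):
        async with sem:
            return await fake_ingest(batch)

    sizes = await asyncio.gather(*(send(b) for b in batches))
    return sum(sizes), peak


batches = [list(range(10)) for _ in range(20)]
total, peak = asyncio.run(run_bounded(batches, max_inflight=8))
print(f"rows sent: {total}, peak concurrent sends: {peak}")
```

All twenty coroutines are scheduled immediately, but only as many as the semaphore admits ever reach the send step, which is the property that keeps session acquisition under the pool ceiling.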

Performance Notes

Continental extracts will not fit a single cKDTree in heap, so partition the input by administrative boundary or UTM zone and run the transformation once per partition. The partition count $P$ and peak memory $M_{\text{peak}}$ follow directly from the total node count $N$ and the partition size $N_{\text{part}}$:

$$ P = \left\lceil \frac{N}{N_{\text{part}}} \right\rceil, \qquad M_{\text{peak}} \approx N_{\text{part}}\,\bigl(c_{\text{kd}} + c_{\text{arrow}}\bigr) $$

where $c_{\text{kd}} \approx 8\text{–}12$ bytes per coordinate pair for the tree and $c_{\text{arrow}}$ is the columnar buffer footprint per row. With $N_{\text{part}} = 500{,}000$, each partition tree builds in roughly $O(N_{\text{part}} \log N_{\text{part}})$ time and its memory stays well under a 1 GB worker budget, so partitions can be processed in parallel.
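The two formulas can be evaluated directly as a quick sizing check. In this sketch the total node count and the per-row Arrow footprint are placeholder assumptions, not measurements; only the partition size and the upper end of the per-pair tree cost come from the text above, and `partition_plan` is a hypothetical helper.

```python
import math


def partition_plan(n_nodes, n_part, c_kd_bytes, c_arrow_bytes):
    """Return (partition count P, approximate peak bytes per partition)."""
    partitions = math.ceil(n_nodes / n_part)
    peak_bytes = n_part * (c_kd_bytes + c_arrow_bytes)
    return partitions, peak_bytes


N_NODES = 120_000_000   # assumed size of a continental extract
N_PART = 500_000        # partition size used in the text
C_KD_BYTES = 12         # upper end of the quoted per-pair tree cost
C_ARROW_BYTES = 64      # assumed per-row columnar footprint

partitions, peak_bytes = partition_plan(N_NODES, N_PART, C_KD_BYTES, C_ARROW_BYTES)
print(f"P = {partitions} partitions, M_peak ~ {peak_bytes / 1e6:.0f} MB each")
```

Measuring the real per-row footprint on a sample partition and substituting it here is a cheap way to pick a partition size before a full run.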

Tolerance is the dominant precision/throughput knob. A tight tolerance (≤0.5 m) preserves curb-level accuracy but inflates vertex count and adjacency sparsity; a loose tolerance (≥3.0 m) accelerates ingestion but fabricates shortcuts in pedestrian networks. For mixed road classes, scale tolerance dynamically — 1.0 m for residential streets, 2.5 m for motorways. Switch from in-memory snapping to a tiled, disk-backed strategy once a single partition exceeds available heap. After load, verify topology rather than trusting it: confirm the largest connected component holds ≥99.8% of routable edges, then push routing weight tuning into Cypher performance tuning and index selection into spatial indexing strategies.
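The post-load connectivity check can be prototyped on an exported edge list before it is written as a graph query. This is a minimal sketch that measures component size by edge count; `largest_component_edge_share` is a hypothetical helper and the sample edges are synthetic.

```python
from collections import Counter


def largest_component_edge_share(edges):
    """Fraction of edges that fall inside the largest connected component."""
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        root = n
        while parent[root] != root:
            root = parent[root]
        # Path compression keeps later lookups short.
        while parent[n] != root:
            nxt = parent[n]
            parent[n] = root
            n = nxt
        return root

    for source, target in edges:
        rs, rt = find(source), find(target)
        if rs != rt:
            parent[rt] = rs

    # Both endpoints of an edge share a root, so counting by source suffices.
    per_component = Counter(find(source) for source, _ in edges)
    if not per_component:
        return 0.0
    return max(per_component.values()) / len(edges)


# One long chain plus two stray segments that never connect to it.
edges = [(i, i + 1) for i in range(998)] + [(5000, 5001), (6000, 6001)]
share = largest_component_edge_share(edges)
print(f"largest component holds {share:.1%} of edges")
```

A share below the chosen threshold usually points at a tolerance that is too tight or at ways clipped along a partition boundary.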

Related

OSM Data Ingestion Pipelines: the ingestion pipeline guide this page belongs to.
Attribute Synchronization Techniques: keep surface, weight, and POI metadata current after the initial load.
POI Enrichment Workflows: attach demographic and amenity context to the normalized node set.
Async Batch Processing for Graphs: scale the loader with asyncio backpressure controls.

This guide is part of the OSM Data Ingestion Pipelines collection, which sits within the broader Spatial Graph Construction & OSM Ingestion pillar.
