Spatial Graph Construction & OSM Ingestion

Spatial Graph Construction & OSM Ingestion demands deterministic topology mapping and strict memory governance. Raw OpenStreetMap extracts deliver highly compressed Protobuf (PBF) geometries and unstructured XML metadata that must resolve into directed, weighted edge networks suitable for production routing. Naive node-per-coordinate ingestion strategies fail under metropolitan-scale density, causing heap fragmentation, transaction log bloat, and routing discontinuities. Production systems require spatial partitioning, explicit concurrency limits, and batched transaction boundaries from deployment day one.

The ingestion architecture decomposes into three discrete pipeline stages: stream deserialization, topology resolution, and graph materialization. Each stage operates under bounded resource constraints to prevent driver connection pool exhaustion and ensure predictable latency during peak ingestion windows.

flowchart LR
  subgraph S1["1 — Stream deserialization"]
    PBF["PBF blocks"] --> Tags["Tag filter + WGS84 projection"]
  end
  subgraph S2["2 — Topology resolution"]
    Snap["Coordinate snapping<br/>0.3–0.5 m tolerance"] --> Dir["Directional edges<br/>oneway + restrictions"]
  end
  subgraph S3["3 — Graph materialization"]
    Batch["UNWIND batches<br/>50k nodes / 200k edges"] --> Graph[("Routing graph")]
  end
  Tags --> Snap
  Dir --> Batch
  classDef s1 fill:#fbfaf6,stroke:#1b3a4b,color:#1b2330;
  classDef s2 fill:#f6f0e6,stroke:#b58900,color:#5b3a0d;
  classDef s3 fill:#e9f5f4,stroke:#0e7c86,color:#0e7c86;
  class PBF,Tags s1
  class Snap,Dir s2
  class Batch,Graph s3

Stream Deserialization & Memory Governance

PBF files must be parsed using zero-copy stream deserialization to avoid loading entire regional extracts into RAM. The OpenStreetMap PBF Format Specification defines dense coordinate arrays and string table compression that require sequential block iteration. We parse Node, Way, and Relation blocks concurrently, mapping OSM tags to graph properties during extraction. Coordinate projection, unit normalization, and tag filtering occur before any database interaction. Production configurations for these stages are documented in OSM Data Ingestion Pipelines.

Python 3.10+ asyncio drivers enable non-blocking batch commits under heavy I/O load. We enforce strict transaction caps (50k nodes, 200k edges) to respect write-ahead log limits and prevent lock contention. Memory-mapped buffers and generator-based batching reduce garbage collection pauses during peak ingestion windows.

import asyncio
import math
from typing import AsyncIterator, Dict, List
from neo4j import AsyncGraphDatabase
from neo4j.exceptions import Neo4jError

EARTH_RADIUS_KM = 6371.0

def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Compute great-circle distance in meters for topology validation."""
    φ1, φ2 = math.radians(lat1), math.radians(lat2)
    Δφ = math.radians(lat2 - lat1)
    Δλ = math.radians(lon2 - lon1)
    a = math.sin(Δφ/2)**2 + math.cos(φ1) * math.cos(φ2) * math.sin(Δλ/2)**2
    return 2 * EARTH_RADIUS_KM * 1000 * math.asin(math.sqrt(a))

async def ingest_spatial_batch(
    uri: str, auth: tuple, batch: List[Dict], max_pool: int = 16
) -> None:
    driver = AsyncGraphDatabase.driver(
        uri, auth=auth, max_connection_pool_size=max_pool,
        connection_acquisition_timeout=10.0
    )
    query = """
    UNWIND $batch AS item
    MERGE (n:Node {id: item.node_id})
    SET n.location = point({latitude: item.lat, longitude: item.lon}),
        n.osm_type = item.type,
        n.elevation = item.elevation
    WITH n, item
    OPTIONAL MATCH (m:Node {id: item.source_id})
    WHERE m IS NOT NULL
    MERGE (m)-[r:CONNECTS_TO]->(n)
    SET r.distance_m = item.dist_m,
        r.surface = item.surface,
        r.max_speed_kph = item.max_speed_kph
    """
    async with driver.session(database="spatial_routing") as session:
        try:
            result = await session.run(query, batch=batch)
            await result.consume()
        except Neo4jError as e:
            raise RuntimeError(f"Batch commit failed: {e}") from e
    await driver.close()

Topology Resolution & Spatial Partitioning

Topology correctness depends on precise coordinate snapping and intersection resolution. Raw OSM coordinates rarely align perfectly due to survey variance, GPS drift, and multi-source contributor edits. We apply a configurable tolerance threshold (typically 0.3–0.5 meters) to merge duplicate junctions. Haversine distance calculations validate edge continuity before committing to the graph. Enforcing strict Graph Validation & Integrity Rules prevents routing anomalies like phantom edges and disconnected subgraphs.

Memory pressure scales linearly with coordinate density unless we apply spatial partitioning. Geohash-based sharding and Quadtree spatial filters isolate regional subgraphs before ingestion. By bounding each partition to a fixed geographic extent, we parallelize ingestion across worker pools without cross-node lock contention. Coordinate transformations should leverage established libraries like PROJ to ensure consistent datum alignment before snapping operations.

def snap_junctions(coords: List[tuple], tolerance_m: float = 0.5) -> List[tuple]:
    """Merge coordinates within tolerance using grid-based hashing."""
    grid = {}
    snapped = []
    for lat, lon in coords:
        # Convert tolerance to approximate degree delta at equator
        delta = tolerance_m / 111320.0
        key = (round(lat / delta), round(lon / delta))
        if key not in grid:
            grid[key] = (lat, lon)
            snapped.append((lat, lon))
    return snapped

Directed Edge Materialization & Routing Readiness

Parsed Way entities must resolve into a directed multigraph. OSM ways are inherently bidirectional unless tagged with oneway=yes, oneway=-1, or turn restriction relations. During materialization, we decompose ways into atomic edges, applying directional flags, speed limits, and surface friction coefficients. Downstream Attribute Synchronization Techniques ensure that temporal updates (e.g., road closures, seasonal weight limits) propagate without full graph rebuilds.

Edge weights are normalized to traversal time rather than raw distance. The routing cost function incorporates base speed, surface degradation, turn penalties, and traffic signal delays. When integrating location-based services, POI Enrichment Workflows attach delivery zones, charging stations, and curb access metadata to adjacent graph nodes without bloating the core routing topology.

Production Hardening & Transaction Boundaries

High-throughput ingestion requires explicit backpressure management and idempotent retry logic. Connection acquisition timeouts, circuit breakers, and exponential backoff prevent driver exhaustion during database maintenance windows. For high-throughput environments, Async Batch Processing for Graphs outlines semaphore-based concurrency control and chunked commit strategies.

Transaction boundaries must align with spatial partition limits. Committing across partition boundaries introduces distributed transaction overhead and increases rollback probability. We enforce atomic partition commits, logging each batch to an immutable audit ledger. During regional outages or schema migrations, Emergency Bypass & Audit Trails provide deterministic fallback paths and replayable ingestion checkpoints.

Scaling Trade-offs

  • Memory vs. Throughput: Streaming PBF blocks reduces RAM footprint but increases CPU overhead for repeated tag parsing. Pre-filtering tags at the parser level yields optimal balance.
  • Snapping Tolerance vs. Accuracy: Aggressive snapping (>1.0m) collapses distinct parallel roads; conservative snapping (<0.2m) leaves micro-gaps that break A* and contraction hierarchy routing.
  • Batch Size vs. Transaction Log: Larger batches reduce round-trip latency but risk write-ahead log saturation. The 50k/200k node/edge cap aligns with default Neo4j transaction log rotation thresholds.

Spatial graph construction from OSM is not a one-time ETL task; it is a continuous topology reconciliation process. By enforcing deterministic snapping, bounded concurrency, and routing-ready edge weighting, engineering teams can maintain sub-millisecond query latency even as metropolitan networks scale into millions of nodes and edges.