Reducing RAM Usage During Dense Matching

You launch dense matching on a multi-hundred-image block, memory climbs for twenty minutes, and then the process vanishes. dmesg shows a line like Out of memory: Killed process 18442 (DensifyPointCloud), the depth-map directory holds a few orphaned .dmap files, and the final .ply is either missing or truncated. This is one of the most common failure modes when survey technicians and Python GIS developers scale Multi-View Stereo (MVS) past a test flight. Dense matching is the most memory-intensive phase of UAV photogrammetry, and resident set size grows non-linearly with sensor resolution, forward overlap, and thread count, so a configuration that ran fine on 80 images detonates on 800. This page shows how to hold peak RAM under a hard ceiling by partitioning the block into tiles, deriving a safe worker count from available memory, and forcing process recycling between tiles. It is the memory-control layer beneath the broader parallel processing strategies for alignment.

Why dense matching exhausts RAM

Sparse alignment stores one small descriptor set and one camera pose per image; dense matching abandons that compact representation and computes a per-pixel depth estimate for every frame. Patch-based optimisation and plane-sweep stereo both materialise large intermediate buffers — a depth map, a normal map, and a confidence map at (or near) full image resolution — for each view in the current working set, and they hold several neighbouring views in memory simultaneously to cross-check depth hypotheses.

Three properties of drone data make this explosive rather than merely large:

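Resolution scales the buffer quadratically. Buffer size is proportional to pixel count, so it grows with the square of linear resolution: a 20 MP frame needs roughly four times the depth-buffer memory of a 5 MP frame. Float32 depth, normal, and confidence maps for a single 5472 × 3648 frame already occupy roughly 400 MB (4 bytes of depth, 12 of normal, and 4 of confidence per pixel), and still about 240 MB if the normal is packed into a single channel, before any neighbour views are loaded.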
High forward overlap inflates the working set. At 80% overlap every surface point is seen by many frames, so the densifier keeps more simultaneous views resident to fuse them. Overlap is what makes the reconstruction accurate — it is also what makes it memory-hungry. Setting overlap correctly upstream, via the optimal flight overlap calculation, is the first lever, but it cannot be lowered enough to fix dense matching without harming the model.
Concurrency multiplies everything. Peak RAM is approximately the per-view working set times the number of concurrent workers. Spawn eight workers against an 8 GB-per-tile job and you have implicitly requested 64 GB.
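The arithmetic behind these three properties can be checked in a few lines. The sketch below is an illustration, not a measurement: it assumes every map is stored as float32, defaults to a three-channel normal (engines may pack it differently), and uses a hypothetical resident_views parameter for the number of views a worker holds at once.

```python
def frame_buffer_mb(width: int, height: int, normal_channels: int = 3) -> float:
    """Approximate float32 depth + normal + confidence memory for one view, in MB."""
    bytes_per_pixel = 4 * (1 + normal_channels + 1)   # depth + normal + confidence
    return width * height * bytes_per_pixel / 1e6


def peak_estimate_gb(width: int, height: int,
                     resident_views: int, workers: int) -> float:
    """Rough peak: per-view buffers x views held at once x concurrent workers."""
    return frame_buffer_mb(width, height) * resident_views * workers / 1e3


# One 5472 x 3648 view, with a three-channel and a single-channel normal.
print(f"{frame_buffer_mb(5472, 3648):.0f} MB per view (3-channel normal)")
print(f"{frame_buffer_mb(5472, 3648, normal_channels=1):.0f} MB per view (1-channel normal)")

# ASSUMPTION: 8 resident views per worker is an illustrative value only.
print(f"{peak_estimate_gb(5472, 3648, resident_views=8, workers=8):.1f} GB estimated peak")
```

The estimate ignores image data, cost volumes, and allocator overhead, so treat it as a lower bound and rely on measured peak RSS for real sizing.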

A single unpartitioned block of 4K-plus imagery at 80% overlap can exceed 32 GB of resident memory before depth fusion even begins. The fix is therefore not “use less memory per pixel”, since the algorithm’s appetite is fixed, but to bound the work in space and bound the concurrency, so the product of working-set size and worker count never crosses physical RAM. This is the same out-of-core discipline applied downstream in memory management for large point clouds, pushed one stage earlier into densification.

The maximum safe worker count follows directly from available memory and the measured per-worker buffer size:

$$
\text{max\_workers} = \left\lfloor \frac{\text{RAM}_{\text{avail}} \times 0.75}{\text{buffer}_{\text{worker}}} \right\rfloor
$$

The 0.75 reserves 25% headroom for OS paging, the page cache that the densifier’s own file I/O fills, and Python’s garbage collector — without it the box wedges into swap at exactly the moment fusion peaks.
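A few concrete values make the formula easier to sanity-check. The sketch below simply evaluates it. The RAM sizes are illustrative, and max_workers is a hypothetical helper name used only for this example.

```python
import math


def max_workers(ram_avail_gb: float, buffer_worker_gb: float,
                headroom: float = 0.75) -> int:
    """Evaluate floor(RAM_avail * 0.75 / buffer_worker) from the formula above."""
    return math.floor(ram_avail_gb * headroom / buffer_worker_gb)


# Illustrative hosts with a 4 GB per-worker buffer.
for ram in (16, 32, 64, 128):
    print(f"{ram:>3} GB available -> {max_workers(ram, 4.0)} workers")

# An 8 GB-per-tile job on a 64 GB host: the formula allows fewer than eight workers.
print(max_workers(64, 8.0))
```

Note that the result can be 0 when available memory is smaller than one buffer, which is why any real dispatcher has to floor the count at 1.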

Minimal memory-bounded dispatcher

The routine below is the focused fix: it measures real available RAM, derives a worker count from the formula above, and dispatches tiles through a multiprocessing.Pool with maxtasksperchild=1 so every tile runs in a fresh process that is torn down — and its memory fully reclaimed — before the next begins. Replace mvs_engine and its flags with your actual binary (colmap, OpenMVS/DensifyPointCloud, or the ODM --pc-quality workflow); flag names vary by tool, so consult your engine’s CLI reference for the exact equivalents.

import os
import logging
import subprocess
from pathlib import Path
from multiprocessing import Pool

import psutil  # >= 5.9

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def safe_worker_count(buffer_gb: float = 4.0) -> int:
    """Workers that fit available RAM with 25% headroom, capped at core count."""
    avail_gb = psutil.virtual_memory().available / (1024 ** 3)
    by_memory = int((avail_gb * 0.75) / buffer_gb)        # the formula above
    return max(1, min(by_memory, os.cpu_count() or 4))    # never exceed cores


def run_tile(args: tuple[int, int, int, Path]) -> bool:
    """Densify one spatial tile in its own process with a hard memory cap."""
    x, y, size, out_dir = args
    name = f"tile_{x}_{y}"
    cmd = [
        "mvs_engine", "--input", "aligned_block",
        "--tile-size", f"{size}x{size}", "--tile-offset", f"{x},{y}",
        "--overlap", "10%", "--pc-quality", "medium",
        "--max-memory-mb", "4096",            # engine-side ceiling, mirrors buffer_gb
        "--output", str(out_dir / name),
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        logging.info("completed %s", name)
        return True
    except subprocess.CalledProcessError as exc:
        logging.error("failed %s: %s", name, exc.stderr.strip()[:200])
        return False


def densify(block_w: int, block_h: int, tile: int = 2000, overlap: float = 0.10):
    out_dir = Path("dense_chunks")
    out_dir.mkdir(parents=True, exist_ok=True)
    step = int(tile * (1 - overlap))                       # stride; neighbours share 10%
    tiles = [(x, y, tile, out_dir)
             for x in range(0, block_w, step)
             for y in range(0, block_h, step)]
    workers = safe_worker_count(buffer_gb=4.0)
    logging.info("dispatching %d tiles across %d workers", len(tiles), workers)
    # maxtasksperchild counts chunks, not items, so chunksize=1 is needed
    # for a worker to be recycled after every single tile.
    with Pool(processes=workers, maxtasksperchild=1) as pool:
        results = pool.map(run_tile, tiles, chunksize=1)
    logging.info("done: %d/%d tiles succeeded", sum(results), len(tiles))


if __name__ == "__main__":   # required where workers are spawned (Windows, macOS)
    densify(block_w=10_000, block_h=8_000)   # example block extents in pixels

Three lines carry the whole result. safe_worker_count ties concurrency to current free memory rather than a hard-coded number, so the same script survives a 16 GB laptop and a 128 GB workstation. The step = int(tile * (1 - overlap)) stride makes adjacent tiles share a 10% band, which is what lets depth maps stitch without a visible seam at tile boundaries. And maxtasksperchild=1, paired with chunksize=1 so that each task is exactly one tile, is the part most pipelines omit. A densifier called in-process through C-extension bindings allocates through its own allocator, so Python’s reference counting never returns that memory until the process exits. When the engine runs as a subprocess, as here, its memory is already released when it exits, and recycling the worker additionally discards whatever the Python side accumulated, such as captured output. Either way every tile starts from a clean slate, which prevents the slow creep that kills long batch runs.
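To see what the stride produces for a concrete block, the sketch below lays out the grid along each axis. The 10 000 × 8 000 pixel extents are a made-up example and tile_origins is a hypothetical helper that mirrors the stride used in densify.

```python
def tile_origins(extent: int, tile: int = 2000, overlap: float = 0.10) -> list[int]:
    """Tile origins along one axis, using the same stride as densify()."""
    step = int(tile * (1 - overlap))
    return list(range(0, extent, step))


tile = 2000
xs = tile_origins(10_000, tile)      # hypothetical block width in pixels
ys = tile_origins(8_000, tile)       # hypothetical block height in pixels

shared = xs[0] + tile - xs[1]        # pixels two neighbouring tiles have in common
overhang = xs[-1] + tile - 10_000    # how far the last column extends past the block

print(f"{len(xs) * len(ys)} tiles, {shared} px shared, {overhang} px overhang")
```

For these numbers the grid is 6 × 5 = 30 tiles with a 200-pixel shared band, and the last column extends 1000 pixels past the block edge. Whether an engine clips such edge tiles or rejects them is tool-specific, so check before relying on it.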

A 2000 × 2000 pixel tile with 10% overlap is the workable default. Pair it with a tiered quality pass — run medium (or low) first to validate geometry and locate low-texture zones such as water or uniform asphalt, then re-run only the validated regions of interest at high. This avoids paying the high-quality memory and time cost on the entire block and typically cuts total processing time by 40–60%. The tiering is the densification counterpart of the spatial split described in the dataset-splitting routine.
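One way to restrict the high-quality pass is to re-run only the tiles that intersect a region of interest. The sketch below is a minimal illustration under assumptions: tiles_for_roi is a hypothetical helper, the region is an axis-aligned rectangle in the same pixel units as the tile grid, and the input should contain only tiles that succeeded in the medium pass.

```python
Tile = tuple[int, int, int]            # (x, y, size)
Rect = tuple[int, int, int, int]       # (xmin, ymin, xmax, ymax)


def tiles_for_roi(tiles: list[Tile], roi: Rect) -> list[Tile]:
    """Keep tiles whose square footprint overlaps the region of interest."""
    xmin, ymin, xmax, ymax = roi
    return [(x, y, s) for x, y, s in tiles
            if x < xmax and x + s > xmin      # overlap along x
            and y < ymax and y + s > ymin]    # overlap along y


# Hypothetical grid: 2000 px tiles on an 1800 px stride over a 10 000 x 8 000 block.
grid = [(x, y, 2000) for x in range(0, 10_000, 1800)
                     for y in range(0, 8_000, 1800)]
selected = tiles_for_roi(grid, roi=(2000, 2000, 4000, 3000))
print(f"re-running {len(selected)} of {len(grid)} tiles at high quality")
```

Tiles that only touch the region at an edge are excluded, because the comparisons are strict.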

Edge-case matrix

The dispatcher must degrade predictably, not silently, when inputs are hostile. Each row is a condition the routine above is built to survive.

| Input variant | Symptom if unhandled | Expected handling |
| --- | --- | --- |
| `available RAM < buffer_gb` | `by_memory` rounds to 0, `Pool(processes=0)` raises | `max(1, ...)` floors workers at 1; tile still runs, just serially |
| Free RAM > total cores × buffer | over-subscription, context-switch thrash | `min(by_memory, cpu_count())` caps workers at the logical CPU count |
| Working dir on SATA SSD, not NVMe | workers starve, buffers queue, RAM inflates | require ≥ 2000 MB/s scratch; depth/normal maps are written at high frequency |
| C-extension leak across tiles | RSS creeps until OOM mid-batch | `maxtasksperchild=1` forces full teardown after each tile |
| One tile fails (low texture / no parallax) | `pool.map` aborts the whole block | `run_tile` catches `CalledProcessError`, returns `False`, batch continues |
| Fusion peak exceeds RAM despite caps | hard OOM kill, no recovery | 16 GB swap as a slow safety net; keep `vm.swappiness` at 10–30 |
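The memory-related rows can be checked before a batch starts. The sketch below is one possible preflight, not part of the dispatcher: preflight_warnings and read_host_state are hypothetical helpers, the thresholds are taken from the matrix, and the swappiness file exists on Linux only.

```python
from pathlib import Path
from typing import Optional


def preflight_warnings(buffer_gb: float, avail_gb: float, swap_gb: float,
                       swappiness: Optional[int]) -> list[str]:
    """Return human-readable warnings for the memory rows of the matrix."""
    warnings = []
    if avail_gb < buffer_gb:
        warnings.append(
            f"available RAM {avail_gb:.1f} GB < buffer {buffer_gb:.1f} GB: "
            "one serial worker, and the tile may still not fit")
    if swap_gb < 16:
        warnings.append(f"swap is {swap_gb:.1f} GB; the matrix suggests 16 GB")
    if swappiness is not None and not 10 <= swappiness <= 30:
        warnings.append(f"vm.swappiness is {swappiness}; the matrix suggests 10-30")
    return warnings


def read_host_state() -> tuple[float, float, Optional[int]]:
    """Collect live values; psutil is imported lazily so the checks stay testable."""
    import psutil  # third-party, the same dependency the dispatcher uses
    gib = 1024 ** 3
    try:  # Linux only; other platforms have no such file
        swappiness = int(Path("/proc/sys/vm/swappiness").read_text().strip())
    except (OSError, ValueError):
        swappiness = None
    return (psutil.virtual_memory().available / gib,
            psutil.swap_memory().total / gib,
            swappiness)


# Fixed example values, so the output does not depend on the host.
for line in preflight_warnings(4.0, avail_gb=3.0, swap_gb=2.0, swappiness=60):
    print("WARNING:", line)
```

On a real host, pass the tuple from read_host_state() into preflight_warnings and log the result before calling densify.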

Verifying the ceiling holds

Capping workers is worthless if the per-tile working set was estimated wrong. Sample peak RSS while a single tile runs and assert it stayed inside the budget before trusting the configuration on a full block.

import time
import psutil
import subprocess


def assert_tile_within_budget(cmd: list[str], budget_gb: float = 4.0,
                              poll_s: float = 0.5) -> None:
    """Run one tile, sample peak RSS, and fail if it breaches the budget."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:                                   # sum children: the engine may fork
            procs = [ps, *ps.children(recursive=True)]
        except psutil.NoSuchProcess:           # engine exited between poll and sample
            break
        rss = 0
        for p in procs:
            try:
                rss += p.memory_info().rss
            except psutil.NoSuchProcess:       # a short-lived child vanished; skip it
                continue
        peak = max(peak, rss)
        time.sleep(poll_s)
    proc.wait()                                # ensure returncode is populated
    peak_gb = peak / (1024 ** 3)
    assert proc.returncode == 0, f"tile exited {proc.returncode} (possibly OOM-killed)"
    assert peak_gb <= budget_gb * 1.1, (        # 10% sampling tolerance
        f"peak {peak_gb:.1f} GB breached {budget_gb} GB budget; "
        "shrink the tile or raise buffer_gb")
    print(f"OK: peak {peak_gb:.1f} GB within {budget_gb} GB budget")

Feed buffer_gb in safe_worker_count the value you actually measured here, not a guess. If a single tile already peaks above the budget, the tile itself is too large — shrink tile to 1500 or 1000 before adding concurrency, because no worker-count maths can rescue a tile that does not fit on its own.
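When the measured peak is over budget, a first guess at the next tile size can be scripted. The sketch below rests on a loud assumption: that peak memory scales roughly with tile area. Real engines also carry costs that do not shrink with the tile, so suggest_tile is a hypothetical starting point and the chosen size must be re-measured.

```python
from typing import Optional


def suggest_tile(measured_peak_gb: float, budget_gb: float, tile: int = 2000,
                 candidates: tuple[int, ...] = (2000, 1500, 1000)) -> Optional[int]:
    """Largest candidate tile whose estimated peak fits the budget, else None."""
    for cand in sorted(candidates, reverse=True):
        # ASSUMPTION: peak RSS is proportional to tile area (side length squared).
        estimated = measured_peak_gb * (cand / tile) ** 2
        if estimated <= budget_gb:
            return cand
    return None   # nothing fits: see the escalation criteria


print(suggest_tile(measured_peak_gb=6.0, budget_gb=4.0))    # over budget at 2000
print(suggest_tile(measured_peak_gb=20.0, budget_gb=4.0))   # far over budget
```

A result of None means even the smallest candidate is estimated not to fit, which is the signal to stop tuning tile size on this host.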

When to escalate

This per-host tiling fix is sufficient for most corridor and regional surveys, but stop and move up to the parallel processing strategies for alignment when:

A single 1000 × 1000 tile still OOMs at low quality. The working set for one view exceeds your machine’s RAM regardless of concurrency; you need cross-host distribution or a GPU densifier with managed VRAM spill, not finer tiling.
Tiling produces seam artefacts or depth discontinuities at block boundaries that the 10% overlap does not absorb — the parent page covers block partitioning and tie-point redundancy that preserve geometric continuity across tiles.
The bottleneck has moved from RAM to wall-clock even though memory now stays bounded. Scheduling, per-stage worker pools, and disk-throughput tuning belong to the parent workflow rather than this single-stage memory cap.
