Why did splitting my dataset by filename make reconstruction worse?

File order does not match ground position on a serpentine flight, so index-based chunks cut through dense tie-point clusters and leave each piece with a rank-deficient match matrix. Split by spatial position with buffered overlap instead, so every seam retains cross-tie observations.

How much overlap should buffered tiles share?

Keep the buffer at 20-25 percent of tile size. Below about 12 percent the seam carries too few cross-ties and bundle adjustment fails to converge; well above 30 percent you waste memory re-solving the same frames in multiple tiles.

What size should each tile be?

Size the tile so its frames fit the solver's RAM budget, roughly 80 to 120 MB of descriptors per high-resolution RGB image. At that rate 16 GB holds about 130 to 200 frames at once, so check how many frames an 800 m tile carries at your flight density; if a tile still OOM-kills, lower tile_m and re-run.

My southern-hemisphere survey came out mirrored. Why?

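The most likely cause is a northern UTM EPSG (326xx) applied south of the equator. It yields negative northings, and any step that assumes positive Y or drops the sign mirrors the layout. Select an EPSG in the 327xx band for southern latitudes so projected Y stays positive and the tile layout matches the ground.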

Python Script to Split Large Datasets for Processing

You queued a 15,000-image survey for bundle adjustment and the solver either died with `MemoryError` during Jacobian assembly or churned for hours before reporting `did not converge`. The instinctive fix, splitting the folder into smaller batches by filename or index, made it worse: now the reconstruction has seams, tiles drift apart by metres at their shared edges, and the merged sparse cloud is unusable. This page solves that exact failure: how to cut an oversized image block into solver-sized pieces in Python without destroying the spatial adjacency that structure-from-motion depends on.

Why naive splitting breaks reconstruction

Bundle adjustment is a global least-squares solve over a graph of tie points. Two images contribute a usable constraint only when they observe the same ground features, which on a normal mission means they were captured physically close together with high forward and side overlap. The convergence of the solver therefore depends on the topology of that graph, not on the order images sit in a directory.

Sequential chunking (splitting by file index, capture time, or alphabetical sort) severs that topology. On a serpentine flight, consecutive files such as DJI_0500.JPG and DJI_0501.JPG are along-track neighbours, but the side-overlap partner of a frame sits on the next flight line and can be hundreds of indices away. Cut the list into blocks and you slice straight through those cross-line tie-point clusters, leaving each piece with a sparse, rank-deficient match matrix and no frames shared across the cut. Relative orientation then fails or, worse, succeeds on a degenerate subset and produces a self-consistent but geographically wrong block: the same axis-order-style drift the parent guide warns about, except caused by topology loss rather than a CRS bug.
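To make the gap between file index and ground position concrete, the sketch below simulates a small serpentine flight and counts how many ground-neighbour pairs a naive index split separates. The grid dimensions, frame spacing, line spacing, and chunk size are illustrative assumptions, not values from a real mission.

```python
import math


def serpentine_positions(n_lines, per_line, spacing=30.0, line_gap=60.0):
    """Ground (x, y) in capture order for a back-and-forth flight (assumed geometry)."""
    pts = []
    for line in range(n_lines):
        cols = range(per_line)
        if line % 2:  # odd lines are flown in the opposite direction
            cols = reversed(cols)
        for col in cols:
            pts.append((col * spacing, line * line_gap))
    return pts


def severed_pairs(pts, chunk, radius):
    """Count ground-neighbour pairs (within radius metres) split by index chunking."""
    cut = total = 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if math.dist(pts[i], pts[j]) <= radius:
                total += 1
                cut += (i // chunk) != (j // chunk)
    return cut, total


pts = serpentine_positions(n_lines=10, per_line=50)

# Frame 0 starts line 0; frame 99 ends line 1 directly beside it on the ground.
print("index gap:", 99 - 0, "ground gap (m):", math.dist(pts[0], pts[99]))

cut, total = severed_pairs(pts, chunk=120, radius=60.0)
print(f"{cut} of {total} neighbour pairs land in different index chunks")
```

Every severed pair is a set of tie points that neither chunk can use, and because index chunks share no frames, nothing spans the cut to pull the pieces back together.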

The correct unit of partition is space, not sequence. If you tile the survey by ground position and let neighbouring tiles share a buffer zone of images, every internal seam still has hundreds of cross-tie observations spanning it, so each tile orients independently and the tiles re-register cleanly in a final global pass. That buffer is the whole game: too thin and the seams gap; too thick and you waste memory re-solving the same frames. This builds directly on the same overlap discipline used for calculating optimal flight overlap for Python processing — here you are preserving captured overlap rather than planning it.

Minimal reproducible solution

The partitioner below takes per-image GPS fixes (the output of your EXIF GPS validation pass), projects them into a metre-based CRS, and assigns each frame to every overlapping tile. Stride is deliberately smaller than tile size, so adjacent tiles share a buffer. It is intentionally compact — clarity over completeness — and emits a chunk manifest you can feed to the solver one tile at a time.

import json
import math
from collections import defaultdict
from pathlib import Path

from pyproj import Transformer  # reproject to metres — never tile in raw degrees


def load_fixes(manifest: Path) -> dict[str, tuple[float, float]]:
    # Reuse your EXIF GPS validation output: {"DJI_0001.JPG": [lat, lon], ...}
    return json.loads(manifest.read_text())


def covering_indices(coord: float, origin: float, stride: float, tile_m: float) -> range:
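    # Every tile whose [t, t + tile_m] span contains coord, normally 1-2 tiles.
    # Clamp at 0: origin is the survey minimum, so a negative-index tile would
    # only duplicate a sliver already inside tile 0.
    lo = max(0, math.ceil((coord - origin - tile_m) / stride))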
    hi = math.floor((coord - origin) / stride)
    return range(lo, hi + 1)


def split_with_overlap(fixes, tile_m=800.0, overlap=0.20, min_imgs=50):
    if not 0.10 <= overlap < 0.50:
        raise ValueError("overlap must be in [0.10, 0.50) to buffer seams without bloat")
    lats = [lat for lat, _ in fixes.values()]
    lons = [lon for _, lon in fixes.values()]

    # 1. Pick the UTM zone from the survey centroid and project every fix to metres.
    zone = int((sum(lons) / len(lons) + 180) // 6) + 1
    epsg = (32600 if sum(lats) / len(lats) >= 0 else 32700) + zone  # N: 326xx, S: 327xx
    to_utm = Transformer.from_crs(4326, epsg, always_xy=True)
    pts = {name: to_utm.transform(lon, lat) for name, (lat, lon) in fixes.items()}

    # 2. Stride < tile size means neighbouring tiles share a buffer of cross-ties.
    stride = tile_m * (1.0 - overlap)
    ox = min(x for x, _ in pts.values())
    oy = min(y for _, y in pts.values())

    chunks: dict[str, list[str]] = defaultdict(list)
    for name, (x, y) in pts.items():
        # A frame deliberately lands in every tile that overlaps its position.
        for c in covering_indices(x, ox, stride, tile_m):
            for r in covering_indices(y, oy, stride, tile_m):
                chunks[f"tile_{r:03d}_{c:03d}"].append(name)

    # 3. Drop tiles too sparse to orient; frames in a shared buffer survive in a
    #    neighbour, and verify_partition reports any that do not.
    return {tid: imgs for tid, imgs in chunks.items() if len(imgs) >= min_imgs}


if __name__ == "__main__":
    tiles = split_with_overlap(load_fixes(Path("validated_fixes.json")))
    Path("chunks.json").write_text(json.dumps(tiles, indent=2))
    print(f"{len(tiles)} tiles; sizes: {sorted(len(v) for v in tiles.values())}")

The three parameters that matter are `tile_m`, `overlap`, and `min_imgs`. Size `tile_m` so a single tile fits the solver’s RAM budget: at roughly 80–120 MB of descriptors per high-resolution RGB frame, 16 GB holds about 130–200 frames at once, so check the frame count an 800 m tile produces at your flight density and shrink `tile_m` if it exceeds that. Keep `overlap` at 0.20–0.25: below ~0.12 the buffer carries too few cross-ties for the seam to close, which is the same convergence-starvation threshold that produces the parent guide’s `did not converge` plateau. The `min_imgs` default of 50 is a working floor below which a tile’s relative-orientation matrix tends to go rank-deficient.
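The relationship between tile size, overlap, and the shared buffer is easier to see on a single axis. This sketch re-implements the covering arithmetic with hypothetical helper names (`tile_spans`, `tiles_containing`) and distances measured from the survey origin; it illustrates the numbers only and is not a replacement for the partitioner.

```python
import math


def tile_spans(tile_m, overlap, n_tiles):
    """Start and end (metres from origin) of the first n_tiles tiles on one axis."""
    stride = tile_m * (1.0 - overlap)
    return [(k * stride, k * stride + tile_m) for k in range(n_tiles)]


def tiles_containing(dist_m, tile_m, overlap):
    """Indices of every tile whose span contains a point dist_m from the origin."""
    stride = tile_m * (1.0 - overlap)
    # Clamp at 0: this sketch assumes no tile starts before the origin.
    lo = max(0, math.ceil((dist_m - tile_m) / stride))
    hi = math.floor(dist_m / stride)
    return list(range(lo, hi + 1))


tile_m, overlap = 800.0, 0.20
print("buffer width (m):", tile_m - tile_m * (1.0 - overlap))
for start, end in tile_spans(tile_m, overlap, 3):
    print(f"tile spans {start:.0f} to {end:.0f} m")
for d in (300.0, 700.0, 1300.0):
    print(f"{d:.0f} m -> tiles {tiles_containing(d, tile_m, overlap)}")
```

With the defaults the stride is 640 m and each seam is backed by a 160 m band: a frame 700 m from the origin belongs to tiles 0 and 1, while one at 300 m belongs only to tile 0.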

Edge-case matrix

Real survey folders are messier than the happy path. The table below lists the input variants that most often corrupt a partition and how the routine above handles each.

Input variant	Symptom if unhandled	Handling here
Frame with null / missing GPS	`KeyError` or silent `(0, 0)` placement dragging a tile to null island	Excluded upstream by the EXIF GPS validation pass — only validated fixes enter `load_fixes`
Tight isolated cluster below `min_imgs`	Orphan tile with a singular match matrix; relative orientation fails	Tile dropped at step 3; frames inside a buffer persist in the overlapping neighbour, and `verify_partition` reports any that do not
Linear corridor (powerline / road) survey	Long thin extent produces many near-empty tiles	Buffered stride keeps consecutive frames co-tiled; sparse end tiles fall below `min_imgs` and are dropped, leaving their buffered frames to the neighbouring tile
Survey straddling a UTM zone boundary	Metre coordinates shear at the seam, splitting one tile in two	Single centroid-derived zone keeps the whole block in one consistent metric frame
Southern-hemisphere mission	Northern-zone EPSG yields negative northings; code that assumes positive Y or drops the sign mirrors the layout	Hemisphere test selects `327xx` so northings stay positive
One oversized tile exceeding RAM	`MemoryError` during descriptor extraction	Lower `tile_m`; the same fixes re-tile finer with no other change
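Two of the rows above, the zone boundary and the hemisphere, can be checked before tiling. The sketch below isolates the EPSG arithmetic in hypothetical helpers (`utm_epsg`, `zones_spanned`) and assumes plain WGS 84 latitude and longitude in degrees. It ignores the special-case UTM zones around Norway and Svalbard, and the sample coordinates are illustrative.

```python
def utm_zone(lon):
    """Standard 6-degree UTM zone number for a longitude in degrees."""
    return min(int((lon + 180) // 6) + 1, 60)  # lon == 180 folds into zone 60


def utm_epsg(lat, lon):
    """EPSG code for WGS 84 / UTM: 326xx north of the equator, 327xx south."""
    return (32600 if lat >= 0 else 32700) + utm_zone(lon)


def zones_spanned(lons):
    """Sorted UTM zone numbers touched by a collection of longitudes."""
    return sorted({utm_zone(lon) for lon in lons})


print(utm_epsg(37.8, -122.4))      # northern hemisphere, zone 10
print(utm_epsg(-33.9, 151.2))      # southern hemisphere, zone 56
print(zones_spanned([5.9, 6.1]))   # fixes on both sides of the 6 degree E boundary
```

If `zones_spanned` returns more than one zone, the block straddles a boundary. The partitioner still projects everything into the centroid's zone, which keeps one metric frame at the cost of somewhat more projection distortion toward the far edge.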

Verify the split before you queue it

Never hand a manifest to the solver without asserting two invariants: no frame was silently lost, and neighbouring tiles genuinely share ties. The check below fails loudly if either breaks.

def verify_partition(fixes, tiles):
    assigned = {name for imgs in tiles.values() for name in imgs}
    dropped = set(fixes) - assigned
    # Every frame must survive in at least one retained tile.
    assert not dropped, f"{len(dropped)} frames orphaned, e.g. {sorted(dropped)[:5]}"

    # Overlap is real only if some tile pairs actually share frames.
    members = [set(v) for v in tiles.values()]
    shared = sum(bool(a & b) for i, a in enumerate(members) for b in members[i + 1:])
    print(f"{len(assigned)}/{len(fixes)} frames retained across {len(tiles)} tiles; "
          f"{shared} tile pairs share cross-ties")
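    # A single retained tile has no seams, so only multi-tile partitions must overlap.
    assert shared > 0 or len(tiles) < 2, "No tiles overlap; bundle adjustment will gap at every seam"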

A healthy result retains every validated frame and reports a non-zero count of overlapping tile pairs. If `dropped` is non-empty, tiles fell below `min_imgs`: either `tile_m` is too small for the flight density or `min_imgs` is too high. If `shared` is zero across multiple tiles, `overlap` collapsed to zero and you have rebuilt the naive split you were trying to avoid.
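The pair count says only that some overlap exists. A per-seam breakdown, sketched below, can point to the specific tile pairs whose buffer is thin. The helper `weak_seams` and its `min_shared` default of 20 are hypothetical; pick a threshold from what your solver needs to close a seam.

```python
from itertools import combinations


def weak_seams(tiles, min_shared=20):
    """List (tile_a, tile_b, n_shared) for overlapping pairs below min_shared frames."""
    members = {tid: set(imgs) for tid, imgs in tiles.items()}
    weak = []
    for (a, frames_a), (b, frames_b) in combinations(members.items(), 2):
        n_shared = len(frames_a & frames_b)
        if 0 < n_shared < min_shared:  # pairs with no overlap are not neighbours
            weak.append((a, b, n_shared))
    return sorted(weak, key=lambda seam: seam[2])  # thinnest seam first


# Toy manifest: tiles "a" and "b" share one frame, "c" is isolated.
toy = {"a": ["f1", "f2", "f3"], "b": ["f3", "f4"], "c": ["f5"]}
for a, b, n in weak_seams(toy, min_shared=2):
    print(f"{a} / {b}: only {n} shared frame(s)")
```

Diagonal neighbours share only a small corner of the buffer, so expect them to appear in this list even for a healthy partition.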

When to escalate

This spatial partitioner is the right tool when the block is simply too large for one solve. Move past it and back to the parent workflow when:

The merged tiles still drift after a global pass. That is no longer a splitting problem but a residual-distribution one — return to optimizing bundle adjustment with Python and rebalance against ground control before re-exporting.
Individual tiles fit RAM but each tile’s dense stage still OOM-kills. Tiling the alignment graph does not bound the densification stage; pair this with reducing RAM usage during dense matching.
Tiles share frames but not a coordinate frame. If your fixes were never normalised to one CRS, fix that first with managing coordinate reference systems in GDAL; no buffer can rescue tiles solved in different projections.
