Write Path Pipeline: Chunking During FITS to Zarr Conversion¶
When converting OVRO-LWA FITS files to Zarr format, chunking decisions are made at write time and persist for the lifetime of the dataset. This guide explains how the ingest pipeline handles chunking, where to configure it, and how to verify the results.
The Ingest Pipeline Flow¶
The FITS to Zarr conversion follows a multi-stage pipeline that processes observations grouped by time and frequency:
flowchart LR
A[FITS Files] --> B[fix_fits_headers]
B --> C[_load_for_combine]
C --> D[_combine_time_step]
D --> E[_write_or_append_zarr]
E --> F[Zarr Store]
style A fill:#e1f5ff
style F fill:#d4edda
style C fill:#fff4e1
Pipeline stages:
-
fix_fits_headers(): Normalizes FITS headers to ensure xradio compatibility by materializing BSCALE/BZERO scaling and adding required keywords.
-
_load_for_combine() (src/ovro_lwa_portal/fits_to_zarr_xradio.py:241): Reads individual FITS files, computes RA/Dec coordinates from the WCS, and applies spatial chunking via
xds.chunk({"l": chunk_lm, "m": chunk_lm})(line 316). -
_combine_time_step() (line 356): Merges multiple frequency subbands for a single time step using
xr.combine_by_coords(), clearing encoding metadata to prevent conflicts (line 416). -
_write_or_append_zarr() (line 445): Writes the combined dataset to disk using xradio's
write_image()function. For appends, it reads the existing store lazily, concatenates along the time dimension, and writes atomically to a temporary path before swapping into place. -
Zarr Store: The final Zarr directory contains chunked arrays stored as individual compressed blobs, with metadata in
.zarrayand.zattrsfiles.
Where chunking happens: The chunking decision is made in
_load_for_combine() at line 315–316, where the chunk_lm parameter controls the
spatial tile size. Time, frequency, and polarization dimensions are not
explicitly chunked at this stage and inherit defaults from xradio's
write_image() behavior.
The chunk_lm Parameter¶
The chunk_lm parameter controls the chunk size for the l and m spatial
dimensions, defining the size of spatial tiles that become individual chunks on
disk.
Default value: 1024 (creates 1024×1024 spatial tiles)
Setting via CLI:
# Use default chunking (1024×1024)
ovro-ingest convert /path/to/fits /path/to/output
# Custom chunk size (2048×2048 tiles)
ovro-ingest convert /path/to/fits /path/to/output --chunk-lm 2048
# Disable spatial chunking (single chunk per spatial frame)
ovro-ingest convert /path/to/fits /path/to/output --chunk-lm 0
Setting via Python API:
from pathlib import Path
from ovro_lwa_portal.ingest import FITSToZarrConverter
from ovro_lwa_portal.ingest.core import ConversionConfig
# Default chunking
config = ConversionConfig(
input_dir=Path("/path/to/fits"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
)
# Custom chunk size
config = ConversionConfig(
input_dir=Path("/path/to/fits"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=2048, # 2048×2048 spatial tiles
)
# Disable spatial chunking
config = ConversionConfig(
input_dir=Path("/path/to/fits"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=0, # No spatial chunking
)
converter = FITSToZarrConverter(config)
result = converter.convert()
What chunk_lm=0 does: Setting chunk_lm=0 disables spatial chunking
entirely. The conditional check at
src/ovro_lwa_portal/fits_to_zarr_xradio.py:315 skips the xds.chunk() call,
leaving spatial dimensions as single monolithic chunks. This configuration is
not recommended for cloud storage access, as it forces reading the entire
4096×4096 spatial grid (64 MB uncompressed) even when analyzing a small region.
Code reference: The chunk_lm parameter is applied in _load_for_combine():
# src/ovro_lwa_portal/fits_to_zarr_xradio.py:315-316
if chunk_lm and {"l", "m"} <= set(xds.dims):
xds = xds.chunk({"l": chunk_lm, "m": chunk_lm})
This check ensures chunking only applies when both l and m dimensions exist
in the dataset and chunk_lm is non-zero.
What Happens to Other Dimensions¶
The OVRO-LWA data model has five dimensions: time, frequency, polarization, l, and m. The ingest pipeline only explicitly chunks the spatial dimensions (l and m). Time, frequency, and polarization dimensions follow xradio's default chunking behavior.
xradio defaults (as of xradio 0.0.37):
- time: Typically chunked with size 1 (one chunk per time step)
- frequency: Typically chunked with size 1 (one chunk per frequency channel)
- polarization: Typically chunked with size 1 (one chunk per polarization)
These defaults align with the OVRO-LWA workflow, which processes observations as time-frequency grids. Each (time, frequency, polarization) combination gets its own set of spatial chunks on disk.
Implication for chunk layout: With the default chunk_lm=1024 on a
4096×4096 spatial grid, each time-frequency-polarization point generates 16
spatial chunks (4×4 tiles). For a typical OVRO-LWA dataset with 10 time steps,
48 frequency channels, and 1 polarization (Stokes I only), the total chunk count
is:
This granular chunking enables efficient parallel access for both time-series extraction (reading a fixed spatial position across time and frequency) and map generation (reading all spatial chunks for a single time-frequency point).
Encoding metadata clearing: The pipeline clears encoding metadata at
multiple points (lines 313, 416, 472) to prevent xarray from inheriting
inappropriate chunk shapes or compression settings from intermediate datasets.
The pattern xds[v].encoding = {} ensures that xradio's write_image()
function applies its own defaults rather than propagating stale metadata.
Compression at Write Time¶
The ingest pipeline does not currently configure explicit compression
settings. Compression is handled implicitly by xradio's write_image()
function, which uses Zarr's default compressor.
Current state: The encoding dictionary is explicitly cleared before writing:
This pattern appears at three locations (lines 313, 416, 472) to ensure clean encoding metadata throughout the pipeline. Without explicit compression configuration, Zarr applies its built-in Blosc compressor with default settings.
What xradio's write_image() does: Under the hood, write_image() delegates
to xarray's .to_zarr() method, which creates Zarr arrays with default
compression. The actual compression codec and settings applied by default are
not explicitly documented and may vary. To verify the compression configuration
for your specific dataset, inspect the .zarray metadata as described in the
"Inspecting Chunk Metadata" section below.
How to configure explicit compression: To control compression settings,
modify the encoding dictionary before the write_image() call. Insert this
pattern in _load_for_combine() or _write_or_append_zarr():
from numcodecs import Blosc
# Example: Configure zstd compression (level 3) with no shuffle
for v in xds.data_vars:
xds[v].encoding = {
"compressor": Blosc(cname="zstd", clevel=3, shuffle=0),
}
# Alternative: Use lz4 for faster compression/decompression
for v in xds.data_vars:
xds[v].encoding = {
"compressor": Blosc(cname="lz4", clevel=5, shuffle=2),
}
Place this configuration after the encoding clearing step but before the
write_image() call. For production deployments, experiment with different
compressors (zstd, lz4, blosclz) and levels to balance compression ratio against
read/write performance.
Append Mode and Chunk Consistency¶
The _write_or_append_zarr() function supports incremental updates by
concatenating new time steps to an existing Zarr store. This append behavior
enables processing observations in batches without rebuilding the entire
dataset.
How append works: When first_write=False, the function
(src/ovro_lwa_portal/fits_to_zarr_xradio.py:445):
- Opens the existing Zarr store lazily with
xr.open_zarr()(line 469) - Clears encoding metadata to prevent conflicts (lines 470–472)
- Concatenates the existing and new datasets along the time dimension (line 481)
- Writes the combined dataset to a temporary path (line 489)
- Atomically swaps the temporary store into place (lines 492–494)
This read-concat-rewrite pattern ensures that the source store remains readable during the append operation, avoiding data loss if the write fails mid-process.
CRITICAL: Chunk shape consistency
Zarr requires all chunks along a dimension to have the same shape. If you append
data with a different chunk_lm value than the original store, the Zarr
metadata will become inconsistent, silently corrupting the chunk layout.
!!! warning "Chunk Size Consistency Required" Always use the same chunk_lm value when appending to an existing store. Changing chunk_lm between appends will create a malformed Zarr array that may fail to read or return incorrect data.
Example of incorrect usage:
# First conversion with chunk_lm=1024
config1 = ConversionConfig(
input_dir=Path("/path/to/batch1"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=1024, # 1024×1024 tiles
)
converter1 = FITSToZarrConverter(config1)
converter1.convert()
# WRONG: Appending with different chunk_lm
config2 = ConversionConfig(
input_dir=Path("/path/to/batch2"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=2048, # ❌ Different chunk size - will corrupt store!
rebuild=False,
)
converter2 = FITSToZarrConverter(config2)
converter2.convert() # DANGER: Store now has inconsistent chunks
Correct usage:
# First conversion with chunk_lm=1024
config1 = ConversionConfig(
input_dir=Path("/path/to/batch1"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=1024,
)
converter1 = FITSToZarrConverter(config1)
converter1.convert()
# Correct: Append with same chunk_lm
config2 = ConversionConfig(
input_dir=Path("/path/to/batch2"),
output_dir=Path("/path/to/output"),
zarr_name="observations.zarr",
chunk_lm=1024, # ✓ Same chunk size as original
rebuild=False,
)
converter2 = FITSToZarrConverter(config2)
converter2.convert() # Safe: Chunk shapes remain consistent
If you need to change chunk_lm, use rebuild=True to overwrite the store
completely rather than appending.
Inspecting Chunk Metadata¶
After conversion, verify the chunk configuration by inspecting Zarr metadata
files. Each Zarr array has a .zarray JSON file containing chunk shape, dtype,
and compressor settings.
Reading .zarray directly:
import json
from pathlib import Path
# Path to the Zarr store
store_path = Path("observations.zarr")
# Read metadata for the SKY data variable
meta = json.loads((store_path / "SKY" / ".zarray").read_text())
print(f"Chunk shape: {meta['chunks']}")
print(f"Compressor: {meta['compressor']}")
print(f"Dtype: {meta['dtype']}")
print(f"Array shape: {meta['shape']}")
Example output:
Chunk shape: [1, 1, 1, 1024, 1024]
Compressor: {'id': 'blosc', 'cname': 'lz4', 'clevel': 5, 'shuffle': 1}
Dtype: <f4
Array shape: [10, 48, 2, 4096, 4096]
This output confirms that:
- Spatial dimensions (l, m) are chunked as 1024×1024 tiles
- Time, frequency, and polarization have chunk size 1
- Data type is float32 (
<f4) - Blosc compressor with lz4 codec is active
- Array has 10 time steps, 48 frequency channels, and 2 polarizations (typical OVRO-LWA observations use 1 polarization)
Programmatic inspection with zarr:
import zarr
# Open the Zarr store
store = zarr.open("observations.zarr", mode="r")
# Access the SKY array
sky = store["SKY"]
print(f"Chunk shape: {sky.chunks}")
print(f"Compressor: {sky.compressor}")
print(f"Dtype: {sky.dtype}")
print(f"Shape: {sky.shape}")
print(f"Number of chunks: {sky.nchunks}")
Calculating compressed chunk sizes:
To measure actual storage footprint, inspect the chunk files on disk:
from pathlib import Path
# Path to a specific chunk file (e.g., time=0, freq=0, pol=0, l-tile=0, m-tile=0)
chunk_file = Path("observations.zarr/SKY/0.0.0.0.0")
if chunk_file.exists():
compressed_size = chunk_file.stat().st_size
print(f"Compressed chunk size: {compressed_size / 1024:.1f} KB")
else:
print("Chunk file not found")
For cloud storage (S3, OSN), use the storage provider's CLI to inspect object sizes:
# AWS S3
aws s3 ls s3://bucket/observations.zarr/SKY/ --recursive --human-readable | head -20
# OSN (via s3cmd)
s3cmd ls s3://bucket/observations.zarr/SKY/ --recursive | head -20
Verifying chunk alignment: To ensure chunk shapes match your expectations,
compare the .zarray metadata across different data variables (SKY, BEAM) and
verify consistency:
import json
from pathlib import Path
store_path = Path("observations.zarr")
for var_name in ["SKY", "BEAM"]:
var_path = store_path / var_name / ".zarray"
if var_path.exists():
meta = json.loads(var_path.read_text())
print(f"{var_name}: chunks = {meta['chunks']}")
All data variables should report identical chunk shapes if they share the same dimensions. Mismatched chunks indicate a configuration error during conversion.
See Also¶
- Chunking Fundamentals - Conceptual background on Zarr chunks and the 10-100 MB sweet spot
- FITS to Zarr Conversion - User guide for the conversion CLI and Python API
- FITS to Zarr API Reference - Low-level API documentation