Cloud Storage Configuration¶
Cloud object stores like AWS S3, Google Cloud Storage, and the Open Storage Network (OSN) provide scalable storage for OVRO-LWA datasets, but accessing Zarr data from cloud storage requires understanding HTTP request patterns, metadata consolidation, caching strategies, and provider-specific configuration.
This guide explains how cloud Zarr access works internally, provides configuration examples for major cloud providers, and documents optimization strategies for reducing latency and costs.
How Cloud Zarr Access Works¶
The fsspec Abstraction Layer¶
OVRO-LWA's open_dataset() uses the
fsspec library as a filesystem
abstraction layer sitting between xarray and cloud storage:
User Code
│
├─ open_dataset("s3://bucket/data.zarr")
│
└─> fsspec.filesystem("s3", ...)
│
├─ s3fs (AWS S3)
├─ gcsfs (Google Cloud Storage)
└─ http (HTTPS URLs)
│
└─> Cloud Storage (S3, GCS, OSN)
When you call open_dataset() with a cloud URL, the function:
- Detects the protocol (
s3://,gs://,https://) - Constructs an fsspec filesystem with
storage_options(credentials, endpoint) - Creates a mapper object via
fs.get_mapper(path)that presents cloud objects as a dict-like interface - Passes the mapper to
xr.open_zarr()which reads metadata and constructs Dask task graphs - On
.compute(), Dask workers fetch chunks via individual HTTP GET requests
See src/ovro_lwa_portal/io.py:622-656 for the mapper construction logic.
One Chunk = One HTTP Request¶
In Zarr v2 (the version used by OVRO-LWA), each chunk is stored as a separate object in cloud storage. There are no range requests within chunks — fetching any part of a chunk requires downloading the entire chunk object.
This one-to-one mapping has critical implications:
- A dataset with 10,000 chunks requires 10,000 HTTP GET requests to read fully
- Each request incurs ~50-100ms latency regardless of chunk size
- Small chunks (< 1 MB) waste time on HTTP overhead rather than data transfer
- Large chunks (> 100 MB) waste bandwidth when accessing subsets
See Chunking Fundamentals for the 10-100 MB compressed chunk size sweet spot that balances these factors.
Metadata Requests: The N+1 Problem¶
Before reading any chunk data, Zarr must fetch metadata describing array shapes, chunk layouts, compression settings, and data types. Without optimization, this creates an "N+1 problem":
Without consolidated metadata:
1. GET .zgroup (root group metadata)
2. GET SKY/.zarray (SKY array metadata)
3. GET SKY/.zattrs (SKY attributes)
4. GET BEAM/.zarray (BEAM array metadata)
5. GET BEAM/.zattrs (BEAM attributes)
... one request per array + attributes
For a store with 10 arrays, this is 20+ HTTP requests before any actual data is read. With typical 50-100ms latency per request, metadata alone adds 1-2 seconds of overhead.
With consolidated metadata:
A single request fetches all metadata at once, reducing open time from seconds to milliseconds. Consolidated metadata is essential for production cloud deployments.
How xr.open_zarr() Constructs the Mapper¶
When you call open_dataset(), it internally calls xr.open_zarr() with a
configured mapper. The process:
# Simplified from io.py:622-656
import fsspec
# Create filesystem with credentials
fs = fsspec.filesystem("s3", key="...", secret="...",
client_kwargs={"endpoint_url": "https://..."})
# Create mapper: dict-like interface to cloud objects
path = "bucket/prefix/data.zarr"
store = fs.get_mapper(path)
# Pass to xarray
ds = xr.open_zarr(store, consolidated=True, chunks="auto")
The mapper presents cloud storage as a Python dictionary where keys are object
paths like SKY/0.0.0.0.1 and values are the compressed chunk bytes.
Consolidated Metadata¶
What It Is¶
Consolidated metadata is a single .zmetadata file containing all array
metadata, group metadata, and attributes for an entire Zarr store. Instead of
dozens of small metadata requests, a single GET retrieves everything.
The .zmetadata file is a JSON document with this structure:
{
"metadata": {
".zgroup": {...},
"SKY/.zarray": {...},
"SKY/.zattrs": {...},
"BEAM/.zarray": {...},
"BEAM/.zattrs": {...}
},
"zarr_consolidated_format": 1
}
Why It Matters¶
Without consolidated metadata, opening a Zarr store from cloud storage requires one HTTP request per array plus one per group. For a store with 10 arrays, this is 20+ requests totaling 1-2 seconds of latency before any data is read.
With consolidated metadata, opening the same store requires one request and completes in ~50-100ms.
Performance impact example:
- Store with 15 arrays, no consolidated metadata: ~1.5s to open (15 arrays × 2 files × 50ms)
- Same store with consolidated metadata: ~50ms to open (1 request)
For interactive workflows where datasets are opened repeatedly, this 30× speedup in open time significantly improves user experience.
Checking for Consolidated Metadata¶
Verify whether a store has consolidated metadata:
import zarr
# For cloud stores, provide storage_options
store = zarr.open(
"s3://bucket/data.zarr",
mode="r",
storage_options={"key": "...", "secret": "..."}
)
# Check for .zmetadata file
if ".zmetadata" in store.store:
print("Consolidated metadata present")
else:
print("No consolidated metadata — performance will be degraded")
For local stores, check the filesystem directly:
Creating Consolidated Metadata¶
If a store lacks consolidated metadata, create it with
zarr.consolidate_metadata():
import zarr
# For local stores
zarr.consolidate_metadata("/path/to/data.zarr")
# For cloud stores (requires write access)
store = zarr.open(
"s3://bucket/data.zarr",
mode="r+",
storage_options={"key": "...", "secret": "..."}
)
zarr.consolidate_metadata(store.store)
Write Access Required
Creating consolidated metadata requires write access to the store. For read-only stores or stores you don't control, contact the data provider to request consolidated metadata.
How xr.open_zarr(consolidated=True) Works¶
OVRO-LWA's open_dataset() passes consolidated=True to xr.open_zarr() by
default. This enables consolidated metadata loading with graceful fallback:
- xarray attempts to read
.zmetadata - If
.zmetadataexists, all metadata is loaded from this single file - If
.zmetadatais missing, xarray falls back to reading individual.zarrayand.zattrsfiles
The fallback behavior means consolidated=True never causes errors, but
performance degrades when .zmetadata is absent.
OVRO-LWA Ingest Pipeline Behavior¶
The OVRO-LWA FITS-to-Zarr conversion pipeline in
ovro_lwa_portal.ingest.fits_to_zarr_xradio writes Zarr stores via xradio's
write_image() function. As of the current implementation, consolidated
metadata is not explicitly created during ingest.
Verify Consolidated Metadata Before Production Deployment
Always verify that a Zarr store has consolidated metadata before deploying
it for production use or sharing with users. Open time can be 30× slower
without it. Create consolidated metadata manually using
zarr.consolidate_metadata() if needed.
Provider-Specific Configuration¶
AWS S3¶
AWS S3 is the most widely used cloud object store. Authentication supports environment variables, IAM roles, or explicit credentials.
Basic S3 access:
import ovro_lwa_portal
ds = ovro_lwa_portal.open_dataset(
"s3://bucket-name/observation.zarr",
storage_options={
"key": "<AWS_ACCESS_KEY_ID>",
"secret": "<AWS_SECRET_ACCESS_KEY>",
}
)
Authentication methods:
- Explicit credentials (shown above): Pass
keyandsecretinstorage_options - Environment variables: Set
AWS_ACCESS_KEY_IDandAWS_SECRET_ACCESS_KEY, then omitstorage_options - IAM roles: When running on EC2/ECS/Lambda, omit
storage_optionsand credentials are automatically retrieved from the instance metadata service - Shared credentials file: Configure
~/.aws/credentialsand omitstorage_options
Specifying region:
S3 buckets exist in specific regions. Accessing from the same region
significantly reduces latency per request compared to cross-region access —
choose the region closest to where your compute runs. Specify region in
storage_options:
ds = ovro_lwa_portal.open_dataset(
"s3://bucket-name/observation.zarr",
storage_options={
"key": "...",
"secret": "...",
"client_kwargs": {"region_name": "us-west-2"},
}
)
If your bucket is in us-east-1 but you're accessing from us-west-2, expect
higher latency. Use aws s3api get-bucket-location --bucket bucket-name to
check bucket region.
Google Cloud Storage¶
Google Cloud Storage (GCS) uses different authentication but follows similar patterns.
Basic GCS access:
import ovro_lwa_portal
ds = ovro_lwa_portal.open_dataset(
"gs://bucket-name/observation.zarr",
storage_options={
"token": "/path/to/service-account-key.json"
}
)
Authentication methods:
- Service account key file (shown above): Download a JSON key from GCP
console and pass the path in
storage_options - Application default credentials: Set
GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.jsonenvironment variable and omitstorage_options - Compute Engine default service account: When running on GCP Compute
Engine, omit
storage_optionsto use the instance service account
Alternative URL format:
GCS supports both gs:// and gcs:// protocols:
# Both are equivalent
ds = ovro_lwa_portal.open_dataset("gs://bucket/data.zarr")
ds = ovro_lwa_portal.open_dataset("gcs://bucket/data.zarr")
Open Storage Network (OSN)¶
OSN provides S3-compatible object storage for academic research. OSN endpoints differ from AWS S3 in endpoint URLs and latency characteristics.
Basic OSN access:
import ovro_lwa_portal
ds = ovro_lwa_portal.open_dataset(
"s3://bucket-name/observation.zarr",
storage_options={
"key": "OSN_ACCESS_KEY",
"secret": "OSN_SECRET_KEY",
"client_kwargs": {
"endpoint_url": "https://caltech1.osn.mghpcc.org"
},
}
)
OSN endpoints:
Different OSN pods have different endpoint URLs. Common endpoints:
https://uma1.osn.mghpcc.org(UMass Amherst)https://caltech1.osn.mghpcc.org(Caltech)https://renc1.osn.mghpcc.org(RENCI)
Verify the correct endpoint for your bucket with the OSN allocation manager or bucket owner.
DOI Resolution to OSN:
OVRO-LWA datasets archived on OSN may be assigned DOIs. The resolve_source()
function handles DOI → HTTPS → S3 resolution automatically:
from ovro_lwa_portal import resolve_source
# Resolve DOI to see the full chain
info = resolve_source(
"doi:10.33569/9wsys-h7b71",
production=False,
storage_options={"key": "...", "secret": "..."}
)
print(info["resolved_url"]) # HTTPS OSN URL
print(info["s3_url"]) # S3 URL (if credentials provided)
print(info["endpoint"]) # OSN endpoint
When credentials are provided, open_dataset() automatically converts OSN HTTPS
URLs to S3 URLs for better performance (S3 protocol is faster than HTTPS for
multi-object access). See src/ovro_lwa_portal/io.py:168-213 for the conversion
logic in _convert_osn_https_to_s3().
OSN Latency Characteristics
OSN endpoints have different latency characteristics than AWS S3 proper due to different geographic routing and network paths. The actual difference is not quantified — benchmark your specific OSN endpoint using the techniques in Benchmarking Performance to establish baseline expectations for your infrastructure.
Caching Strategies¶
Caching reduces repeated HTTP requests by storing previously fetched chunks locally. Multiple caching strategies exist, each suited for different access patterns.
fsspec simplecache — Transparent Per-Chunk Caching¶
The simplest caching strategy: fsspec automatically caches chunks as they're fetched, storing them in a local directory.
How to configure:
import fsspec
import ovro_lwa_portal
# Create a caching filesystem
fs = fsspec.filesystem(
"simplecache",
target_protocol="s3",
target_options={
"key": "...",
"secret": "...",
"client_kwargs": {"endpoint_url": "https://..."},
},
cache_storage="/tmp/zarr_cache",
)
# Create mapper from the caching filesystem
mapper = fs.get_mapper("bucket-name/observation.zarr")
# Open with xarray
import xarray as xr
ds = xr.open_zarr(mapper, consolidated=True, chunks="auto")
When to use:
- Repeatedly accessing the same chunks within a single session
- Interactive analysis where you load, analyze, and reload the same data regions
- Memory constraints prevent loading full dataset with
chunks=None
Behavior:
- First access: Fetches chunk from cloud, stores in
cache_storage - Subsequent access: Reads chunk from local cache (no HTTP request)
- Cache persists until manually deleted or system reboot (if using
/tmp)
Limitations:
- Cache is not shared across processes
- No automatic eviction — cache grows unbounded
- Cache directory must have sufficient disk space
fsspec filecache — Persistent Local Cache¶
Similar to simplecache but with persistent storage that survives restarts and
smarter cache management.
How to configure:
import fsspec
fs = fsspec.filesystem(
"filecache",
target_protocol="s3",
target_options={"key": "...", "secret": "..."},
cache_storage="/home/user/.cache/ovro_lwa_zarr",
expiry_time=86400, # Cache expires after 24 hours (seconds)
)
mapper = fs.get_mapper("bucket-name/observation.zarr")
When to use:
- Repeated analyses on the same dataset over multiple days
- Workflows where the dataset doesn't change but you run different queries
- Shared cache directory across multiple scripts/notebooks
Behavior:
- Chunks cached to
cache_storagewith metadata (timestamp, size) - Cache survives system restarts
- Chunks older than
expiry_timeare automatically re-fetched on next access - Cache directory can be shared across processes (filesystem locking)
Configuration options:
expiry_time: Seconds before cached chunks are considered stale (default: 604800 = 7 days)cache_check: How often to check for cache validity (default: 60s)same_names: If True, cache files use original names (default: False uses hashed names)
Full Local Download¶
For the ultimate in read performance, download the entire Zarr store locally
using aws s3 sync, gsutil rsync, or rclone.
Using aws CLI:
# Sync entire store to local directory
aws s3 sync s3://bucket-name/observation.zarr /data/local/observation.zarr
# Configure endpoint for OSN
aws s3 sync s3://bucket-name/observation.zarr /data/local/observation.zarr \
--endpoint-url https://caltech1.osn.mghpcc.org
Using rclone (works for all cloud providers):
# Configure remote in ~/.config/rclone/rclone.conf first
rclone sync remote:bucket-name/observation.zarr /data/local/observation.zarr
When to use:
- Dataset fits on local storage (typically < 100 GB)
- Workflow requires reading most or all chunks
- Performance is critical (local NVMe SSD → 500+ MB/s vs cloud → 10-50 MB/s)
- Network is unreliable or slow
After downloading:
import ovro_lwa_portal
# Read from local copy — no cloud access
ds = ovro_lwa_portal.open_dataset("/data/local/observation.zarr")
Decision Guide¶
Choose a caching strategy based on your access pattern and storage constraints:
| Access Pattern | Recommended Strategy | Rationale |
|---|---|---|
| One-time exploratory analysis | No caching | Overhead not worth it for single access |
| Interactive session, same data | simplecache |
Fast repeated access within session |
| Daily workflow, same dataset | filecache |
Persistent cache survives restarts |
| Processing all chunks, local space | Full download | Maximum performance, no network variability |
| Large dataset, small subsets | simplecache |
Cache only the chunks you use |
| Shared analysis across team members | Full download | One download benefits multiple users |
| Dataset > local storage capacity | filecache |
Selective caching with automatic eviction |
| Unreliable network | Full download | Eliminate network dependency for critical work |
Concurrent Access Tuning¶
How Dask Parallelizes Chunk Fetches¶
When you call .compute() on a Dask array, Dask creates a task graph where each
task corresponds to fetching and processing one or more chunks. Dask workers
execute these tasks in parallel, fetching multiple chunks concurrently.
Parallelism example:
import ovro_lwa_portal
from dask.distributed import Client
# Start local cluster with 4 workers
client = Client(n_workers=4)
ds = ovro_lwa_portal.open_dataset(
"s3://bucket/data.zarr",
storage_options={"key": "...", "secret": "..."},
chunks={"time": 1, "frequency": 1, "l": 1024, "m": 1024}
)
# Compute triggers parallel chunk fetches across 4 workers
result = ds.SKY.isel(time=slice(0, 10)).mean(dim=["l", "m"]).compute()
Each of the 4 workers fetches chunks in parallel, issuing multiple HTTP requests simultaneously. This parallelism is critical for good cloud performance: a single worker fetching chunks serially would take ~50ms × N chunks, while 4 workers reduce this to ~50ms × (N / 4).
s3fs max_concurrency Parameter¶
The s3fs library (underlying fsspec for S3 access) limits the number of
concurrent requests per filesystem instance. The default is max_concurrency=1,
meaning requests are serialized even when Dask has multiple workers.
Increasing concurrency:
import ovro_lwa_portal
ds = ovro_lwa_portal.open_dataset(
"s3://bucket/data.zarr",
storage_options={
"key": "...",
"secret": "...",
"max_concurrency": 10,
}
)
Guidelines:
- For single-worker Dask:
max_concurrency=5-10improves throughput - For multi-worker Dask (4+ workers):
max_concurrency=1per worker (each worker gets its own filesystem instance) - Avoid
max_concurrency > 20without benchmarking — diminishing returns and potential rate limiting
fsspec Retry and Timeout Configuration¶
Cloud networks are unreliable. Configure retries and timeouts to handle transient failures gracefully.
Configuring retries:
import ovro_lwa_portal
ds = ovro_lwa_portal.open_dataset(
"s3://bucket/data.zarr",
storage_options={
"key": "...",
"secret": "...",
"config_kwargs": {
"retries": {"max_attempts": 5},
},
}
)
Configuring timeouts:
ds = ovro_lwa_portal.open_dataset(
"s3://bucket/data.zarr",
storage_options={
"key": "...",
"secret": "...",
"client_kwargs": {
"config": {
"connect_timeout": 10, # seconds
"read_timeout": 60, # seconds
}
},
}
)
When to adjust:
- Unreliable network: Increase
max_attemptsto 5-10 - Fast network, small chunks: Decrease
read_timeoutto 10-20s to fail fast - Slow network, large chunks: Increase
read_timeoutto 120-300s
Balancing Concurrency vs. Rate Limiting¶
Cloud providers throttle requests at high rates to prevent abuse and ensure fair resource allocation. Excessive concurrency triggers rate limiting errors.
Common rate limits:
- AWS S3: 5,500 GET requests per second per prefix
- Google Cloud Storage: 5,000 requests per second per bucket
- OSN: Provider-dependent, typically lower than commercial clouds
Symptoms of rate limiting:
- HTTP 503 "Slow Down" errors
- Exponential backoff warnings in logs
- Degraded performance at high worker counts
Avoid Excessive Concurrency Without Benchmarking
Setting max_concurrency too high or using too many Dask workers can
trigger rate limiting, degrading performance below single-worker baseline.
Always benchmark with
the techniques in Benchmarking Performance to
verify that increased concurrency improves throughput.
Mitigation strategies:
- Reduce Dask worker count (fewer concurrent requests)
- Implement request throttling:
dask.config.set({"distributed.worker.connections.incoming": 2}) - Use caching to reduce request volume (see Caching Strategies above)
- Consolidate small chunks into larger chunks to reduce total request count (see Chunking Fundamentals)
Cost Considerations¶
Cloud storage pricing includes three components: storage fees, request fees, and data transfer (egress) fees. Zarr access patterns directly affect request and egress costs.
GET Request Pricing¶
Cloud providers charge per request. For AWS S3 as of publication:
- GET request: $0.0004 per 1,000 requests
- PUT request: $0.005 per 1,000 requests (10× more expensive than GET)
Prices Change — Verify Current Pricing
The prices listed here are examples from AWS S3 standard tier as of early 2024. Verify current pricing at AWS S3 Pricing, GCS Pricing, or your provider's documentation.
Example cost calculation:
- Dataset with 10,000 chunks
- Full dataset scan: 10,000 GET requests
- Cost: 10,000 / 1,000 × $0.0004 = $0.004 (less than half a cent)
Request costs are negligible for individual analyses but scale with usage:
- 1,000 full scans: $4
- 10,000 full scans: $40
For large-scale batch processing or public datasets with thousands of users, request costs accumulate.
Data Transfer (Egress) Costs¶
Egress fees apply when transferring data out of the cloud provider's network. For AWS S3:
- First 100 GB per month: Free
- Next 10 TB per month: $0.09 per GB
- Next 40 TB per month: $0.085 per GB
Example egress cost:
- Dataset: 30 GB uncompressed, 10 GB compressed
- Full download: 10 GB egress
- Cost: $0 (within free tier) or $0.90 if over free tier
Egress costs dominate for large datasets or frequent full downloads.
How Chunk Count Affects Costs¶
More chunks = more HTTP requests = higher request costs.
Chunking strategy cost impact:
- Dataset: 30 GB, chunk size 10 MB → 3,000 chunks
- Full scan: 3,000 GET requests → $0.0012
- Same dataset, chunk size 100 MB → 300 chunks
- Full scan: 300 GET requests → $0.00012 (10× cheaper)
Larger chunks reduce request costs but may waste egress bandwidth if you only need subsets. Balance chunk size based on typical access patterns (see Chunking Fundamentals for the 10-100 MB sweet spot).
Cost Minimization Strategies¶
- Use consolidated metadata — Saves 10-50 requests per dataset open
- Enable caching — Repeated chunk access costs zero after first fetch
- Optimize chunk sizes — Larger chunks reduce request counts
- Download locally for intensive work — One-time egress cost, unlimited local access
- Process in-cloud — Run analysis on cloud VMs in the same region (no egress fees)
OSN Does Not Charge Egress Fees for Academic Use
The Open Storage Network (OSN) does not charge egress fees for academic and research use, a significant cost advantage compared to commercial cloud providers. OVRO-LWA users accessing data from OSN endpoints avoid the egress costs that would apply on AWS S3 or GCS, making OSN particularly attractive for large dataset analysis and public data distribution.
Troubleshooting Cloud Access¶
Connection Timeout¶
Symptom:
Causes and solutions:
- Incorrect endpoint URL — Verify endpoint matches bucket location:
from ovro_lwa_portal import resolve_source
info = resolve_source("s3://bucket/data.zarr", storage_options={...})
print(info["endpoint"]) # Check resolved endpoint
- Network firewall blocking cloud access — Test with
curl:
If curl fails, check firewall rules or proxy settings.
- DNS resolution failure — Verify endpoint DNS:
Fix:
- Verify
endpoint_urlinstorage_options["client_kwargs"] - Test network connectivity outside Python
- Increase
connect_timeoutif network is slow
403 Forbidden¶
Symptom:
Causes and solutions:
- Invalid credentials — Access key or secret key is incorrect:
# Verify credentials by listing buckets
import boto3
s3 = boto3.client('s3', endpoint_url='https://...',
aws_access_key_id='...', aws_secret_access_key='...')
s3.list_buckets() # Should succeed if credentials valid
-
Credentials expired — Temporary credentials have expiration times. Check expiry and regenerate if needed.
-
Insufficient bucket permissions — Credentials are valid but lack permission for this bucket. Contact bucket owner to verify IAM policy grants
s3:GetObjectpermission.
Fix:
- Verify
keyandsecretinstorage_options - Check credential expiration dates
- Confirm bucket policy allows your account
NoSuchBucket¶
Symptom:
Causes and solutions:
- Typo in bucket name — Verify exact bucket name (case-sensitive):
from ovro_lwa_portal import resolve_source
info = resolve_source("s3://bucket/data.zarr", storage_options={...})
print(info["bucket"]) # Check parsed bucket name
- Wrong endpoint — Bucket exists on different OSN pod or AWS region:
- OSN buckets are pod-specific (uma1 vs caltech1)
-
AWS buckets are region-specific (us-east-1 vs us-west-2)
-
Bucket not yet created — Verify bucket exists with cloud provider console
Fix:
- Double-check bucket name for typos
- Verify
endpoint_urlmatches bucket location - Use
resolve_source()to debug DOI → URL resolution (seesrc/ovro_lwa_portal/io.py:374-484)
Slow Reads¶
Symptom:
Operations take minutes instead of seconds, or throughput is < 1 MB/s.
Causes and solutions:
- Misaligned chunks — Read chunks don't match write chunks, causing excess HTTP requests:
import zarr
store = zarr.open("s3://bucket/data.zarr", mode="r", storage_options={...})
print(store.SKY.chunks) # Check on-disk chunk shape
Compare to your chunks= parameter in open_dataset(). See
Chunking Fundamentals for alignment principles.
- Too many small chunks — Each chunk incurs ~50-100ms latency:
# Count chunks
ds = ovro_lwa_portal.open_dataset("s3://bucket/data.zarr", chunks=...)
n_chunks = ds.SKY.data.npartitions
print(f"Chunks to fetch: {n_chunks}")
If n_chunks > 1000 for a simple operation, chunk sizes are likely too
small.
- Wrong region — Accessing S3 bucket from different region adds 50-100ms
per request. Specify
region_nameinstorage_options:
ds = ovro_lwa_portal.open_dataset(
"s3://bucket/data.zarr",
storage_options={
"key": "...",
"secret": "...",
"client_kwargs": {"region_name": "us-west-2"},
}
)
- No consolidated metadata — Opening store requires dozens of metadata
requests. Verify
.zmetadataexists (see Consolidated Metadata section).
Fix:
- Align read chunks with on-disk chunks (see Read-Time Optimization for recipes)
- Use caching for repeated access (see Caching Strategies section)
- Benchmark with techniques in Benchmarking Performance to isolate bottleneck
Using resolve_source() for Debugging¶
The resolve_source() function exposes the full DOI → URL → S3 resolution chain
without loading data, useful for debugging cloud access issues:
from ovro_lwa_portal import resolve_source
info = resolve_source(
"doi:10.33569/9wsys-h7b71",
production=False,
storage_options={"key": "...", "secret": "..."}
)
print("Source type:", info["source_type"]) # "doi"
print("Original source:", info["original_source"]) # "doi:10.33569/..."
print("Resolved URL:", info["resolved_url"]) # HTTPS OSN URL
print("Final URL:", info["final_url"]) # S3 URL if converted
print("S3 URL:", info["s3_url"]) # S3 URL
print("Endpoint:", info["endpoint"]) # OSN endpoint
print("Bucket:", info["bucket"]) # Bucket name
print("Path:", info["path"]) # Path within bucket
Use this to verify:
- DOI resolution succeeds
- HTTPS → S3 conversion applies when expected
- Endpoint URL matches bucket location
- Bucket and path are parsed correctly
See src/ovro_lwa_portal/io.py:374-484 for implementation details.
See Also¶
- Chunking Fundamentals - One chunk = one HTTP request, 10-100 MB sweet spot
- Read-Time Optimization - Chunk configuration recipes for different access patterns
- Benchmarking Performance - Measure and optimize cloud read performance
- Getting Started: Loading Data - Basic
open_dataset()usage