| name | zarr-python |
| description | Chunked N-D arrays for cloud storage (Zarr-Python 3). Compressed arrays, parallel I/O, S3/GCS via fsspec, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines. |
| allowed-tools | Read Write Edit Bash |
| license | MIT license |
| compatibility | Requires Python 3.12+ and zarr 3.x. Cloud I/O needs zarr[remote] plus s3fs or gcsfs. Legacy Zarr v2 workflows use zarr==2.* on older Python. |
| metadata | version: "1.0" skill-author: K-Dense Inc. |
Zarr Python
Overview
Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.
Current upstream: zarr 3.2.1 (PyPI, May 2026). Docs: zarr.readthedocs.io. New arrays default to Zarr format 3; set zarr_format=2 for legacy interop. This skill is a community guide maintained by K-Dense Inc., not an official zarr-developers package.
Quick Start
Installation
uv pip install "zarr>=3.2,<4"
Requires Python 3.12+ (per PyPI metadata for zarr 3.2.x). For remote stores (S3, GCS, HTTP):
uv pip install "zarr[remote]"
uv pip install s3fs
uv pip install gcsfs
Pin zarr>=3,<4 in application dependencies. Use uv pip install "zarr==2.*" only when you must stay on Zarr-Python 2 / Python 3.10–3.11.
Basic Array Creation
import zarr
import numpy as np
z = zarr.create_array(
store="data/my_array.zarr",
shape=(10000, 10000),
chunks=(1000, 1000),
dtype="f4"
)
z[:, :] = np.random.random((10000, 10000))
data = z[0:100, 0:100]
Core Operations
Creating Arrays
Zarr provides multiple convenience functions for array creation:
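# Zeros, persisted to a local store
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4',
               store='data.zarr')
# Ones and constant-filled arrays (held in memory when no store is given)
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))
# From existing NumPy data (a separate path, so it does not collide with data.zarr)
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='data_from_numpy.zarr')
# Empty array with the same shape, chunks, and dtype as an existing one
z2 = zarr.zeros_like(z)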
Opening Existing Arrays
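# Read/write access to an existing array
z = zarr.open_array('data.zarr', mode='r+')
# Read-only access
z = zarr.open_array('data.zarr', mode='r')
# Generic open: returns an Array or a Group depending on what the store holds
z = zarr.open('data.zarr')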
Reading and Writing Data
Zarr arrays support NumPy-like indexing:
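# Writes (this snippet assumes z is a 100 x 100 array)
z[:] = 42                                      # whole array
z[0, :] = np.arange(100)                       # one row
z[10:20, 50:60] = np.random.random((10, 10))   # a rectangular block
# Reads return NumPy arrays
data = z[0:100, 0:100]
row = z[5, :]
# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]   # coordinate (pointwise) selection
z.oindex[0:10, [5, 10, 15]]        # orthogonal (outer) selection
z.blocks[0, 0]                     # select a whole chunk by its grid index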
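The difference between coordinate and orthogonal selection is easy to mix up. The sketch below mimics the two semantics on a plain nested list; the helper names `coordinate_select` and `orthogonal_select` are hypothetical and exist only for illustration. It is an analogy for what vindex and oindex return, not how Zarr implements them.

```python
# Plain-Python analogy for the two selection semantics (not Zarr internals).

def coordinate_select(grid, rows, cols):
    """Pointwise selection: pair rows[i] with cols[i] (vindex-like)."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    return [grid[r][c] for r, c in zip(rows, cols)]


def orthogonal_select(grid, rows, cols):
    """Outer selection: every requested row crossed with every requested column (oindex-like)."""
    return [[grid[r][c] for c in cols] for r in rows]


# A 20 x 20 grid whose value encodes its position: value = 100 * row + col
grid = [[100 * r + c for c in range(20)] for r in range(20)]

points = coordinate_select(grid, [0, 5, 10], [2, 8, 15])
block = orthogonal_select(grid, range(0, 10), [5, 10, 15])

print(points)                      # one value per (row, col) pair
print(len(block), len(block[0]))   # rows x columns of the outer selection
```

Three index pairs give three values under coordinate selection, while ten rows crossed with three columns give a 10 x 3 result under orthogonal selection.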
Resizing and Appending
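# Grow the array in place (here from 10000 x 10000 to 15000 x 15000)
z.resize((15000, 15000))
# Append along axis 0; the other dimensions must match the current shape
z.append(np.random.random((1000, 15000)), axis=0)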
Chunking Strategies
Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.
Chunk Size Guidelines
- Minimum chunk size: 1 MB recommended for optimal performance
- Balance: Larger chunks = fewer metadata operations; smaller chunks = better parallel access
- Memory consideration: Entire chunks must fit in memory during compression
z = zarr.zeros(
shape=(10000, 10000),
chunks=(512, 512),
dtype='f4'
)
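To check a candidate chunk shape against the 1 MB guideline, multiply the chunk's element count by the dtype's item size. The sketch below does that arithmetic; `chunk_nbytes` is a hypothetical helper, and the sizes are uncompressed.

```python
from math import prod


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(chunks) * itemsize


FLOAT32_ITEMSIZE = 4  # bytes per 'f4' element

for chunks in [(100, 100), (512, 512), (1000, 1000)]:
    size_mib = chunk_nbytes(chunks, FLOAT32_ITEMSIZE) / 2**20
    print(f"{chunks}: {size_mib:.2f} MiB")
```

With float32 data, (512, 512) comes out at exactly 1 MiB, while (100, 100) is only about 40 kB and falls well under the guideline.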
Aligning Chunks with Access Patterns
Critical: Chunk shape dramatically affects performance based on how data is accessed.
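# Row-wise access: each chunk spans full rows
z = zarr.zeros((10000, 10000), chunks=(10, 10000))
# Column-wise access: each chunk spans full columns
z = zarr.zeros((10000, 10000), chunks=(10000, 10))
# Mixed or unknown access: roughly square chunks
z = zarr.zeros((10000, 10000), chunks=(1000, 1000))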
Performance example: For a (200, 200, 200) array, reading along the first dimension:
- Using chunks (1, 200, 200): ~107ms
- Using chunks (200, 200, 1): ~1.65ms (65× faster!)
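Much of that gap comes from how many chunks a read has to touch. The sketch below counts the chunks intersected by a rectangular selection. `chunks_touched` is a hypothetical helper, and it assumes "reading along the first dimension" means a slice such as `a[:, 0, 0]`.

```python
from math import prod


def chunks_touched(selection, chunks):
    """Count chunks intersected by a rectangular selection.

    selection: one (start, stop) pair per dimension, stop exclusive and > start.
    chunks:    chunk length per dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunks):
        first = start // size        # grid index of the first chunk hit
        last = (stop - 1) // size    # grid index of the last chunk hit
        per_dim.append(last - first + 1)
    return prod(per_dim)


# a[:, 0, 0] on a (200, 200, 200) array
selection = [(0, 200), (0, 1), (0, 1)]
slab_chunks = chunks_touched(selection, (1, 200, 200))
column_chunks = chunks_touched(selection, (200, 200, 1))
print(slab_chunks, column_chunks)
```

Under that assumption the (1, 200, 200) layout touches 200 chunks and decompresses each one for a single element, while the (200, 200, 1) layout touches one chunk.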
Sharding for Large-Scale Storage
When arrays have millions of small chunks, use sharding to group chunks into larger storage objects:
import zarr
z = zarr.create_array(
store='data.zarr',
shape=(100000, 100000),
chunks=(100, 100),
shards=(1000, 1000),
dtype='f4'
)
Benefits:
- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Prevents filesystem block size waste
Important: Entire shards must fit in memory before writing.
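The effect of sharding on object count is simple grid arithmetic. The sketch below reuses the shapes from the example above; `grid_count` is a hypothetical helper, and the shard size shown is uncompressed.

```python
from math import prod


def grid_count(shape, block):
    """Number of blocks of the given shape needed to cover an array."""
    return prod(-(-n // b) for n, b in zip(shape, block))  # ceiling division per dimension


shape = (100_000, 100_000)
chunks = (100, 100)
shards = (1000, 1000)
itemsize = 4  # bytes per float32 element

n_chunks = grid_count(shape, chunks)            # storage objects without sharding
n_shards = grid_count(shape, shards)            # storage objects with sharding
chunks_per_shard = grid_count(shards, chunks)
shard_mb = prod(shards) * itemsize / 1e6        # memory needed to assemble one shard

print(n_chunks, n_shards, chunks_per_shard, shard_mb)
```

Here sharding cuts one million objects down to ten thousand, at the cost of buffering about 4 MB per shard while writing.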
Compression
Zarr applies compression per chunk to reduce storage while maintaining fast access.
Configuring Compression
import zarr
from zarr.codecs import BloscCodec, GzipCodec
# Default compression: omit the compressors argument
z = zarr.zeros((1000, 1000), chunks=(100, 100))
# create_array takes compressors= (plus filters= and serializer=), not a codecs= list
# Blosc using the Zstandard algorithm with byte shuffle
z = zarr.create_array(
    store='data_blosc.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')
)
# Gzip
z = zarr.create_array(
    store='data_gzip.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=GzipCodec(level=6)
)
# No compression
z = zarr.create_array(
    store='data_raw.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=None
)
Compression Performance Tips
- Blosc: Fast compression/decompression, good for interactive workloads (the Zarr-Python 2 default)
- Zstandard: Better compression ratios, slightly slower than LZ4 (the default for new Zarr format 3 arrays)
- Gzip: Maximum compression, slower performance
- LZ4: Fastest compression, lower ratios
- Shuffle: Enable shuffle filter for better compression on numeric data
# Numeric data: Blosc + Zstandard with shuffle
compressors=BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')
# Speed first
compressors=BloscCodec(cname='lz4', clevel=1)
# Compression ratio first
compressors=GzipCodec(level=9)
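To see why shuffling helps on numeric data, the sketch below applies a byte shuffle by hand and compresses with the standard-library zlib. zlib is a stand-in here, not Blosc, and `byte_shuffle` / `byte_unshuffle` are hypothetical helpers that only illustrate the idea of grouping same-significance bytes together.

```python
import struct
import zlib


def byte_shuffle(buf, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf, itemsize):
    """Inverse of byte_shuffle."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# 10,000 slowly increasing 32-bit integers, little-endian
values = [1000 + i for i in range(10_000)]
raw = struct.pack(f"<{len(values)}i", *values)

plain = zlib.compress(raw, 6)
shuffled = zlib.compress(byte_shuffle(raw, 4), 6)

print(f"raw: {len(raw)} B, plain: {len(plain)} B, shuffled: {len(shuffled)} B")
```

On smooth data like this, the high-order bytes barely change from one element to the next, so grouping them produces long repetitive runs that the compressor handles well. Noisy data benefits much less.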
Storage Backends
Zarr supports multiple storage backends through a flexible storage interface.
Local Filesystem (Default)
from zarr.storage import LocalStore
# Explicit store object
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
# Shorthand: a path string is treated as a local store
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000),
                    chunks=(100, 100))
In-Memory Storage
from zarr.storage import MemoryStore
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
ZIP File Storage
from zarr.storage import ZipStore
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close()
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
Cloud Storage (S3, GCS)
Zarr 3 uses fsspec backends via URI strings or FsspecStore (preferred over legacy S3Map/GCSMap).
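import zarr
# Write to S3 via a URI string; storage_options are forwarded to the fsspec backend (s3fs)
z = zarr.create_array(
    store="s3://my-bucket/path/to/array.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    storage_options={"anon": False},
)
z[:] = data  # data: a NumPy array of shape (1000, 1000) defined elsewhere
# Read from GCS
z = zarr.open_array(
    "gs://my-bucket/path/to/array.zarr",
    mode="r",
    storage_options={"project": "my-project"},
)
# Explicit FsspecStore, for example to open a whole hierarchy
from zarr.storage import FsspecStore
store = FsspecStore.from_url("s3://my-bucket/data.zarr", storage_options={"anon": False})
root = zarr.open_group(store=store, mode="r+")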
Cloud backends read credentials from provider environment variables locally via fsspec; they are not sent to third-party endpoints outside your configured bucket/project.
Cloud Storage Best Practices:
- Use consolidated metadata to reduce latency:
zarr.consolidate_metadata(store)
- Align chunk sizes with cloud object sizing (typically 5-100 MB optimal)
- Enable parallel writes using Dask for large-scale data
- Consider sharding to reduce number of objects
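When targeting a particular object size, it helps to work backwards from bytes to a chunk shape. The sketch below does this for square 2-D chunks; `square_chunk_edge` is a hypothetical helper, the targets are illustrative, and the sizes are uncompressed (stored objects will be smaller after compression).

```python
from math import isqrt


def square_chunk_edge(target_bytes, itemsize):
    """Largest edge n such that an n x n chunk stays within target_bytes."""
    return isqrt(target_bytes // itemsize)


FLOAT32_ITEMSIZE = 4

for target_mib in (5, 16, 100):
    edge = square_chunk_edge(target_mib * 2**20, FLOAT32_ITEMSIZE)
    actual_mib = edge * edge * FLOAT32_ITEMSIZE / 2**20
    print(f"target {target_mib} MiB -> chunks=({edge}, {edge}), {actual_mib:.2f} MiB")
```

For float32 data a 16 MiB target gives (2048, 2048) chunks. Non-square access patterns would redistribute the same element budget across dimensions.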
Groups and Hierarchies
Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.
Creating and Using Groups
root = zarr.group(store='data/hierarchy.zarr')
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')
temp_array = temperature.create_array(
name='t2m',
shape=(365, 720, 1440),
chunks=(1, 720, 1440),
dtype='f4'
)
precip_array = precipitation.create_array(
name='prcp',
shape=(365, 720, 1440),
chunks=(1, 720, 1440),
dtype='f4'
)
array = root['temperature/t2m']
print(root.tree())
Group API (v3)
Use create_array / require_array (h5py-style create_dataset / require_dataset were removed in v3):
root = zarr.group('data.zarr')
arr = root.create_array('my_data', shape=(1000, 1000), chunks=(100, 100), dtype='f4')
grp = root.require_group('subgroup')
arr2 = grp.require_array('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
Attributes and Metadata
Attach custom metadata to arrays and groups using attributes:
# Array attributes
z = zarr.zeros((1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1
print(z.attrs['units'])
# Group attributes
root = zarr.group('data.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'
# Attributes are stored with the node, so they are available after reopening
z2 = zarr.open('data.zarr')
print(z2.attrs['project'])
Important: Attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).
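A quick way to catch non-serializable attributes before they reach a store is to try encoding them with the standard-library json module. The sketch below does that; `check_attrs` is a hypothetical helper, and the conversions shown (ISO date strings, sorted lists) are one reasonable choice, not a Zarr requirement.

```python
import json
from datetime import date


def check_attrs(attrs):
    """Return the keys whose values cannot be encoded as JSON."""
    bad = []
    for key, value in attrs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(key)
    return bad


attrs = {
    "units": "K",
    "processing_version": 2.1,
    "created": date(2024, 1, 15),    # date objects are not JSON-serializable
    "tags": {"climate", "daily"},    # sets are not JSON-serializable
}
before = check_attrs(attrs)

# Convert to JSON-friendly forms before attaching to an array or group
attrs["created"] = attrs["created"].isoformat()
attrs["tags"] = sorted(attrs["tags"])
after = check_attrs(attrs)

print(before, after)
```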
Integration with NumPy, Dask, and Xarray
NumPy Integration
Zarr arrays implement the NumPy array interface:
import numpy as np
import zarr
z = zarr.zeros((1000, 1000), chunks=(100, 100))
result = np.sum(z, axis=0)
mean = np.mean(z[:100, :100])
numpy_array = z[:]
Dask Integration
Dask provides lazy, parallel computation on Zarr arrays:
import dask.array as da
import zarr
z = zarr.open('data.zarr', mode='w', shape=(100000, 100000),
chunks=(1000, 1000), dtype='f4')
dask_array = da.from_zarr('data.zarr')
result = dask_array.mean(axis=0).compute()
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
Benefits:
- Process datasets larger than memory
- Automatic parallel computation across chunks
- Efficient I/O with chunked storage
Xarray Integration
Xarray provides labeled, multidimensional arrays with Zarr backend:
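import numpy as np
import pandas as pd
import xarray as xr
# Open an existing Zarr store as a lazily loaded Dataset
ds = xr.open_zarr('data.zarr')
print(ds)
temperature = ds['temperature']
subset = ds.sel(time='2024-01', lat=slice(30, 60))
# Write a Dataset back to Zarr
ds.to_zarr('output.zarr')
# Build a Dataset from arrays shaped (time, lat, lon) and write it
data = np.random.random((365, 181, 360))
data2 = np.random.random((365, 181, 360))
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], data),
        'precipitation': (['time', 'lat', 'lon'], data2)
    },
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1)
    }
)
ds.to_zarr('climate_data.zarr')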
Benefits:
- Named dimensions and coordinates
- Label-based indexing and selection
- Integration with pandas for time series
- NetCDF-like interface familiar to climate/geospatial scientists
Parallel Computing and Thread Safety
The synchronizer argument (ThreadSynchronizer, ProcessSynchronizer) is not ported to Zarr-Python 3 yet. Use these patterns instead:
- Reads: always safe across threads/processes.
- Writes: safe when each worker writes to non-overlapping chunks; most stores support atomic chunk writes.
- Overlapping writes: coordinate externally (file locks, workflow design) until synchronizers return.
For Dask-heavy workloads, tune Zarr async concurrency; see Optimizing performance.
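One way to guarantee non-overlapping writes is to hand each worker a row range that starts and ends on chunk boundaries. The sketch below computes such ranges; `chunk_aligned_row_ranges` is a hypothetical helper, and it assumes each worker writes full-width row blocks such as `z[start:stop, :]`.

```python
def chunk_aligned_row_ranges(n_rows, chunk_rows, n_workers):
    """Split rows into per-worker (start, stop) ranges that begin on chunk boundaries."""
    n_chunks = -(-n_rows // chunk_rows)           # ceiling division
    base, extra = divmod(n_chunks, n_workers)     # spread leftover chunks over the first workers
    ranges = []
    chunk_start = 0
    for worker in range(n_workers):
        count = base + (1 if worker < extra else 0)
        if count == 0:
            continue                              # more workers than chunks
        start = chunk_start * chunk_rows
        stop = min((chunk_start + count) * chunk_rows, n_rows)
        ranges.append((start, stop))
        chunk_start += count
    return ranges


ranges = chunk_aligned_row_ranges(n_rows=10_000, chunk_rows=1000, n_workers=4)
print(ranges)
```

Because every boundary is a multiple of the chunk height, no chunk is shared between two workers, so no external locking is needed for this write pattern.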
Consolidated Metadata
For hierarchical stores with many arrays, consolidate metadata into a single file to reduce I/O operations:
import zarr
root = zarr.group('data.zarr')
zarr.consolidate_metadata('data.zarr')
root = zarr.open_consolidated('data.zarr')
Benefits:
- Reduces metadata read operations from N (one per array) to 1
- Critical for cloud storage (reduces latency)
- Speeds up tree() operations and group traversal
Cautions:
- Metadata can become stale if arrays update without re-consolidation
- Not suitable for frequently-updated datasets
- Multi-writer scenarios may have inconsistent reads
Performance Optimization
Checklist for Optimal Performance
- Chunk Size: Aim for 1-10 MB per chunk
  chunks = (512, 512)
- Chunk Shape: Align with access patterns
- Compression: Choose based on workload
- Storage Backend: Match to environment
- Sharding: Use for large-scale datasets
  shards=(10*chunk_size, 10*chunk_size)
- Parallel I/O: Use Dask for large operations
  import dask.array as da
  dask_array = da.from_zarr('data.zarr')
  result = dask_array.compute(scheduler='threads', num_workers=8)
Profiling and Debugging
print(z.info)
stored = z.nbytes_stored()  # a method call in Zarr-Python 3
print(f"Compressed size: {stored / 1e6:.2f} MB")
print(f"Uncompressed size: {z.nbytes / 1e6:.2f} MB")
print(f"Compression ratio: {z.nbytes / stored:.2f}x")
Common Patterns and Best Practices
Pattern: Time Series Data
z = zarr.open('timeseries.zarr', mode='a',
shape=(0, 720, 1440),
chunks=(1, 720, 1440),
dtype='f4')
new_data = np.random.random((1, 720, 1440))
z.append(new_data, axis=0)
Pattern: Large Matrix Operations
import dask.array as da
z = zarr.open('matrix.zarr', mode='w',
shape=(100000, 100000),
chunks=(1000, 1000),
dtype='f8')
dask_z = da.from_zarr('matrix.zarr')
result = (dask_z @ dask_z.T).compute()
Pattern: Cloud-Native Workflow
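import zarr
path = "s3://my-bucket/data.zarr"
opts = {"anon": False}
# Consolidated metadata belongs to a group, so keep the array inside a root group.
# The array name "values" is a placeholder.
root = zarr.open_group(path, mode="w", storage_options=opts)
z = root.create_array("values", shape=(10000, 10000), chunks=(500, 500), dtype="f4")
z[:] = data  # data: a NumPy array of shape (10000, 10000) defined elsewhere
zarr.consolidate_metadata(root.store)
# Reopen using the consolidated metadata and read a subset
root_read = zarr.open_consolidated(path, storage_options=opts)
subset = root_read["values"][0:100, 0:100]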
Pattern: Format Conversion
# HDF5 to Zarr
import h5py
import zarr
with h5py.File('data.h5', 'r') as h5:
    dataset = h5['dataset_name']
    z = zarr.array(dataset[:],
                   chunks=(1000, 1000),
                   store='data.zarr')
# NumPy to Zarr (create_array accepts data= and chunks='auto')
import numpy as np
data = np.load('data.npy')
z = zarr.create_array(store='data.zarr', data=data, chunks='auto')
# Zarr to NetCDF via Xarray
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
Common Issues and Solutions
Issue: Slow Performance
Diagnosis: Check chunk size and alignment
print(z.chunks)
print(z.info)
Solutions:
- Increase chunk size to 1-10 MB
- Align chunks with access pattern
- Try different compression codecs
- Use Dask for parallel operations
Issue: High Memory Usage
Cause: Loading entire array or large chunks into memory
Solutions:
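# Option 1: process the array in row blocks instead of loading it all at once
# (process is a placeholder for your own function)
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    process(chunk)
# Option 2: let Dask stream chunks through the computation
import dask.array as da
dask_z = da.from_zarr('data.zarr')
result = dask_z.mean().compute()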
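When a reduction is computed block by block, partial results need to be combined with care: averaging per-block means gives the wrong answer when blocks differ in size. The sketch below accumulates a running sum and count instead. `blockwise_mean` and `read_block` are hypothetical names, and a nested list stands in for the array so the example runs without any dependencies.

```python
def blockwise_mean(read_block, n_rows, block_rows):
    """Mean over all elements, reading at most block_rows rows at a time."""
    total = 0.0
    count = 0
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        block = read_block(start, stop)   # only this block is held in memory
        for row in block:
            total += sum(row)
            count += len(row)
    return total / count


# Stand-in for an on-disk array: 10 rows x 4 columns holding 0..39
rows = [[float(r * 4 + c) for c in range(4)] for r in range(10)]

mean = blockwise_mean(lambda a, b: rows[a:b], n_rows=10, block_rows=3)
print(mean)
```

With a real array, `read_block` could be something like `lambda a, b: z[a:b, :]`, ideally with `block_rows` set to a multiple of the chunk height so each chunk is read once.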
Issue: Cloud Storage Latency
Solutions:
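# Consolidate metadata so opening the hierarchy takes a single read
zarr.consolidate_metadata(store)
z = zarr.open_consolidated(store)
# Use larger chunks to cut the number of requests
chunks = (2000, 2000)
# Or group small chunks into shards
shards = (10000, 10000)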
Issue: Concurrent Write Conflicts
Solution: Design workflows so each process/thread writes to separate chunks. Zarr-Python 3 does not yet support ThreadSynchronizer / ProcessSynchronizer; see references/v3_migration.md.
Additional Resources
Bundled references
| File | Contents |
|---|---|
| references/api_reference.md | Function signatures, stores, codecs, indexing |
| references/v3_migration.md | Zarr-Python 2→3 breaking changes and WIP features |
Official upstream
Related libraries: Xarray, Dask, NumCodecs