# Introduction to Zarr

## Learning Objectives:

- Understand the principles of the Zarr data format
- Learn how to read and write Zarr stores using the `zarr-python` library
- Explore how to use Zarr stores with `xarray` for data analysis and visualization

This notebook provides a brief introduction to Zarr and how to
use it in cloud environments for scalable, chunked, and compressed data storage.

Zarr is a data format with implementations in several languages. This tutorial explores the format through some features of the `zarr-python` library and shows how Zarr stores can be opened with `xarray`.

## What is Zarr?

The Zarr data format is an open, community-maintained format designed for efficient, scalable storage of large N-dimensional arrays. It stores data as compressed and chunked arrays in a format well-suited to parallel processing and cloud-native workflows. 

### Zarr Data Organization:
- **Arrays**: N-dimensional arrays that can be chunked and compressed.
- **Groups**: A container for organizing multiple arrays and other groups with a hierarchical structure.
- **Metadata**: JSON-like metadata describing the arrays and groups, including information about data types, dimensions, chunking, compression, and user-defined key-value fields. 
- **Dimensions and Shape**: Arrays can have any number of dimensions, and their shape is defined by the number of elements in each dimension.
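- **Coordinates & Indexing**: Zarr arrays support NumPy-style indexing and slicing. Coordinate arrays for each dimension are stored as ordinary arrays by convention (for example, by Xarray) rather than as a separate Zarr concept.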

The diagram below, from [the Zarr v3 specification](https://wiki.earthdata.nasa.gov/display/ESO/Zarr+Format), shows the structure of a Zarr store:

![ZarrSpec](https://zarr-specs.readthedocs.io/en/latest/_images/terminology-hierarchy.excalidraw.png)


NetCDF and Zarr share similar terminology and functionality. The key difference is that a NetCDF dataset is typically a single file, while a Zarr dataset is a “store” (for example, a directory) made up of many small objects: metadata documents plus one object per chunk. This layout makes Zarr better suited to distributed and cloud-based workflows.

### Zarr Fundamentals
A Zarr array has the following important properties:
- **Shape**: The dimensions of the array.
- **Dtype**: The data type of each element (e.g., float32).
- **Attributes**: Metadata stored as key-value pairs (e.g., units, description).
- **Compressors**: Algorithms used to compress each chunk (e.g., Zstd, Blosc, Zlib).


#### Example: Creating and Inspecting a Zarr Array

Here we create a simple 2D array of shape `(40, 50)` with chunks of shape `(10, 10)` and write it to a `LocalStore` in the `test.zarr` directory.


In [None]:
import zarr
import pathlib
import shutil

# Ensure we start with a clean directory for the tutorial
datadir = pathlib.Path('../data/zarr-tutorial')
if datadir.exists():
    shutil.rmtree(datadir)

output = datadir / 'test.zarr'
z = zarr.create_array(shape=(40, 50), chunks=(10, 10), dtype='f8', store=output)
z

`.info` provides a summary of the array's properties, including shape, data type, and compression settings.


In [None]:
z.info

In [None]:
z.fill_value

No data has been written to the array yet. If we try to access the data, we will get a fill value: 

In [None]:
z[0, 0]

We assign data to the array with NumPy-style indexing. Each assignment is written to the store immediately.

In [None]:
import numpy as np

z[:] = 1
z[0, :] = np.arange(50)
z[:]

#### Attributes

We can attach arbitrary metadata to our Array via attributes:

In [None]:
z.attrs['units'] = 'm/s'
z.attrs['standard_name'] = 'wind_speed'
print(dict(z.attrs))

### Zarr Data Storage

Zarr can be stored in memory, on disk, or in cloud storage systems like Amazon S3.

Let's look under the hood. _The ability to look inside a Zarr store and understand what is there is a deliberate design decision._

In [None]:
z.store

In [None]:
!tree -a {output}

In [None]:
!cat {output}/zarr.json
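
As a rough guide to reading the directory listing, the sketch below maps an element index to the key of the chunk that holds it. It assumes the Zarr v3 default chunk key encoding (a `c` prefix followed by the chunk grid coordinates, separated by `/`). The helper `chunk_key` is hypothetical and not part of `zarr-python`.

```python
def chunk_key(index, chunks):
    """Return the store key of the chunk containing `index`.

    Assumes the Zarr v3 default chunk key encoding: 'c' followed by the
    chunk grid coordinates, joined with '/'.
    """
    grid_coords = [i // c for i, c in zip(index, chunks)]
    return '/'.join(['c', *map(str, grid_coords)])


chunks = (10, 10)  # chunk shape used for test.zarr above
print(chunk_key((0, 0), chunks))    # c/0/0
print(chunk_key((39, 49), chunks))  # c/3/4
```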

### Hierarchical Groups

Zarr allows you to create hierarchical groups, similar to directories. To create groups in your store, use the `create_group` method after creating a root group. Here, we’ll create two groups, `temp` and `precip`.

In [None]:
store = zarr.storage.MemoryStore()
root = zarr.create_group(store=store)
temp = root.create_group('temp')
precip = root.create_group('precip')
t2m = temp.create_array('t2m', shape=(100, 100), chunks=(10, 10), dtype='i4')
prcp = precip.create_array('prcp', shape=(1000, 1000), chunks=(10, 10), dtype='i4')
root.tree()

Groups and arrays can be accessed by name using dictionary-style indexing, including nested paths such as `temp/t2m`.



In [None]:
display(root['temp'])
root['temp/t2m'][:, 3]

To get a look at your overall dataset, the `tree` and `info` methods are helpful.



In [None]:
root.info

In [None]:
root.tree()

#### Chunking
Chunking is the process of dividing Zarr arrays into smaller pieces. This allows for parallel processing and efficient storage.

One of the important parameters in Zarr is the chunk shape, which determines how the data is divided into smaller, manageable pieces. This is crucial for performance, especially when working with large datasets.

To examine the chunk shape of a Zarr array, you can use the `chunks` attribute. This will show you the size of each chunk in each dimension.

In [None]:
z.chunks

When selecting chunk shapes, we need to keep in mind two constraints:

- Concurrent writes are possible as long as different processes write to separate chunks, enabling highly parallel data writing. 
- When reading data, if any piece of the chunk is needed, the entire chunk has to be loaded. 

The optimal chunk shape depends on how you want to access the data. For example, if you only ever read whole rows of a 2-dimensional array (selecting along the first dimension), use chunks that span the full second dimension so that each row read touches as few chunks as possible.
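
The read constraint above can be made concrete by counting how many chunks a selection overlaps. This is a minimal sketch in plain Python; the helper `chunks_touched` is hypothetical, and it only counts chunks rather than measuring real I/O.

```python
import math


def chunks_touched(selection, chunks):
    """Count the chunks that overlap a selection.

    `selection` is a tuple of (start, stop) index ranges, one per dimension,
    with `stop` exclusive.
    """
    per_dim = [
        math.ceil(stop / c) - start // c
        for (start, stop), c in zip(selection, chunks)
    ]
    return math.prod(per_dim)


# Read one value per step along the first axis of a (200, 200, 200) array
selection = ((0, 200), (0, 1), (0, 1))

print(chunks_touched(selection, chunks=(1, 200, 200)))  # 200 chunks
print(chunks_touched(selection, chunks=(200, 200, 1)))  # 1 chunk
```

With chunks of shape `(1, 200, 200)` this read has to load every chunk in the array, while chunks of shape `(200, 200, 1)` need only one.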

Here we will compare two different chunking strategies.


In [None]:
output = datadir / 'c.zarr'
c = zarr.create_array(shape=(200, 200, 200), chunks=(1, 200, 200), dtype='f8', store=output)
c[:] = np.random.randn(*c.shape)

In [None]:
%time _ = c[:, 0, 0]

In [None]:
output = datadir / 'd.zarr'
d = zarr.create_array(shape=(200, 200, 200), chunks=(200, 200, 1), dtype='f8', store=output)
d[:] = np.random.randn(*d.shape)

In [None]:
%time _ = d[:, 0, 0]

### Sharding
When working with large arrays and small chunks, Zarr’s sharding feature can improve storage efficiency and performance. Instead of writing each chunk to a separate file—which can overwhelm file systems and cloud object stores—sharding groups multiple chunks into a single storage object.

Why Use Sharding?

- File systems struggle with too many small files.
- Small files (e.g., 1 MB or less) may waste space due to filesystem block size.
- Object storage systems (e.g., S3) can slow down with a high number of objects.

With sharding, you choose:
- Shard size: the logical shape of each shard, which is expected to include one or more chunks
- Chunk size: the shape of each compressed chunk

It is important to remember that the shard is the minimum unit of writing. This means that writers must be able to fit the entire shard (including all of the compressed chunks) in memory before writing a shard to a store.


This example shows how to create a sharded Zarr array with a chunk shape of `(100, 100, 100)` and a shard shape of `(1000, 1000, 1000)`. Each shard therefore holds 10 × 10 × 10 = 1,000 chunks.


In [None]:
z6 = zarr.create_array(
    store={},
    shape=(10000, 10000, 1000),
    chunks=(100, 100, 100),
    shards=(1000, 1000, 1000),
    dtype='uint8',
)

z6.info
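
The arithmetic behind this layout is worth checking by hand, since a writer must hold a whole shard in memory. The sketch below uses only the shapes from the cell above and plain Python. The byte count is the uncompressed size, so it is an upper-bound estimate rather than what ends up in the store.

```python
import math

shape = (10000, 10000, 1000)
chunks = (100, 100, 100)
shards = (1000, 1000, 1000)
itemsize = 1  # bytes per element for uint8

# Chunks packed into one shard, and shards needed to cover the array
chunks_per_shard = math.prod(s // c for s, c in zip(shards, chunks))
shards_total = math.prod(math.ceil(n / s) for n, s in zip(shape, shards))

# Uncompressed size of one full shard
shard_bytes = math.prod(shards) * itemsize

print(chunks_per_shard)   # 1000
print(shards_total)       # 100
print(shard_bytes / 1e9)  # 1.0 (GB, uncompressed)
```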


```{tip}
Choose shard and chunk sizes that balance I/O performance and manageability for your filesystem or cloud backend.
```

#### Compressors
A number of different compressors can be used with Zarr. The built-in options include Blosc, Zstandard, and Gzip. Additional compressors are available through the [NumCodecs](https://numcodecs.readthedocs.io) package, which supports LZ4, Zlib, BZ2, and LZMA. 

Let's check the compressor we used when creating the array:

In [None]:
z.compressors

If you don’t specify a compressor, by default Zarr uses the Zstandard compressor.

How much space was saved by compression?


In [None]:
z.info_complete()

You can set `compressors=None` when creating a Zarr array with `zarr.create_array` to turn off compression. This is useful for debugging or when you want to store data without compression.

```{note}
`.info_complete()` provides a more detailed view of the Zarr array, including metadata about the chunks, compressors, and attributes, but will be slower for larger arrays. 
```

#### Consolidated Metadata
Zarr supports consolidated metadata, which allows you to store all metadata in a single file. This can improve performance when reading metadata, especially for large datasets.

So far we have only worked with Zarr stores that hold a single array. In this next example, we create a Zarr store with multiple arrays and then consolidate its metadata. The speed-up is most significant with remote storage, which we will see in the later example on accessing cloud storage.

In [None]:
store = zarr.storage.MemoryStore()
group = zarr.create_group(store=store)
group.create_array(shape=(1,), name='a', dtype='float64')
group.create_array(shape=(2, 2), name='b', dtype='float64')
group.create_array(shape=(3, 3, 3), name='c', dtype='float64')
zarr.consolidate_metadata(store)

Now, if we open that group, the group’s metadata includes a `zarr.core.group.ConsolidatedMetadata` object that can be used:

In [None]:
from pprint import pprint

consolidated = zarr.open_group(store=store)
consolidated_metadata = consolidated.metadata.consolidated_metadata.metadata

pprint(dict(sorted(consolidated_metadata.items())))

Note that while Zarr-Python supports consolidated metadata for v2 and v3 formatted Zarr stores, it is not technically part of the specification (hence the warning above). 

⚠️ Use Caution When ⚠️
- **Stale or incomplete consolidated metadata**: If the dataset is updated but the consolidated metadata entrypoint isn't re-consolidated, readers may miss chunks or metadata. Always run `zarr.consolidate_metadata()` after changes.
- **Concurrent writes or multi-writer pipelines**: Consolidated metadata can lead to inconsistent reads if multiple processes write without coordination. Use with caution in dynamic or shared write environments.
- **Local filesystems or mixed toolchains**: On local storage, consolidation offers little benefit as hierarchy discovery is generally quite cheap. 

### Object Storage as a Zarr Store

Zarr’s layout (many chunk objects per array) maps naturally onto object storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Each chunk is stored as a separate object, enabling distributed reads and writes.



Here are some examples of Zarr stores on the cloud:

* [Zarr data in Microsoft's Planetary Computer](https://planetarycomputer.microsoft.com/catalog?filter=zarr)
* [Zarr data from Google](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.226354714.1000882083.1692116148-1788942020.1692116148&pli=1&q=zarr)
* [Amazon Sustainability Data Initiative available from Registry of Open Data on AWS](https://registry.opendata.aws/collab/asdi/) - Enter "Zarr" in the Search input box.
* [Pangeo-Forge Data Catalog](https://pangeo-forge.org/catalog)


### Xarray and Zarr

Xarray has built-in support for reading and writing Zarr data. You can open a Zarr store as an Xarray dataset with `xarray.open_zarr()` or, as in the cell below, with `xarray.open_dataset()` and `engine='zarr'`.



In [None]:
import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'

ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds

In [None]:
ds.precip

::::{admonition} Exercise
:class: tip

Can you calculate the mean precipitation for January 2020 in the GPCP dataset and plot it?

:::{admonition} Solution
:class: dropdown

```python
ds.precip.sel(time=slice('2020-01-01', '2020-01-31')).mean(dim='time').plot()
```
:::
::::

Check out our other [tutorial notebook](<project:#cmip6-cloud>), which highlights the CMIP6 Zarr dataset stored in the cloud.

## Additional Resources

- [Zarr Documentation](https://zarr.readthedocs.io/en/stable/)
- [Cloud Optimized Geospatial Formats](https://guide.cloudnativegeo.org/zarr/zarr-in-practice.html)
- [Scalable and Computationally Reproducible Approaches to Arctic Research](https://learning.nceas.ucsb.edu/2025-04-arctic/sections/zarr.html)
- [Zarr Cloud Native Geospatial Tutorial](https://github.com/zarr-developers/tutorials/blob/main/zarr_cloud_native_geospatial_2022.ipynb)