Zarr in Cloud Object Storage#

In this tutorial, we’ll cover the following:

  • Finding a cloud hosted Zarr archive of CMIP6 dataset(s)

  • Remote data access to a single CMIP6 dataset (sea surface height)

  • Calculating the projected sea level change in 2100 relative to 2015

import gcsfs
import pandas as pd
import xarray as xr

Finding cloud native data#

Cloud-native data is data that is structured for efficient querying across the network. Typically, this means having metadata that describes the entire file in the file header, or having a separate pointer file, so that there is no need to download everything first.
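To make the "metadata first" idea concrete, here is a minimal sketch using only the Python standard library. The file layout (a length-prefixed JSON header followed by raw chunks) and the function names are invented for illustration; this is not how Zarr is actually laid out on disk, only the same principle in miniature: read a small index, then fetch just the bytes you need.

```python
import io
import json
import struct


def write_toy_file(chunks):
    """Pack named byte chunks behind a JSON header listing each chunk's offset and size."""
    index, body, offset = {}, b"", 0
    for name, payload in chunks.items():
        index[name] = {"offset": offset, "size": len(payload)}
        body += payload
        offset += len(payload)
    header = json.dumps(index).encode("utf-8")
    # First 4 bytes hold the header length as a little-endian unsigned int
    return struct.pack("<I", len(header)) + header + body


def read_chunk(buffer, name):
    """Read only the header, then seek straight to the single chunk requested."""
    buffer.seek(0)
    (header_len,) = struct.unpack("<I", buffer.read(4))
    index = json.loads(buffer.read(header_len))
    entry = index[name]
    buffer.seek(4 + header_len + entry["offset"])
    return buffer.read(entry["size"])


# Hypothetical chunk names and payloads, purely for demonstration
blob = write_toy_file({"2015-01": b"january-2015", "2100-12": b"december-2100"})
remote = io.BytesIO(blob)  # stands in for an object sitting in cloud storage
last = read_chunk(remote, "2100-12")
print(last)
```

On real object storage the two reads would be HTTP range requests, so the cost scales with the header plus the requested chunk rather than with the whole file.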

Quite commonly, you’ll see cloud-native datasets stored on the object storage services of the major cloud providers, though there are many other ones too.

Getting cloud hosted CMIP6 data#

The Coupled Model Intercomparison Project Phase 6 (CMIP6) dataset is a rich archive of modelling experiments carried out to predict climate change impacts. The datasets are stored in the Zarr format, and we’ll go over how to access them.

Sources:

First, let’s open a CSV containing the list of available CMIP6 datasets.

df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
print(f"Number of rows: {len(df)}")
df.head()
Number of rows: 514818
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon ps gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
1 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rsds gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
2 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rlus gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
3 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rlds gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
4 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon psl gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706

Over 500,000 rows! Let’s filter it down to the variable and experiment we’re interested in, e.g. sea surface height.

For the variable_id, you can look it up by keyword at https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y

For the experiment_id, download the spreadsheet from ES-DOC/esdoc-docs, go to the ‘experiment’ tab, and find the one you’re interested in.

Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6 (once you get your head around the acronyms and short names).

Below, we’ll filter to CMIP6 experiments matching:

  • Sea Surface Height Above Geoid [m] (variable_id: zos)

  • Shared Socioeconomic Pathway 5 (experiment_id: ssp585)

df_zos = df.query("variable_id == 'zos' & experiment_id == 'ssp585'")
df_zos
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
12081 ScenarioMIP NOAA-GFDL GFDL-ESM4 ssp585 r1i1p1f1 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ES... NaN 20180701
12082 ScenarioMIP NOAA-GFDL GFDL-ESM4 ssp585 r1i1p1f1 Omon zos gr gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ES... NaN 20180701
45176 ScenarioMIP AWI AWI-CM-1-1-MR ssp585 r1i1p1f1 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR... NaN 20181218
54265 ScenarioMIP CNRM-CERFACS CNRM-CM6-1 ssp585 r1i1p1f2 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM... NaN 20190219
69190 ScenarioMIP CNRM-CERFACS CNRM-CM6-1 ssp585 r4i1p1f2 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/CNRM-CERFACS/CNRM... NaN 20190410
... ... ... ... ... ... ... ... ... ... ... ...
497517 ScenarioMIP MIROC MIROC-ES2L ssp585 r2i1p1f2 Omon zos gr1 gs://cmip6/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/... NaN 20201222
497957 ScenarioMIP MIROC MIROC-ES2L ssp585 r10i1p1f2 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/... NaN 20201222
497958 ScenarioMIP MIROC MIROC-ES2L ssp585 r10i1p1f2 Omon zos gr1 gs://cmip6/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/... NaN 20201222
502853 ScenarioMIP EC-Earth-Consortium EC-Earth3-CC ssp585 r1i1p1f1 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/EC-Earth-Consorti... NaN 20210113
505073 ScenarioMIP CMCC CMCC-ESM2 ssp585 r1i1p1f1 Omon zos gn gs://cmip6/CMIP6/ScenarioMIP/CMCC/CMCC-ESM2/ss... NaN 20210126

272 rows × 11 columns

There are 272 matching rows for the ssp585 experiment. Let’s just get the URL to the first one in the list for now.

print(df_zos.zstore.iloc[0])
gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/
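Taking the first row is convenient, but the table above shows that the same model can appear on several grids (gn and gr, for instance). The sketch below shows one way to narrow the selection with extra query conditions. It uses a tiny hand-made DataFrame that mimics a few catalog columns instead of downloading the real CSV; the truncated zstore strings are copied from the table display as placeholders and are not usable URLs.

```python
import pandas as pd

# Hypothetical mini-catalog mimicking a few columns of pangeo-cmip6.csv
catalog = pd.DataFrame(
    {
        "source_id": ["GFDL-ESM4", "GFDL-ESM4", "AWI-CM-1-1-MR"],
        "experiment_id": ["ssp585", "ssp585", "ssp585"],
        "member_id": ["r1i1p1f1", "r1i1p1f1", "r1i1p1f1"],
        "variable_id": ["zos", "zos", "zos"],
        "grid_label": ["gn", "gr", "gn"],
        "zstore": [
            "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/",
            "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ES...",  # truncated placeholder
            "gs://cmip6/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR...",  # truncated placeholder
        ],
    }
)

# Same query style as before, with the model and grid pinned down as well
selection = catalog.query(
    "variable_id == 'zos' & experiment_id == 'ssp585'"
    " & source_id == 'GFDL-ESM4' & grid_label == 'gn'"
)
zstore_url = selection.zstore.iloc[0]
print(zstore_url)
```

The same conditions should work unchanged on the full df loaded earlier, since they only reference columns that appear in the catalog.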

Reading from the remote Zarr storage#

In many cases, you’ll need to first connect to the cloud provider. The CMIP6 dataset allows anonymous access, but for some cases, you may need to authenticate.

fs = gcsfs.GCSFileSystem(token="anon")

Next, we’ll need a mapping to the Google Cloud Storage object. This can be done using fs.get_mapper.

A more generic way (for other cloud providers) is to use fsspec.get_mapper instead.

store = fs.get_mapper(
    "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/"
)
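For comparison, here is a small sketch of the provider-agnostic fsspec.get_mapper route mentioned above. So that it runs offline, it points at fsspec’s built-in in-memory filesystem ("memory://"); the path demo-store and the key name are made up for illustration. The commented-out line at the end shows what the CMIP6 equivalent would presumably look like, assuming gcsfs is installed so that fsspec can resolve the gs:// protocol.

```python
import fsspec

# "memory://" is fsspec's in-memory filesystem; "demo-store" is a made-up path
store = fsspec.get_mapper("memory://demo-store")

# A mapper behaves like a dict that maps object keys to bytes
store["greeting.txt"] = b"hello zarr"
print("greeting.txt" in store)
print(store["greeting.txt"])

# Assumed equivalent for the CMIP6 store (extra keyword arguments are passed
# on to the underlying filesystem, here gcsfs):
# store = fsspec.get_mapper(
#     "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/",
#     token="anon",
# )
```

Because only the URL prefix changes, the same code path can be pointed at other storage backends that fsspec supports.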

With that, we can open the Zarr store like so.

ds = xr.open_zarr(store=store, consolidated=True)
ds
<xarray.Dataset> Size: 2GB
Dimensions:    (bnds: 2, y: 576, x: 720, vertex: 4, time: 1032)
Coordinates:
  * bnds       (bnds) float64 16B 1.0 2.0
    lat        (y, x) float32 2MB dask.array<chunksize=(576, 720), meta=np.ndarray>
    lat_bnds   (y, x, vertex) float32 7MB dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
    lon        (y, x) float32 2MB dask.array<chunksize=(576, 720), meta=np.ndarray>
    lon_bnds   (y, x, vertex) float32 7MB dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * time       (time) object 8kB 2015-01-16 12:00:00 ... 2100-12-16 12:00:00
    time_bnds  (time, bnds) object 17kB dask.array<chunksize=(1032, 2), meta=np.ndarray>
  * x          (x) float64 6kB -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y          (y) float64 5kB -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
Dimensions without coordinates: vertex
Data variables:
    zos        (time, y, x) float32 2GB dask.array<chunksize=(61, 576, 720), meta=np.ndarray>
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60225.0
    branch_time_in_parent:  60225.0
    comment:                <null ref>
    ...                     ...
    tracking_id:            hdl:21.14100/328fd441-31bb-474f-8fd1-7e3420b72524...
    variable_id:            zos
    variant_info:           N/A
    variant_label:          r1i1p1f1
    netcdf_tracking_ids:    hdl:21.14100/328fd441-31bb-474f-8fd1-7e3420b72524...
    version_id:             v20180701

Selecting time slices#

Let’s say we want to calculate sea level change between 2015 and 2100. We can access just the specific time points needed using xr.DataArray.sel.

zos_2015jan = ds.zos.sel(time="2015-01-16").squeeze()
zos_2100dec = ds.zos.sel(time="2100-12-16").squeeze()
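The .squeeze() call is there because selecting with a day-level date string on a time axis stamped at 12:00 can return a length-1 time dimension rather than a plain 2D slice. Below is a minimal sketch on synthetic data. Note the assumptions: the toy array uses pandas datetimes, whereas the real dataset’s time axis holds cftime objects, and the values are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for ds.zos: three monthly steps stamped mid-month at 12:00
time = pd.to_datetime(["2015-01-16 12:00", "2015-02-15 12:00", "2015-03-16 12:00"])
toy = xr.DataArray(
    np.arange(3 * 2 * 4, dtype="float32").reshape(3, 2, 4),
    dims=("time", "y", "x"),
    coords={"time": time},
    name="zos",
)

picked = toy.sel(time="2015-01-16")  # day-level string against an hour-level index
print(picked.dims)                   # may still include "time"

snapshot = picked.squeeze()          # drop any length-1 dimensions
print(snapshot.dims)
```

Calling .squeeze() is harmless when the selection is already 2D, which makes it a safe habit after this kind of label-based selection.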

Sea level change would just be 2100 minus 2015.

sealevelchange = zos_2100dec - zos_2015jan

Note that up to this point, we have not actually downloaded any (big) data yet from the cloud. This is all working based on metadata only.

To bring the data from the cloud to your local computer, call .compute. This will take a while depending on your connection speed.

sealevelchange = sealevelchange.compute()
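To see the lazy behaviour without touching the network, here is a small sketch on a synthetic dask-backed array (it assumes dask is installed, which xarray needs for chunked arrays). The array contents and chunk sizes are arbitrary.

```python
import numpy as np
import xarray as xr

# Synthetic dask-backed array standing in for ds.zos
toy = xr.DataArray(
    np.ones((4, 3, 5), dtype="float32"),
    dims=("time", "y", "x"),
    name="zos",
).chunk({"time": 1})

# Building the difference only records a task graph; no values are computed yet
diff = toy.isel(time=-1) - toy.isel(time=0)
is_lazy_before = diff.chunks is not None

# .compute() evaluates the graph and returns a NumPy-backed array
result = diff.compute()
is_lazy_after = result.chunks is not None

print(is_lazy_before, is_lazy_after)
```

Checking .chunks is one quick way to tell whether an xarray object is still lazy; with the remote Zarr store, the evaluation step is where the actual download happens.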

We can do a quick plot to show how sea level is predicted to change between 2015 and 2100 (from one modelled experiment).

sealevelchange.plot.imshow()
<matplotlib.image.AxesImage at 0x7f6781db2ff0>
[Figure: map of sea level change (December 2100 minus January 2015) from the GFDL-ESM4 ssp585 run]

Notice the blue parts between 40°S and 60°S where sea level is projected to drop? That has to do with the Antarctic ice sheet losing mass, which weakens its gravitational pull and results in a relative decrease in sea level nearby. Over most of the Northern Hemisphere though, sea level is projected to rise between 2015 and 2100.

That’s all! Hopefully this will get you started on accessing more cloud-native datasets!