# Zarr in Cloud Object Storage

In this tutorial, we'll cover the following:
- Finding a cloud hosted Zarr archive of CMIP6 dataset(s)
- Remote data access to a single CMIP6 dataset (sea surface height)
- Calculating the projected sea level change in 2100 compared to 2015

In [None]:
import gcsfs
import pandas as pd
import xarray as xr

## Finding cloud native data

Cloud-native data means data that is structured for efficient querying across the network.
Typically, this means having metadata that describes the entire file in the header of the
file, or having a separate pointer file (so that there is no need to download everything first).

Quite commonly, you'll see cloud-native datasets stored on these
three object storage providers, though there are many other ones too.

- [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3)
- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs)
- [Google Cloud Storage](https://cloud.google.com/storage)

### Getting cloud hosted CMIP6 data

The [Coupled Model Intercomparison Project Phase 6 (CMIP6)](https://en.wikipedia.org/wiki/CMIP6#CMIP_Phase_6)
dataset is a rich archive of modelling experiments carried out to project the impacts of climate change.
The datasets are stored using the [Zarr](https://zarr.dev) format, and we'll go over how to access them.

Sources:
- https://esgf-node.llnl.gov/search/cmip6/
- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6
- Pangeo/ESGF Cloud Data Access tutorial - https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html

First, let's open a CSV containing the list of available CMIP6 datasets.

In [None]:
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
print(f"Number of rows: {len(df)}")
df.head()

Over 500,000 rows! Let's filter it down to the variable and experiment
we're interested in, e.g. sea surface height.

For the `variable_id`, you can look it up given some keyword at
https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y

For the `experiment_id`, download the spreadsheet from
https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,
go to the 'experiment' tab, and find the one you're interested in.

Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6
(once you get your head around the acronyms and short names).

Below, we'll filter to CMIP6 experiments matching:
- Sea Surface Height Above Geoid [m] (variable_id: `zos`)
- Shared Socioeconomic Pathway 5 (experiment_id: `ssp585`)

In [None]:
df_zos = df.query("variable_id == 'zos' & experiment_id == 'ssp585'")
df_zos
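If the filtered list is still too long, the same `query` approach can narrow it further. The sketch below uses a small made-up table rather than the real catalog, and it assumes the catalog also has `source_id` (model name) and `table_id` (output table, e.g. `Omon`) columns; the model names and URLs are placeholders.

```python
import pandas as pd

# Toy stand-in for the catalog: every row here is invented for illustration.
toy = pd.DataFrame(
    {
        "source_id": ["MODEL-A", "MODEL-A", "MODEL-B", "MODEL-B"],
        "experiment_id": ["ssp585", "historical", "ssp585", "ssp585"],
        "table_id": ["Omon", "Omon", "Omon", "Amon"],
        "variable_id": ["zos", "zos", "zos", "tas"],
        "zstore": [
            "gs://example/a1/",
            "gs://example/a2/",
            "gs://example/b1/",
            "gs://example/b2/",
        ],
    }
)

# Combine several conditions in one query string.
subset = toy.query(
    "variable_id == 'zos' and experiment_id == 'ssp585' and table_id == 'Omon'"
)

# Count how many matching stores each model contributes.
counts = subset.groupby("source_id").size()
print(counts)
```

Grouping by model like this is one quick way to see which models provide the variable before picking a store to open.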

There are 272 modelled scenarios for SSP5.
Let's just get the URL to the first one in the list for now.

In [None]:
print(df_zos.zstore.iloc[0])
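The store URLs are not opaque: their path segments appear to mirror the catalog's columns. As a rough sketch, the helper below (`parse_zstore` is a hypothetical name, and the field order is an assumption inferred from the store URL used in this tutorial) splits a URL back into those pieces.

```python
# Assumed order of the trailing path segments in a CMIP6 store URL.
FIELDS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
]


def parse_zstore(url: str) -> dict:
    """Map the last len(FIELDS) path segments of a store URL to field names."""
    parts = url.rstrip("/").split("/")
    return dict(zip(FIELDS, parts[-len(FIELDS):]))


info = parse_zstore(
    "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/"
)
print(info["source_id"], info["experiment_id"], info["variable_id"])
```

This can be handy for checking at a glance which model, experiment, and variable a given URL points to.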

## Reading from the remote Zarr storage

In many cases, you'll need to first connect to the cloud provider.
The CMIP6 dataset allows anonymous access, but in other cases
you may need to authenticate.

In [None]:
fs = gcsfs.GCSFileSystem(token="anon")

Next, we'll need a mapping to the Google Storage object.
This can be done using `fs.get_mapper`.

A more generic way (for other cloud providers) is to use
[`fsspec.get_mapper`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.get_mapper) instead.

In [None]:
store = fs.get_mapper(
    "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/"
)
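For comparison, here is a minimal sketch of the more generic `fsspec.get_mapper` route. It picks the filesystem from the URL prefix and forwards extra keyword arguments (such as `token`) to that filesystem. So that the sketch runs without network access, it uses fsspec's in-memory filesystem with a hypothetical `memory://demo-store` location; the key written to it is a placeholder, not real Zarr metadata.

```python
import fsspec

# For the store in this tutorial, the call would look something like:
#   store = fsspec.get_mapper(
#       "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/",
#       token="anon",
#   )
# (this needs gcsfs installed and a network connection).

# Offline stand-in: same function, different protocol prefix.
mapper = fsspec.get_mapper("memory://demo-store")

# The mapper behaves like a dictionary of byte strings keyed by relative path.
mapper["zos/.zarray"] = b"{}"
print(type(mapper).__name__)
print(mapper["zos/.zarray"])
```

Switching cloud providers then mostly comes down to changing the URL prefix and the credentials passed as keyword arguments.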

With that, we can open the Zarr store like so.

In [None]:
ds = xr.open_zarr(store=store, consolidated=True)
ds

### Selecting time slices

Let's say we want to calculate sea level change between
2015 and 2100. We can access just the specific time points
needed using [`xr.Dataset.sel`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html).

In [None]:
zos_2015jan = ds.zos.sel(time="2015-01-16").squeeze()
zos_2100dec = ds.zos.sel(time="2100-12-16").squeeze()
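If you don't know the exact timestamps in a dataset, `sel` also accepts `method="nearest"`. The sketch below illustrates this on a small synthetic dataset (the grid size, values, and mid-month timestamps are made up, and a standard calendar is assumed), not on the CMIP6 store itself.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly time axis, stamped on the 16th of each month.
time = pd.date_range("2015-01-01", "2100-12-01", freq="MS") + pd.Timedelta(days=15)

# Made-up values on a tiny 2 x 3 grid, one layer per month.
data = np.zeros((time.size, 2, 3))
toy = xr.Dataset({"zos": (("time", "y", "x"), data)}, coords={"time": time})

# Ask for dates that are not in the index; the closest time step is returned.
first = toy.zos.sel(time="2015-01-01", method="nearest")
last = toy.zos.sel(time="2100-12-31", method="nearest")

print(first.time.values, last.time.values)
print(first.dims)  # the time dimension is gone, so no squeeze is needed
```

With `method="nearest"` the result is a single time step, so the `time` dimension is dropped directly.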

Sea level change is then simply the 2100 value minus the 2015 value.

In [None]:
sealevelchange = zos_2100dec - zos_2015jan

Note that up to this point, we have not actually downloaded any
(big) data yet from the cloud. This is all working based on
metadata only.

To bring the data from the cloud to your local computer, call `.compute()`.
This will take a while depending on your connection speed.

In [None]:
sealevelchange = sealevelchange.compute()
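To see the lazy behaviour in isolation, here is a small sketch on synthetic data. It assumes dask is installed (which `.chunk` requires); the array name and shape are made up for illustration.

```python
import numpy as np
import xarray as xr

# A tiny in-memory array standing in for a remote variable.
da = xr.DataArray(np.ones((4, 3)), dims=("y", "x"), name="zos_demo")

# Chunking turns it into a dask-backed (lazy) array.
lazy = da.chunk({"y": 2})

# Arithmetic on lazy arrays only builds a task graph; nothing is evaluated yet.
diff = lazy * 2 - lazy
print(type(diff.data))

# compute() runs the graph and returns a NumPy-backed result.
result = diff.compute()
print(type(result.data))
```

The same pattern applies to the remote store: operations stay lazy until `.compute()` (or something that needs actual values, such as plotting) triggers the download.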

We can do a quick plot to show how sea level is predicted to change
between 2015 and 2100 (from one modelled experiment).

In [None]:
sealevelchange.plot.imshow()

Notice the blue parts between 40°S and 60°S where sea level is projected to drop?
That's to do with the Antarctic ice sheet losing mass, which weakens its
gravitational pull and results in a relative decrease in sea level. Over most
of the Northern Hemisphere though, sea level is projected to rise between 2015 and 2100.

That's all! Hopefully this will get you started on accessing more cloud-native datasets!