Access Patterns to Remote Data with fsspec#

Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs; the CMIP6 tutorial shows this pattern in detail. These formats were designed to be accessed efficiently over the internet. In many cases, however, we need to access data that is not available in such formats.

This notebook explores how we can leverage xarray’s backends to access remote files. For this we will use fsspec, a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API usable by many format-specific libraries.

Before working with remote data, it helps to understand how xarray handles local files and how xarray backends work. The following diagram shows the components involved in accessing data either locally or remotely using the h5netcdf backend, which relies on a format-specific library to access HDF5 files.

(Diagram: xarray data-access components for local and remote files)

Let’s consider a scenario where we have a local NetCDF4 file containing gridded data. NetCDF is a common file format used in scientific research for storing array-like data.

import xarray as xr

localPath = "../../data/sst.mnmean.nc"

ds = xr.open_dataset(localPath)
ds
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

xarray backends under the hood#

  • What happened when we ran xr.open_dataset("path-to-file")?

As we know, xarray is a flexible and modular library. When we open a file, we ask xarray to use one of its format-specific engines to load the actual array data from the file into memory. File formats come in different flavors, from general-purpose HDF5 to very domain-specific ones like GRIB2. When we call open_dataset(), the first thing xarray does is try to guess which of the installed backends can handle the file; in this case we pass a string with a valid local path.

We’ll use a helper function to print a simplified call stack and see what’s going on under the hood.

import sys
from IPython.display import Code


tracing_output = []
_match_pattern = "xarray"


def trace_calls(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        func_name = code.co_name
        func_file = code.co_filename.split("/site-packages/")[-1]
        func_line = code.co_firstlineno
        if not func_name.startswith("_") and _match_pattern in func_file:
            tracing_output.append(f"def {func_name}() at {func_file}:{func_line}")
    return trace_calls


# we enable tracing and call open_dataset()
sys.settrace(trace_calls)
ds = xr.open_dataset(localPath)
sys.settrace(None)

# Print the trace with some syntax highlighting
Code(" \n".join(tracing_output[0:10]), language='python')
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def helper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/contextlib.py:299 
def open_dataset() at xarray/backends/api.py:391 
def guess_engine() at xarray/backends/plugins.py:147 
def guess_can_open() at xarray/backends/netCDF4_.py:607 
def is_remote_uri() at xarray/core/utils.py:641 
def search() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/re/__init__.py:174 
def try_read_magic_number_from_path() at xarray/core/utils.py:664 
def read_magic_number_from_file() at xarray/core/utils.py:650

What are we seeing?#

  • xarray uses guess_engine() to identify which backend can open the file.

  • guess_engine() loops through the installed backends and runs each one’s guess_can_open().

  • If an engine can handle the file type, xarray verifies that we are working with a local file.

  • Once we know which backend to use, xarray invokes that backend’s implementation of open_dataset().
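We can poke at this machinery directly. Every backend entrypoint exposes guess_can_open(), so we can ask each installed backend whether it would claim a given filename. The path below is hypothetical and does not need to exist, since guessing falls back to the file extension when the file cannot be read:

```python
import xarray as xr

# Ask every installed backend whether it would claim a ".nc" filename.
# The path is hypothetical: guess_can_open() falls back to the file
# extension when the file itself cannot be read.
engines = xr.backends.list_engines()
for name, backend in engines.items():
    print(f"{name}: {backend.guess_can_open('sst.mnmean.nc')}")
```

Backends that handle netCDF (netcdf4, h5netcdf, scipy) should answer True here, while backends like zarr and store answer False.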

Let’s tell xarray which backend we need for our local file.

tracing_output = []

sys.settrace(trace_calls)
ds = xr.open_dataset(localPath, engine="h5netcdf")
sys.settrace(None)

# Print the top 10 calls to public methods
Code(" \n".join(tracing_output[0:10]), language='python')
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def helper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/contextlib.py:299 
def open_dataset() at xarray/backends/api.py:391 
def get_backend() at xarray/backends/plugins.py:200 
def open_dataset() at xarray/backends/h5netcdf_.py:383 
def is_remote_uri() at xarray/core/utils.py:641 
def search() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/re/__init__.py:174 
def open() at xarray/backends/h5netcdf_.py:135 
def update_wrapper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/functools.py:35

It is important to note that the pre-installed xarray backends overlap: many of them support the same formats (e.g., NetCDF-4), and xarray tries them in a specific order unless a particular backend is specified. For example, when we request the h5netcdf engine, xarray will not attempt to guess the backend. It will still check whether the URI is remote, which involves some calls to a context manager. By examining the call stack, we can observe the use of a file handler and a cache, both crucial for efficiently accessing remote files.

Supported file formats by backend#

The open_dataset() method is our entry point to n-dimensional data with xarray. The first argument indicates what we want to open; xarray uses it to pick the right backend, and the backend in turn uses it to open the file locally or remotely. The accepted types are:

  • str: “my-file.nc” or “s3://my-zarr-store/data.zarr”

  • os.PathLike: a POSIX-compatible path, most often a cross-platform pathlib.Path.

  • BufferedIOBase: some xarray backends can read data from a buffer; this is key for remote access.

  • AbstractDataStore: the generic store; backends should subclass it. If we do, we can pass a “store” to xarray, as in the case of OPeNDAP/pydap.
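To see the BufferedIOBase case in action, we can round-trip a small dataset through an in-memory buffer, with no file on disk involved. This is a minimal sketch; it assumes the scipy backend is installed, since to_netcdf() serializes to bytes with it when no target path is given:

```python
import io

import numpy as np
import xarray as xr

# Serialize a tiny dataset to bytes (the scipy backend / netCDF3 is used
# when no target path is given), then read it back from a seekable buffer.
ds = xr.Dataset({"sst": ("time", np.array([20.1, 20.5, 20.3]))})
buf = io.BytesIO(ds.to_netcdf())

round_tripped = xr.open_dataset(buf)
print(round_tripped.sst.values)  # [20.1 20.5 20.3]
```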

# List which backends are available; if we install more, they will show up here.
xr.backends.list_engines()
{'netcdf4': <NetCDF4BackendEntrypoint>
   Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using netCDF4 in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.NetCDF4BackendEntrypoint.html,
 'h5netcdf': <H5netcdfBackendEntrypoint>
   Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using h5netcdf in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.H5netcdfBackendEntrypoint.html,
 'scipy': <ScipyBackendEntrypoint>
   Open netCDF files (.nc, .nc4, .cdf and .gz) using scipy in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.ScipyBackendEntrypoint.html,
 'pydap': <PydapBackendEntrypoint>
   Open remote datasets via OPeNDAP using pydap in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.PydapBackendEntrypoint.html,
 'rasterio': <RasterioBackend>,
 'store': <StoreBackendEntrypoint>
   Open AbstractDataStore instances in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.StoreBackendEntrypoint.html,
 'zarr': <ZarrBackendEntrypoint>
   Open zarr files (.zarr) using zarr in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.ZarrBackendEntrypoint.html}

Trying to access a file on cloud storage (AWS S3)#

Now let’s try to open a file on a remote file system. This will fail, and we’ll look at why it failed and how we’ll use fsspec to overcome it.

try:
    ds = xr.open_dataset("s3://its-live-data/test-space/sample-data/sst.mnmean.nc")
except Exception as e:
    print(e)
[Errno -72] NetCDF: Malformed or inaccessible DAP2 DDS or DAP4 DMR response: 's3://its-live-data/test-space/sample-data/sst.mnmean.nc'
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>its-live-data.s3-us-west-2.amazonaws.com</Endpoint><Bucket>its-live-data</Bucket><RequestId>KBRCBNG0B4JGA0Y8</RequestId><HostId>m2w2nKAOfXKu/OpmbwjmbkSQ+wuD5PlGZthEByc2qzJrBCYqcHveuy4IH81adK4TYj9njO9dn27krPM0cH7LJHAZupyz02J5</HostId></Error>

xarray iterated through the registered backends, and netcdf4 answered “yes, I can open that extension”; see netCDF4_.py#L618. However, the backend doesn’t know how to “talk” to a remote store, so it fails to open our file.

Supported format + Read from Buffers = Remote access#

Some of xarray’s backends can read and write data in memory; coupled with fsspec’s ability to abstract remote files, this lets us access remote files as if they were local. The following table helps identify whether a backend can be used to access remote files with fsspec.

| Backend  | HDF/NetCDF Support | Can Read from Buffer | Handles Own I/O |
|----------|--------------------|----------------------|-----------------|
| netCDF4  | Yes                | No                   | Yes             |
| scipy    | Limited            | Yes                  | Yes             |
| pydap    | Yes                | No                   | No              |
| h5netcdf | Yes                | Yes                  | Yes             |
| zarr     | No                 | Yes                  | Yes             |
| cfgrib   | Yes                | No                   | Yes             |
| rasterio | Partial            | Yes                  | No              |

Can Read from Buffer: libraries that can read from buffers do not need to open a file through the operating system’s machinery; they can read our files from memory in whatever way we want, as long as we have a seekable buffer (random access).

Handles Own I/O: some libraries have self-contained code that handles I/O, compression, codecs, and data access. Others delegate their I/O to lower-level libraries; this is the case with rasterio, which uses GDAL to access raster files. If a library is in control of its own I/O operations, it can easily be adapted to read from buffers.

graph TD
  A["netCDF-4 (.nc, .nc4) and most HDF5 files"] -->|netcdf4| B["Remote Access: No"]
  A -->|h5netcdf| C["Remote Access: Yes"]
  D["netCDF files (.nc, .cdf, .gz)"] -->|scipy| E["Remote Access: Yes"]
  F["zarr files (.zarr)"] -->|zarr| G["Remote Access: Yes"]
  H["OpenDAP"] -->|pydap| I["Remote Access: Yes"]

Remote Access and File Caching#

When we use fsspec to abstract a remote file, we are in essence translating byte requests into HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, libraries that handle data transfers over the network commonly implement a cache to avoid requesting the same data over and over. fsspec offers several ways to handle this caching, and choosing between them is one of the most relevant performance considerations when we work with xarray and remote data.
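The uniform file interface is what makes this translation possible: an fsspec file object behaves like a seekable buffer no matter what backs it. A minimal sketch using the built-in memory filesystem (for a remote protocol like http or s3, the same seek() and read() calls would be translated into range requests):

```python
import fsspec

# The "memory" filesystem keeps files in-process but exposes the same
# file-like API as remote protocols such as http, s3, or gcs.
fs = fsspec.filesystem("memory")
with fs.open("/demo.bin", "wb") as f:
    f.write(b"0123456789")

with fs.open("/demo.bin", "rb") as f:
    f.seek(4)  # over HTTP this offset would become a Range header
    print(f.read(3))  # b'456'
```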

fsspec’s default cache is called read-ahead; as its name suggests, it reads a fixed number of bytes ahead of each request. This is good when we work with text or tabular data, but it is an anti-pattern for scientific data formats. Benchmarks show that any of the other caching schemes performs better than the default read-ahead.

fsspec caching implementations#

simple cache + open_local()#

The simplest way to use fsspec is to cache remote files locally. Since the cache uses local storage, backends like netcdf4 will be reading from disk, avoiding the issue of not being able to read directly from buffers. This pattern can be applied to backends that don’t support buffers, with the disadvantage that we cache whole files and use disk space.

import fsspec

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"
# we prepend the cache type to the URI; this is called protocol chaining in fsspec-speak.
# Options for each chained protocol are passed via a keyword argument matching its name.
file = fsspec.open_local(f"simplecache::{uri}", simplecache={"cache_storage": "/tmp/fsspec_cache"})

ds = xr.open_dataset(file, engine="netcdf4")
ds
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

block cache + open()#

If our backend supports reading from a buffer, we can cache only the parts of the file that we actually read. This is useful but tricky: as mentioned before, fsspec’s default cache requests about 5 MB ahead of the byte offset we ask for, so if we are reading small chunks from our file it will be really slow and incur unnecessary transfers.

Let’s open the same file using the h5netcdf engine, this time with a block-cache strategy that stores blocks of a predefined size from our remote file.

%%time
uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"

fs = fsspec.filesystem('http')

fsspec_caching = {
    "cache_type": "blockcache",  # blockcache stores fixed-size blocks and evicts them with an LRU strategy.
    "block_size": 8 * 1024 * 1024,  # block size in bytes; tune it to the file, but a few MB is recommended.
}

# Note: with a context manager the file is closed at the end of the block,
# so later xarray operations may fail unless we load our data arrays first.
with fs.open(uri, **fsspec_caching) as file:
    ds = xr.open_dataset(file, engine="h5netcdf")
    mean = ds.sst.mean()
ds
CPU times: user 75.7 ms, sys: 4.02 ms, total: 79.7 ms
Wall time: 164 ms
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

Reading data from cloud storage#

So far we have only used HTTP to access a remote file; however, commercial clouds have their own storage implementations with specific features. fsspec lets us talk to these different implementations while hiding the details from us and from the libraries we use. Now we are going to access the same file using the S3 protocol.

Note: S3, Azure Blob, etc. all have their own names and prefixes, but under the hood they still speak HTTP.
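The range of protocols fsspec can dispatch on is visible in its registry of known implementations (an internal but long-standing mapping; many entries require an extra package such as s3fs or gcsfs to actually be usable):

```python
from fsspec.registry import known_implementations

# Protocol names fsspec recognizes in URIs like "s3://..." or "gs://...".
print(sorted(known_implementations))
```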

%%time
uri = "s3://its-live-data/test-space/sample-data/sst.mnmean.nc"

# If we need to pass credentials to our remote storage we can do it here;
# this is a public bucket, so we use anonymous access.
fs = fsspec.filesystem('s3', anon=True)

fsspec_caching = {
    "cache_type": "blockcache",  # blockcache stores fixed-size blocks and evicts them with an LRU strategy.
    "block_size": 8 * 1024 * 1024,  # block size in bytes; tune it to the file, but a few MB is recommended.
}

# Since we are not using a context manager, we can use ds until we manually close it.
ds = xr.open_dataset(fs.open(uri, **fsspec_caching), engine="h5netcdf")
ds
CPU times: user 150 ms, sys: 36.8 ms, total: 186 ms
Wall time: 467 ms
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

Key Takeaways#

  1. fsspec and remote access.

fsspec is a Python library that provides a unified interface to various filesystems, enabling access to local, remote, and cloud storage systems. It supports a wide range of protocols such as http, https, s3, gcs, ftp, and many more. One of the key features of fsspec is its ability to cache remote files locally, improving performance by reducing latency and bandwidth usage.

  2. xarray Backends.

xarray backends offer flexible support for opening datasets stored in different formats and locations. By leveraging various backends along with fsspec, we can open, read, and analyze complex datasets efficiently, without worrying about the underlying file format or storage mechanism.

  3. Combining fsspec with xarray

xarray can work with fsspec file systems to open and cache remote files, applying caching strategies that optimize data transfer.

By leveraging these tools and techniques, you can efficiently manage and process large, remote datasets in a way that optimizes performance and accessibility.