Access Patterns to Remote Data with fsspec#

Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs; the CMIP6 tutorial shows this pattern in detail. These formats were designed to be accessed efficiently over the internet. In many cases, however, we need to access data that is not available in such formats.

This notebook explores how we can leverage xarray’s backends to access remote files. For this we will use fsspec, a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API usable by many format-specific libraries.

Before working with remote data, it helps to understand how xarray handles local files and how xarray backends work. The following diagram shows the components involved in accessing data either locally or remotely using the h5netcdf backend, which relies on a format-specific library to access HDF5 files.

(Diagram: xarray data-access components for local and remote files)

Let’s consider a scenario where we have a local NetCDF4 file containing gridded data. NetCDF is a common file format used in scientific research for storing array-like data.

import xarray as xr

localPath = "../../data/sst.mnmean.nc"

ds = xr.open_dataset(localPath)
ds
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

xarray backends under the hood#

  • What happened when we ran xr.open_dataset("path-to-file")?

As we know, xarray is a flexible and modular library. When we open a file, we ask xarray to use one of its format-specific engines to load the actual array data from the file into memory. File formats come in different flavors, from general-purpose HDF5 to very domain-specific ones like GRIB2. When we call open_dataset(), the first thing xarray does is try to guess which of the installed backends can handle the file; in this case we pass a string with a valid local path.

We’ll use a helper function to print a simplified call stack and see what’s going on under the hood.

import sys
from IPython.display import Code


tracing_output = []
_match_pattern = "xarray"


def trace_calls(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        func_name = code.co_name
        func_file = code.co_filename.split("/site-packages/")[-1]
        func_line = code.co_firstlineno
        if not func_name.startswith("_") and _match_pattern in func_file:
            tracing_output.append(f"def {func_name}() at {func_file}:{func_line}")
    return trace_calls


# we enable tracing and call open_dataset()
sys.settrace(trace_calls)
ds = xr.open_dataset(localPath)
sys.settrace(None)

# Print the trace with some syntax highlighting
Code(" \n".join(tracing_output[0:10]), language='python')
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def helper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/contextlib.py:299 
def open_dataset() at xarray/backends/api.py:391 
def guess_engine() at xarray/backends/plugins.py:147 
def guess_can_open() at xarray/backends/netCDF4_.py:607 
def is_remote_uri() at xarray/core/utils.py:641 
def search() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/re/__init__.py:174 
def try_read_magic_number_from_path() at xarray/core/utils.py:664 
def read_magic_number_from_file() at xarray/core/utils.py:650

What are we seeing?#

  • xarray uses guess_engine() to identify which backend can open the file.

  • guess_engine() loops through the installed backends and runs each one’s guess_can_open().

  • If an engine can handle the file type, xarray verifies that we are working with a local file.

  • Once we know which backend to use, xarray invokes that backend’s implementation of open_dataset().
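We can poke at this machinery directly. Every backend entrypoint exposes guess_can_open(), so we can ask each installed backend whether it would claim a given filename. The path below is hypothetical and does not need to exist, since guessing falls back to the file extension when the file cannot be read:

```python
import xarray as xr

# Ask every installed backend whether it would claim a ".nc" filename.
# The path is hypothetical: guess_can_open() falls back to the file
# extension when the file itself cannot be read.
engines = xr.backends.list_engines()
for name, backend in engines.items():
    print(f"{name}: {backend.guess_can_open('sst.mnmean.nc')}")
```

Backends that handle netCDF (netcdf4, h5netcdf, scipy) should answer True here, while backends like zarr and store answer False.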

Let’s tell xarray which backend we need for our local file.

tracing_output = []

sys.settrace(trace_calls)
ds = xr.open_dataset(localPath, engine="h5netcdf")
sys.settrace(None)

# Print the top 10 calls to public methods
Code(" \n".join(tracing_output[0:10]), language='python')
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def cast() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/typing.py:2173 
def helper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/contextlib.py:299 
def open_dataset() at xarray/backends/api.py:391 
def get_backend() at xarray/backends/plugins.py:200 
def open_dataset() at xarray/backends/h5netcdf_.py:383 
def is_remote_uri() at xarray/core/utils.py:641 
def search() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/re/__init__.py:174 
def open() at xarray/backends/h5netcdf_.py:135 
def update_wrapper() at /home/runner/micromamba/envs/xarray-tutorial/lib/python3.12/functools.py:35

It is important to note that the pre-installed xarray backends overlap: many of them support the same formats (e.g., NetCDF-4), and xarray tries them in a specific order unless a particular backend is specified. For example, when we request the h5netcdf engine, xarray will not attempt to guess the backend. It will still check whether the URI is remote, which involves some calls to a context manager. By examining the call stack, we can observe the use of a file handler and a cache, both crucial for efficiently accessing remote files.

Supported file formats by backend#

The open_dataset() method is our entry point to n-dimensional data with xarray. The first argument indicates what we want to open; xarray uses it to pick the right backend, and the backend in turn uses it to open the file locally or remotely. The accepted types are:

  • str: “my-file.nc” or “s3://my-zarr-store/data.zarr”

  • os.PathLike: a POSIX-compatible path, most often a cross-platform pathlib.Path.

  • BufferedIOBase: some xarray backends can read data from a buffer; this is key for remote access.

  • AbstractDataStore: the generic store; backends should subclass it. If we do, we can pass a “store” to xarray, as in the case of OPeNDAP/pydap.
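To see the BufferedIOBase case in action, we can round-trip a small dataset through an in-memory buffer, with no file on disk involved. This is a minimal sketch; it assumes the scipy backend is installed, since to_netcdf() serializes to bytes with it when no target path is given:

```python
import io

import numpy as np
import xarray as xr

# Serialize a tiny dataset to bytes (the scipy backend / netCDF3 is used
# when no target path is given), then read it back from a seekable buffer.
ds = xr.Dataset({"sst": ("time", np.array([20.1, 20.5, 20.3]))})
buf = io.BytesIO(ds.to_netcdf())

round_tripped = xr.open_dataset(buf)
print(round_tripped.sst.values)  # [20.1 20.5 20.3]
```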

# List which backends are available; if we install more, they will show up here.
xr.backends.list_engines()
{'netcdf4': <NetCDF4BackendEntrypoint>
   Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using netCDF4 in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.NetCDF4BackendEntrypoint.html,
 'h5netcdf': <H5netcdfBackendEntrypoint>
   Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using h5netcdf in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.H5netcdfBackendEntrypoint.html,
 'scipy': <ScipyBackendEntrypoint>
   Open netCDF files (.nc, .nc4, .cdf and .gz) using scipy in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.ScipyBackendEntrypoint.html,
 'pydap': <PydapBackendEntrypoint>
   Open remote datasets via OPeNDAP using pydap in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.PydapBackendEntrypoint.html,
 'rasterio': <RasterioBackend>,
 'store': <StoreBackendEntrypoint>
   Open AbstractDataStore instances in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.StoreBackendEntrypoint.html,
 'zarr': <ZarrBackendEntrypoint>
   Open zarr files (.zarr) using zarr in Xarray
   Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.ZarrBackendEntrypoint.html}

Trying to access a file on cloud storage (AWS S3)#

Now let’s try to open a file on a remote file system. This will fail, and we’ll look at why it failed and how we’ll use fsspec to overcome it.

try:
    ds = xr.open_dataset("s3://its-live-data/test-space/sample-data/sst.mnmean.nc")
except Exception as e:
    print(e)
[Errno -72] NetCDF: Malformed or inaccessible DAP2 DDS or DAP4 DMR response: 's3://its-live-data/test-space/sample-data/sst.mnmean.nc'
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>its-live-data.s3-us-west-2.amazonaws.com</Endpoint><Bucket>its-live-data</Bucket><RequestId>KBRCBNG0B4JGA0Y8</RequestId><HostId>m2w2nKAOfXKu/OpmbwjmbkSQ+wuD5PlGZthEByc2qzJrBCYqcHveuy4IH81adK4TYj9njO9dn27krPM0cH7LJHAZupyz02J5</HostId></Error>

xarray iterated through the registered backends, and netcdf4 answered “yes, I can open that extension”; see netCDF4_.py#L618. However, the backend doesn’t know how to “talk” to a remote store, so it fails to open our file.

Supported format + Read from Buffers = Remote access#

Some of xarray’s backends can read and write data in memory; coupled with fsspec’s ability to abstract remote files, this lets us access remote files as if they were local. The following table helps identify whether a backend can be used to access remote files with fsspec.

| Backend  | HDF/NetCDF Support | Can Read from Buffer | Handles Own I/O |
|----------|--------------------|----------------------|-----------------|
| netCDF4  | Yes                | No                   | Yes             |
| scipy    | Limited            | Yes                  | Yes             |
| pydap    | Yes                | No                   | No              |
| h5netcdf | Yes                | Yes                  | Yes             |
| zarr     | No                 | Yes                  | Yes             |
| cfgrib   | Yes                | No                   | Yes             |
| rasterio | Partial            | Yes                  | No              |

Can Read from Buffer: libraries that can read from buffers do not need to open a file through the operating system’s machinery; they can read our files from memory in whatever way we want, as long as we have a seekable buffer (random access).

Handles Own I/O: some libraries have self-contained code that handles I/O, compression, codecs, and data access. Others delegate their I/O to lower-level libraries; this is the case with rasterio, which uses GDAL to access raster files. If a library is in control of its own I/O operations, it can easily be adapted to read from buffers.

graph TD
  A["netCDF-4 (.nc, .nc4) and most HDF5 files"] -->|netcdf4| B["Remote Access: No"]
  A -->|h5netcdf| C["Remote Access: Yes"]
  D["netCDF files (.nc, .cdf, .gz)"] -->|scipy| E["Remote Access: Yes"]
  F["zarr files (.zarr)"] -->|zarr| G["Remote Access: Yes"]
  H["OpenDAP"] -->|pydap| I["Remote Access: Yes"]

Remote Access and File Caching#

When we use fsspec to abstract a remote file, we are in essence translating byte requests into HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, libraries that handle data transfers over the network commonly implement a cache to avoid requesting the same data over and over. fsspec offers several ways to handle this caching, and choosing between them is one of the most relevant performance considerations when we work with xarray and remote data.
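The uniform file interface is what makes this translation possible: an fsspec file object behaves like a seekable buffer no matter what backs it. A minimal sketch using the built-in memory filesystem (for a remote protocol like http or s3, the same seek() and read() calls would be translated into range requests):

```python
import fsspec

# The "memory" filesystem keeps files in-process but exposes the same
# file-like API as remote protocols such as http, s3, or gcs.
fs = fsspec.filesystem("memory")
with fs.open("/demo.bin", "wb") as f:
    f.write(b"0123456789")

with fs.open("/demo.bin", "rb") as f:
    f.seek(4)  # over HTTP this offset would become a Range header
    print(f.read(3))  # b'456'
```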

fsspec’s default cache is called read-ahead; as its name suggests, it reads a fixed number of bytes ahead of each request. This is good when we work with text or tabular data, but it is an anti-pattern for scientific data formats. Benchmarks show that any of the other caching schemes performs better than the default read-ahead.

fsspec caching implementations#

simple cache + open_local()#

The simplest way to use fsspec is to cache remote files locally. Since the cache uses local storage, backends like netcdf4 will be reading from disk, avoiding the issue of not being able to read directly from buffers. This pattern can be applied to backends that don’t support buffers, with the disadvantage that we cache whole files and use disk space.

import fsspec

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"
# we prepend the cache type to the URI; this is called protocol chaining in fsspec-speak.
# Options for each chained protocol are passed via a keyword argument matching its name.
file = fsspec.open_local(f"simplecache::{uri}", simplecache={"cache_storage": "/tmp/fsspec_cache"})

ds = xr.open_dataset(file, engine="netcdf4")
ds
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

block cache + open()#

If our backend supports reading from a buffer, we can cache only the parts of the file that we actually read. This is useful but tricky: as mentioned before, fsspec’s default cache requests about 5 MB ahead of the byte offset we ask for, so if we are reading small chunks from our file it will be really slow and incur unnecessary transfers.

Let’s open the same file using the h5netcdf engine, this time with a block-cache strategy that stores blocks of a predefined size from our remote file.

%%time
uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"

fs = fsspec.filesystem('http')

fsspec_caching = {
    "cache_type": "blockcache",  # blockcache stores fixed-size blocks and evicts them with an LRU strategy.
    "block_size": 8 * 1024 * 1024,  # block size in bytes; tune it to the file, but a few MB is recommended.
}

# Note: with a context manager the file is closed at the end of the block,
# so later xarray operations may fail unless we load our data arrays first.
with fs.open(uri, **fsspec_caching) as file:
    ds = xr.open_dataset(file, engine="h5netcdf")
    mean = ds.sst.mean()
ds
CPU times: user 75.7 ms, sys: 4.02 ms, total: 79.7 ms
Wall time: 164 ms
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

Reading data from cloud storage#

So far we have only used HTTP to access a remote file; however, commercial clouds have their own storage implementations with specific features. fsspec lets us talk to these different implementations while hiding the details from us and from the libraries we use. Now we are going to access the same file using the S3 protocol.

Note: S3, Azure Blob, etc. all have their own names and prefixes, but under the hood they still speak HTTP.
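The range of protocols fsspec can dispatch on is visible in its registry of known implementations (an internal but long-standing mapping; many entries require an extra package such as s3fs or gcsfs to actually be usable):

```python
from fsspec.registry import known_implementations

# Protocol names fsspec recognizes in URIs like "s3://..." or "gs://...".
print(sorted(known_implementations))
```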

%%time
uri = "s3://its-live-data/test-space/sample-data/sst.mnmean.nc"

# If we need to pass credentials to our remote storage we can do it here;
# this is a public bucket, so we use anonymous access.
fs = fsspec.filesystem('s3', anon=True)

fsspec_caching = {
    "cache_type": "blockcache",  # blockcache stores fixed-size blocks and evicts them with an LRU strategy.
    "block_size": 8 * 1024 * 1024,  # block size in bytes; tune it to the file, but a few MB is recommended.
}

# Since we are not using a context manager, we can use ds until we manually close it.
ds = xr.open_dataset(fs.open(uri, **fsspec_caching), engine="h5netcdf")
ds
CPU times: user 150 ms, sys: 36.8 ms, total: 186 ms
Wall time: 467 ms
<xarray.Dataset> Size: 8MB
Dimensions:  (lat: 89, lon: 180, time: 128)
Coordinates:
  * lat      (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
    sst      (time, lat, lon) float32 8MB ...
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2020-09-07

Key Takeaways#

  1. fsspec and remote access.

fsspec is a Python library that provides a unified interface to various filesystems, enabling access to local, remote, and cloud storage systems. It supports a wide range of protocols such as http, https, s3, gcs, ftp, and many more. One of the key features of fsspec is its ability to cache remote files locally, improving performance by reducing latency and bandwidth usage.

  2. xarray Backends.

xarray backends offer flexible support for opening datasets stored in different formats and locations. By leveraging various backends along with fsspec, we can open, read, and analyze complex datasets efficiently, without worrying about the underlying file format or storage mechanism.

  3. Combining fsspec with xarray

xarray can work with fsspec file systems to open and cache remote files, applying caching strategies that optimize data transfer.

By leveraging these tools and techniques, you can efficiently manage and process large, remote datasets in a way that optimizes performance and accessibility.