Boolean Indexing & Masking#

Learning Objectives#

  • The concept of boolean masks

  • Dropping/Masking data using where

  • Using isin for creating a boolean mask

Overview#

Boolean masking, known as boolean indexing, is a functionality in Python that enables the filtering of values based on a specific condition.

A boolean mask refers to a binary array or a boolean-valued (True/False) array that is used as a filter to select specific elements from another array. The boolean mask acts as a criterion or condition, where each element in the mask corresponds to an element in the target array. An element in the target array is selected when the corresponding mask value is True.

Xarray provides different capabilities to allow filtering and boolean indexing. In this notebook, we will learn more about it.

First, let’s import the packages needed for this notebook:

import cartopy.crs as ccrs
import numpy as np
import xarray as xr
from matplotlib import pyplot as plt
import matplotlib as mpl

xr.set_options(display_expand_attrs=False)
np.set_printoptions(threshold=10, edgeitems=2)

In this tutorial, we’ll use the Regional Arctic System Mode (RASM) example dataset

ds = xr.tutorial.load_dataset("rasm").isel(time=0)
ds
<xarray.Dataset>
Dimensions:  (y: 205, x: 275)
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (y, x) float64 nan nan nan nan nan ... 27.91 27.02 26.56 26.73
Attributes: (11)

In this dataset, the logical coordinates are x and y, while the physical coordinates are xc and yc, which represent the latitudes and longitude of the data.

print(ds.xc.attrs)
print(ds.yc.attrs)
{'long_name': 'longitude of grid cell center', 'units': 'degrees_east'}
{'long_name': 'latitude of grid cell center', 'units': 'degrees_north'}
da = ds.Tair
da
<xarray.DataArray 'Tair' (y: 205, x: 275)>
array([[        nan,         nan, ...,         nan,         nan],
       [        nan,         nan, ...,         nan,         nan],
       ...,
       [        nan,         nan, ..., 26.80261869, 27.08603517],
       [        nan,         nan, ..., 26.56473862, 26.73064933]])
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Attributes: (4)

Masking with where()#

Indexing methods on Xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked.

By applying .where(), the original data’s shape is maintained, with values masked based on a Boolean condition. Values that satisfy the condition (True) are returned unchanged, while values that do not meet the condition (False) are replaced with a predefined value.

In the example below, we replace all nan values with -9999:

# Let's replace the missing values (nan) with some placeholder
ds.Tair.where(ds.Tair.notnull(), -9999)
<xarray.DataArray 'Tair' (y: 205, x: 275)>
array([[-9999.        , -9999.        , ..., -9999.        ,
        -9999.        ],
       [-9999.        , -9999.        , ..., -9999.        ,
        -9999.        ],
       ...,
       [-9999.        , -9999.        , ...,    26.80261869,
           27.08603517],
       [-9999.        , -9999.        , ...,    26.56473862,
           26.73064933]])
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Attributes: (4)

As you can see, in the example above .where() preserved the shape of the original data by masking the values with a boolean condition.

Most uses of .where() check whether or not specific data values are less than or greater than a constant value.

The data values specified in the boolean condition of .where() can be any of the following:

  • a DataArray

  • a Dataset

  • a function

In the following example, we make use of .where() to mask all temperature below 0°C.

da_masked = da.where(da >= 0)

# -- making both plots for comparison:
fig, axes = plt.subplots(ncols=2, figsize=(15, 5))

# -- for reference (without masking):
da.plot(ax=axes[0], vmin=-30, vmax=30, cmap=mpl.cm.RdBu_r)

# -- masked DataArray
da_masked.plot(ax=axes[1], vmin=-30, vmax=30, cmap=mpl.cm.RdBu_r);
../../_images/589755694fd3ec75b791773e0df831b37ff1ca8c02d5da1fe58754b513799585.png

Tip

By default Xarray set the masked values to nan. But as we saw in the first example, we can set it to other values too.

Exercise 16

Using the syntax you’ve learned so far, mask all the points with latitudes above 60° N.

# write your answer here!

As mentioned above, by default where maintains the original size of the data. You can use the option drop=True to clip coordinate elements that are fully masked:

da_masked = da.where(da.yc > 60, drop=True)
da_masked.plot();
../../_images/3caedffe716b9417f80cf4f2743e06d9e709d58a7fd6402afcdf933f353f4f01.png

Please note that in this dataset, the variables xc (longitude) and yc (latitude) are two-dimensional scalar fields.

When we plotted the data variable Tair, by default we get the logical coordinates (i.e. x and y) as we show in the example above.

In order to visualize the data on a conventional latitude-longitude grid, we can take advantage of Xarray’s ability to apply cartopy map projections.

plt.figure(figsize=(14, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ds.Tair.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x="xc", y="yc", add_colorbar=False)
ax.coastlines()
ax.set_ylim([20, 90]);
../../_images/67ef7a4faa891fce7a3e769e63e4c4c9b8a74076aa9448a8a6e665be0878a91a.png

Using where with Multiple Conditions#

In Xarray’s .where() function, boolean conditions can be combined using logical operators. The bitwise and operator (&) and the bitwise or operator (|) are relevant in this case. This allows for specifying multiple masking conditions within a single .where() statement.

We can select data for one specific region using bound boxes. For example, here we want to access data over a region over Alaska :

# -- define a region
min_lon = 190
min_lat = 55
max_lon = 230
max_lat = 85

First we have to create our boolean masks:

mask_lon = (ds.xc >= min_lon) & (ds.xc <= max_lon)
mask_lat = (ds.yc >= min_lat) & (ds.yc <= max_lat)

Next, we can use the boolean masks for filtering data for that region:

da_masked = da.where(mask_lon & mask_lat, drop=True)

da_masked.plot();
../../_images/dd84600893222ce0b81996daa1a515cf26da52d6fc596e5156d5eb8487e03a18.png
plt.figure(figsize=(5, 5))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
da_masked.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x="xc", y="yc", add_colorbar=False)
ax.coastlines()
ax.set_ylim([50, 80])
ax.set_xlim([-180, -120]);
/home/runner/micromamba/envs/xarray-tutorial/lib/python3.11/site-packages/cartopy/io/__init__.py:241: DownloadWarning: Downloading: https://naturalearth.s3.amazonaws.com/50m_physical/ne_50m_coastline.zip
  warnings.warn(f'Downloading: {url}', DownloadWarning)
../../_images/dab50bb6c8da498ca10cc88d6d1f4b8b0e22a9646b0d0813d78a549ab978350a.png

Exercise#

If we load air temperature dataset from NCEP, we could use sel method for selecting a region:

Exercise 17

If we load air temperature dataset from NCEP, we could use sel method for selecting a region:

ds = xr.tutorial.open_dataset("air_temperature")
ds_region = ds.sel(lat=slice(75,50), lon=slice(250,300))

ds_region.air.plot();

Can you use a similar method as above using sel to crop a region using the RASM dataset? Why?

Using xr.where with a Function#

We can use xr.where with a function as a condition too. For example, here we want to convert temperature to Kelvin and find if temperature is greater than 280 K:

# Define a function to use as a condition
def is_greater_than_threshold(x, threshold=300):
    # function to convert temp to K
    # and compare with threshold
    x = x + 273.15
    return x > threshold


# Apply the condition using xarray.where()
masked_data = xr.where(is_greater_than_threshold(da, 280), da, 0)

masked_data.plot()
<matplotlib.collections.QuadMesh at 0x7fa8bc912090>
../../_images/78668ecd6d6813cde8056b55d253275cc81ed76a1a6c6e1eab214dfbc58fa0c6.png

Selecting Values with isin#

To check whether elements of an xarray object contain a single object, you can compare with the equality operator == (e.g., arr == 3).

To check multiple values, we use isin():

Here is a simple example:

x_da = xr.DataArray([1, 2, 3, 4, 5], dims=["x"])

# -- select points with values equal to 2 and 4:
x_da.isin([2, 4])
<xarray.DataArray (x: 5)>
array([False,  True, False,  True, False])
Dimensions without coordinates: x

Tip

isin() works particularly well with where() to support indexing by arrays that are not already labels of an array.

For example, we have another DataArray that displays the status flags of the data-collecting device for our data.

Here, flags with value 0 and -1 signifies the device was functioning correctly, while 0 indicates a malfunction, implying that the resulting data collected may not be accurate.

flags = xr.DataArray(np.random.randint(-1, 5, da.shape), dims=da.dims, coords=da.coords)
flags
<xarray.DataArray (y: 205, x: 275)>
array([[ 3, -1, ...,  0,  2],
       [-1,  2, ...,  0,  1],
       ...,
       [-1,  1, ..., -1,  3],
       [ 1,  0, ...,  3,  3]])
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x

Now, we want to only see the data for points where out measurement device is working correctly:

da_masked = da.where(flags.isin([1, 2, 3, 4, 5]), drop=True)
da_masked.plot();
../../_images/f26a294054762a24ef619f5650cf9fe12f419c202b80cb6c0e717f266524869a.png

Warning

Please note that when done repeatedly, this type of indexing is significantly slower than using sel().

Use sel instead of where as much as possible.

Additional Resources#