Data Tidying

Data Tidying#

Array data that are represented by Xarray objects are often multivariate, multi-dimensional, and very complex. Part of the beauty of Xarray is that it is adaptable and scalable to represent a large number of data structures. However, this can also introduce difficulty (especially for learning users) in arriving at a workable structure that will best suit one’s analytical needs.

A brief primer on tidy data#

Tidy data was developed by Hadley Wickham for tabular datasets in the R programming language. Many resources comprehensively explain this concept and the ecosystem of tools built upon it. Below is a very brief explanation:

Data tidying is the process of structuring datasets to facilitate analysis. Wickham writes: “…tidy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)” (Wickham, 2014).

Tidy data principles for tabular datasets#

The concept of tidy data was developed by Hadley Wickham in the R programming language, and is a set of principles to guide facilitating tabular data for analysis.

{attribution=”Wickham, 2014”}

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

Wickham defines three core principles of tidy data for tabular principles. They are:

Each variable forms an observation
Each observation forms a row
Each type of observational unit forms a table

Imagining a ‘tidy data’ framework for gridded datasets#

Common use-case: Manipulating individual observations to an x-y-time datacube#

Data downloaded or accessed from DAACs and other providers is often (for good reason) separated into temporal observations or spatial subsets. This minimizes the services that must be provided for different datasets and allows the user to access just the material that they need. However, most workflows will involve some sort of spatial and/or temporal investigation of an observable, which will usually require the analyst to arrange individual files into spatial mosaics and/or temporal cubes. In addition to being a source of duplicated effort and work, these steps also introduce decision-points that can be stumbling blocks for newer users. We hope a tidy framework for xarray will streamline the process of preparing data for analysis by providing specific expectations of what ‘tidied’ datasets look like as well as common patterns and tools to use to arrive at a tidy state.

Tidy data principles for Xarray data structures#

These are guidelines to keep in mind while you are organizing your data. For detailed definitions of the terms mentioned below (and more), check out Xarray’s Terminology page.

1. Dimensions

Minimize the number of dimensional coordinates

2. Coordinates

Non-dimensional coordinates can be numerous. Each should exist along one or multiple dimensions

3. Data Variables

Data variables should be observables rather than contextual. Each should exist along one or multiple dimensions.

4. Contextual information (metadata)

Metadata should only be stored as an attribute if it is static along the dimensions to which it is applied.
If metadata is dynamic, it should be stored as a coordinate variable.
Metadata attrs should be added such that dataset is self-describing (following CF-conventions)

5. Variable, attribute naming

Wherever possible, use cf-conventions for naming
Variable names should be descriptive
Variable names should not contain information that belongs in a dimension or coordinate (ie. information stored in a variable name should be reduced to only the observable the variable describes.

6. Make us of & work within the framework of other tools

Specification systems such as CF and STAC, and related tools such as Open Data Cube, PySTAC, cf_xarray,stackstac and more make tidying possible and smoother, especially with large, cloud-optimized datasets.

Other guidelines and rules of thumb#

Avoid storing important data in filenames
Non-descriptive variable names can create + perpetuate confusion
Missing coordinate information makes datasets harder to use
Elements of a dataset’s ‘shape’/structure can sometimes be embedded in variable names; this will complicate subsequent analysis

Contributing#

We would love your help and engagement on this project! If you have a dataset that you’ve worked with that felt particularly messy, or one with steps you find yourself thinking back to as you work with new datasets, consider submitting it as an example! If you have input on tidy principles, please feel free to raise an issue.