Data Statements

Data statements document the essential characteristics of a dataset, including but not limited to its curation rationale and data sources. This information can help (1) mitigate harms caused by bias in the dataset (such as a mismatch between the training data and the contexts in which a system is deployed) and (2) create a more inclusive data catalog by identifying gaps. Although developed for language data, data statements could be produced for a wide range of data types, with adjustments to account for the characteristics of each type.

Data Statements for Natural Language Processing

This webpage contains information about data statements for language datasets used in natural language processing systems. The schema elements have been tailored to the particular characteristics of language datasets, including speech context, speaker demographic, and annotator demographic. The most recent schema elements (Version 2) are listed below. A Guide for Writing Data Statements, linked below, gives detailed definitions of the elements, the rationale and suggestions for writing each one, and general best practices. A table summarizing the changes from Version 1 to Version 2 can be found under Other Resources below.

Schema Elements Version 2

  1. HEADER
  2. EXECUTIVE SUMMARY
  3. CURATION RATIONALE
  4. DOCUMENTATION FOR SOURCE DATASETS
  5. LANGUAGE VARIETIES
  6. SPEAKER DEMOGRAPHIC
  7. ANNOTATOR DEMOGRAPHIC
  8. SPEECH SITUATION AND TEXT CHARACTERISTICS
  9. PREPROCESSING AND DATA FORMATTING
  10. CAPTURE QUALITY
  11. LIMITATIONS
  12. METADATA
  13. DISCLOSURE AND ETHICAL REVIEW
  14. OTHER
  15. GLOSSARY
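
One way to put the schema to work while drafting is a simple completeness check. The following is a minimal sketch in Python, not part of the data statements project itself: the element names are copied from the list above, while the dictionary representation of a draft, the function name missing_elements, and the sample draft are assumptions made purely for illustration.

```python
# Version 2 schema element names, copied from the list above.
SCHEMA_V2 = [
    "HEADER",
    "EXECUTIVE SUMMARY",
    "CURATION RATIONALE",
    "DOCUMENTATION FOR SOURCE DATASETS",
    "LANGUAGE VARIETIES",
    "SPEAKER DEMOGRAPHIC",
    "ANNOTATOR DEMOGRAPHIC",
    "SPEECH SITUATION AND TEXT CHARACTERISTICS",
    "PREPROCESSING AND DATA FORMATTING",
    "CAPTURE QUALITY",
    "LIMITATIONS",
    "METADATA",
    "DISCLOSURE AND ETHICAL REVIEW",
    "OTHER",
    "GLOSSARY",
]


def missing_elements(draft):
    """Return the schema elements that are absent or blank in a draft.

    `draft` is assumed to map element names to free-text sections
    (a hypothetical representation chosen for this sketch).
    Results are returned in schema order.
    """
    missing = []
    for name in SCHEMA_V2:
        text = draft.get(name)
        # Treat missing keys, non-string values, and whitespace-only text as unwritten.
        if not isinstance(text, str) or not text.strip():
            missing.append(name)
    return missing


# Hypothetical, partly written draft used only to exercise the check.
draft = {
    "HEADER": "Example corpus, draft data statement",
    "CURATION RATIONALE": "Placeholder text describing why the data was collected.",
    "LIMITATIONS": "   ",  # whitespace only, so it still counts as missing
}

for name in missing_elements(draft):
    print("Still to write:", name)
```

A check like this only confirms that each section exists; whether a section is adequate depends on the definitions and best practices in A Guide for Writing Data Statements.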

Writing Data Statements

A Short History of Data Statements

Data statements were first conceptualized in 2017 by Emily M. Bender and Batya Friedman at the University of Washington, where they were initially developed for language datasets used in natural language processing systems. The first version of data statements was published in 2018 in Transactions of the Association for Computational Linguistics and presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). The next two years saw significant interest and uptake. To support broader uptake and to learn how to make data statements a suitable practice across different research and institutional contexts, Emily M. Bender, Batya Friedman, and Angelina McMillan-Major organized a workshop at the 12th Language Resources and Evaluation Conference in 2020. The results of this workshop led to an updated schema (Version 2), a set of best practices, and A Guide for Writing Data Statements, all released in 2021.

Data statements are part of an emerging landscape of documentation toolkits for transparency in artificial intelligence systems, which also includes Datasheets for Datasets, Model Cards for Model Reporting, Dataset Nutrition Labels, Nutrition Labels for Data and Models, FactSheets, and Data Cards.