Data Statements | Tech Policy Lab

Data statements provide essential information about the characteristics of datasets, including but not limited to the curation rationale and data sources. The information contained in data statements can be used to help (1) mitigate the harms caused by bias in the dataset (such as a mismatch between training datasets and the contexts where systems are deployed) and (2) create a more inclusive data catalog by identifying gaps. Although they were developed for language data, data statements could be produced for a wide range of data types, with adjustments to account for the unique characteristics of each data type.

Data Statements for Natural Language Processing

This webpage contains information about data statements for language datasets used in natural language processing systems and other digital language projects. The schema elements have been honed to the particular characteristics of language datasets, including speech context, speaker demographic, and annotator demographic. The most recent schema elements (Version 3) are listed here. Detailed definitions of the elements are provided in Creating and Documenting Language Datasets with Data Statements, linked below, along with rationale and suggestions for writing each element as well as general best practices. Version 3 contains two new schema elements, 14 Distribution and 15 Maintenance; two renamed schema elements, 6 Language User Demographic and 8 Linguistic Situation and Text Characteristics; and updated descriptions throughout to better support the creation of dataset documentation with language communities.

Schema Elements Version 3

HEADER
EXECUTIVE SUMMARY
CURATION RATIONALE
DOCUMENTATION FOR SOURCE DATASETS
LANGUAGE VARIETIES
LANGUAGE USER DEMOGRAPHIC
ANNOTATOR DEMOGRAPHIC
LINGUISTIC SITUATION AND TEXT CHARACTERISTICS
PREPROCESSING AND DATA FORMATTING
CAPTURE QUALITY
LIMITATIONS
METADATA
DISCLOSURE AND ETHICAL REVIEW
DISTRIBUTION
MAINTENANCE
OTHER
GLOSSARY
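As a minimal sketch of how the element list above could be used in practice, the following Python snippet builds a Markdown skeleton with one numbered section per schema element and reports which elements a draft is still missing. It is not the official template (which is linked under Writing Data Statements); the function names `render_skeleton`, `found_headings`, and `missing_elements`, the heading layout, and the "TODO" placeholder are all assumptions made for illustration.

```python
# Schema element names copied from the Version 3 list above, in order.
SCHEMA_V3_ELEMENTS = [
    "HEADER",
    "EXECUTIVE SUMMARY",
    "CURATION RATIONALE",
    "DOCUMENTATION FOR SOURCE DATASETS",
    "LANGUAGE VARIETIES",
    "LANGUAGE USER DEMOGRAPHIC",
    "ANNOTATOR DEMOGRAPHIC",
    "LINGUISTIC SITUATION AND TEXT CHARACTERISTICS",
    "PREPROCESSING AND DATA FORMATTING",
    "CAPTURE QUALITY",
    "LIMITATIONS",
    "METADATA",
    "DISCLOSURE AND ETHICAL REVIEW",
    "DISTRIBUTION",
    "MAINTENANCE",
    "OTHER",
    "GLOSSARY",
]


def render_skeleton(title, elements=SCHEMA_V3_ELEMENTS):
    """Return a Markdown skeleton with one numbered section per element.

    Hypothetical layout: a level-1 title, then a level-2 heading and a
    "TODO" placeholder for every schema element.
    """
    lines = [f"# Data Statement for {title}", ""]
    for number, name in enumerate(elements, start=1):
        lines.extend([f"## {number}. {name}", "", "TODO", ""])
    return "\n".join(lines)


def found_headings(document):
    """Collect Markdown heading texts, upper-cased, without leading numbers."""
    found = set()
    for line in document.splitlines():
        if line.startswith("#"):
            text = line.lstrip("#").strip()
            # Drop an optional leading section number such as "14. ".
            text = text.lstrip("0123456789. ").strip()
            found.add(text.upper())
    return found


def missing_elements(document, elements=SCHEMA_V3_ELEMENTS):
    """Return the schema elements that have no matching heading in a draft."""
    present = found_headings(document)
    return [name for name in elements if name not in present]


if __name__ == "__main__":
    draft = "# Draft\n\n## Header\n\n## Curation Rationale\n"
    print(missing_elements(draft))
    print(render_skeleton("Example Corpus"))
```

A check like this only confirms that a section exists for each element. Whether the section is adequately written still depends on the definitions and best practices in the guide.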

Writing Data Statements

GUIDE

(PDF)
(PDF printer-friendly)
(Markdown)

TEMPLATE

(Markdown)
(Overleaf)
(GoogleDoc)

Other Resources

Schema Version 1
Dataset documentation
Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science (Bender & Friedman, 2018)

Schema Version 2
Dataset documentation refined by scientific community engagement
A Guide for Writing Data Statements for Natural Language Processing (Bender, Friedman, & McMillan-Major, 2021)
Data Statements: From Technical Concept to Community Practice (McMillan-Major, Bender, & Friedman, 2023)

Schema Version 3
Dataset documentation and creation with best practices for language community dataset development
Language Dataset Documentation Design: Learning from Deaf and Indigenous Communities (McMillan-Major, 2023)

Data statement samples:

Data Statement for the Public DGS Corpus (v3 available)
Data Statement of the Corpus of Basque Simplified Texts (v3 available)
Data Statement for MuST-SHE (v2)

Table for converting from Schema Version 1 to Schema Version 2
LREC 2020 Workshop ‘Data Statements for NLP: Towards Best Practices’

A Short History of Data Statements

Data statements were first conceptualized in 2017 by Emily M. Bender and Batya Friedman at the University of Washington, where they were initially developed for language datasets used in natural language processing systems. The first version of data statements was published in 2018 in Transactions of the Association for Computational Linguistics and presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). The next two years saw significant interest and uptake. To support broader uptake and to learn how to make data statements a suitable practice across different research and institutional contexts, Emily M. Bender, Batya Friedman, and Angelina McMillan-Major organized a workshop at the 12th Language Resources and Evaluation Conference in 2020. The results of this workshop led to an updated schema (Version 2), a set of best practices, and A Guide for Writing Data Statements, all released in 2021.

Data statements schema Version 2 and Bender, Friedman, and McMillan-Major’s reflections on the documentation development process were published in 2023 in the first issue of the Association for Computing Machinery (ACM) Journal of Responsible Computing. McMillan-Major continued to develop data statements in her dissertation work by shifting the perspective of data statements to include prospective dataset documentation and incorporating language communities as collaborative partners in the dataset curation and documentation process. McMillan-Major and Bender then refined McMillan-Major’s dissertation work into the current version of the schema, Version 3.

Data statements are part of an emerging landscape of documentation toolkits for transparency in data-driven systems, including Datasheets for Datasets, Model Cards for Model Reporting, Dataset Nutrition Labels, Nutrition Labels for Data and Models, FactSheets, and Data Cards.

Acknowledgements

Data statements were developed at the University of Washington by faculty and students from the Department of Linguistics, Information School, Tech Policy Lab, and Value Sensitive Design Lab. This work was supported financially by the UW Tech Policy Lab and the Frances and Howard Nostrand Endowed Professorship. We gratefully acknowledge the intellectual contributions and support of Zeerak Talat and Leon Derczynski, as well as the LREC workshop participants, including Luciana Benotti, Bonaventure F. P. Dossou, Chris Emezue, Itziar Gonzalez-Dios, Amy Isard, Neelam Pirbhai-Jetha, Surangika Ranathunga, Beatrice Savoldi, Marc Schulder, Benjamin Frey, Julie A. Hochgesang, Rose Stamp, Santiago Esteban, Merve Ünlü Menevşe, and many others. The visual design of this web page and the data statements guides reflect the talent and dedication of Elias Greendorfer.

Contact

We’d love to hear from you! If you’re writing or using data statements in research, development, community work or teaching, or have questions or ideas you’d like to share, please let us know.

datastatements@uw.edu