Data Statements for NLP: Towards Best Practices

May 11 – 13, 2020

Virtual Workshop, 11-13 May 2020

We’ve moved online!

This workshop was originally scheduled to be collocated with LREC 2020 in Marseille, France. We have moved to a virtual workshop, to be held 2-4pm GMT on 11-13 May 2020. Registration for this online workshop is now closed.

Call for Participation

We invite participants who are currently developing NLP datasets to join us for a one-day working meeting at LREC 2020. In this open collaboration session, participants will develop data statements (Bender & Friedman 2018) for specific datasets and, in the process, refine a set of best practices for creating data statements. Specifically, workshop participants will: (1) be introduced to the concept, structure, and uses of data statements; (2) draft a data statement for the dataset(s) they brought to the workshop; (3) work in small groups to critique and refine their data statements; and (4) reflect on best practices for writing and disseminating data statements.

This event will be organized differently from typical workshops. It is an open collaboration session providing a structured opportunity for a diverse range of participants in our community to help shape and codify best practices. The deliverables from this workshop will be (a) data statements for each participant’s data set and (b) a preliminary best practices document. These will be disseminated online, together with the overview materials provided by the workshop organizers, with the data statements providing examples illustrating the results of following the preliminary best practices.

There will be no reviewing process ahead of this workshop, nor any proceedings. All participants are welcome, and we especially encourage attendance by people who are currently developing datasets for NLP. We have a small amount of funding available to support participation in this workshop. The application deadline for that funding is January 15, 2020. For details, see “Financial Support” below.

We will work towards best practices for creating data statements, exploring questions like the following:

How can the information required be efficiently collected?
What steps can be taken in the planning for a dataset to facilitate the collection of relevant metadata about speakers and annotators?
What heuristics are there for writing data statements that are concise and informative?
How can we incorporate material from institutional review board/ethics committee applications into the data statement schema?
How can we best settle on an appropriate level of detail given privacy concerns, especially for small or vulnerable populations?
How can we produce data statements for older datasets that predate this practice?
Finally, how can data statements be incorporated into metadata already associated with data sets, such as the metadata called for by the CLARIN or META-SHARE schemas?
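
As one illustration of the last question, a data statement could be kept as a structured record and carried alongside a dataset's existing metadata. The following is a minimal Python sketch under stated assumptions, not a recommended format: the field names follow the section headings of the schema in Bender & Friedman (2018), while the `DataStatement` class, the `embed_in_metadata` helper, and the `data_statement` key are hypothetical names chosen for this example and are not part of the CLARIN or META-SHARE schemas.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class DataStatement:
    """Hypothetical container: one free-text field per schema section
    heading in Bender & Friedman (2018)."""
    curation_rationale: str = ""
    language_variety: str = ""
    speaker_demographic: str = ""
    annotator_demographic: str = ""
    speech_situation: str = ""
    text_characteristics: str = ""
    recording_quality: str = ""
    other: str = ""
    provenance_appendix: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, text in asdict(self).items() if not text.strip()]


def embed_in_metadata(metadata: Dict, statement: DataStatement) -> Dict:
    """Return a copy of an existing metadata record with the data statement
    attached under an assumed 'data_statement' key."""
    merged = dict(metadata)
    merged["data_statement"] = asdict(statement)
    return merged


# Placeholder text only; a real statement would contain prose descriptions.
statement = DataStatement(
    curation_rationale="Example text: why these texts were selected.",
    language_variety="Example text: language tag plus a prose description.",
)
record = embed_in_metadata({"title": "Example corpus"}, statement)

print(json.dumps(record, indent=2, ensure_ascii=False))
print("Sections still to write:", statement.missing_sections())
```

Keeping each section as free text leaves room for the prose descriptions that data statements call for, while the `missing_sections` check gives a simple way to see which sections of a draft are still unwritten.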

To ensure that the best practices developed are as broadly applicable as possible, we especially encourage participation from developers of datasets for low-resource languages and/or dataset developers from countries not well represented at major NLP conferences.

Financial Support

In order for these best practices to be responsive to the needs of researchers around the world, and not just those in the most well-resourced communities, it is critical that they be designed with a broad range of input. We have already secured funding to bring two invited participants from underrepresented communities to LREC to participate in this workshop plus the main conference, and are currently seeking additional funding. To be considered for this support, please email the following information to ebender-at-uw.edu by January 15, 2020:

Name, country of residence, affiliation
A brief description of a current or near-future dataset creation project you are involved with for which you’d like to work on a data statement at the workshop (including language(s) in the dataset, intended use case, a brief description of any annotations provided, and other details you would like to share)
In what ways would your participation in this workshop broaden the perspectives likely to be represented at our workshop and at LREC?

Workshop Organizers

Emily M. Bender, University of Washington, Department of Linguistics

Batya Friedman, University of Washington, Information School

Angelina McMillan-Major, University of Washington, Department of Linguistics

Sample Data Statements

This page hosts links to data statements developed during our workshop. Please browse these data statements for ideas about how to create your own as well as information about the resources they document.

Corpus of Basque Simplified Texts: Data Statement; Data Set
Fon-French Neural Machine Translation: Data Statement
GV-Yorùbá-NER: Data Statement; Data Set
Ilhan Omar Islamophobia Data Set: Data Statement
Mauritian ‘Sirandanes’ and Proverbs: Data Statement
MuST-SHE: Data Statement; Data Set
Public DGS Corpus: Data Statement; Data Set
Royal Society Corpus Version 4.0: Data Statement; Data Set
STEM-ECR Corpus: Data Statement; Data Set

Here is the worksheet we created for workshop participants to use in making their data statements:

As a Google Doc
In markdown (converted by Leon Derczynski)
