The Center for Digital Humanities at Princeton University has recently received a grant from the US-based National Endowment for the Humanities to train humanities scholars in the development of natural language processing (NLP) tools for lesser-resourced languages.
The “New Languages for NLP” workshop series will be hosted at Princeton in 2021 and 2022 in cooperation with the Digital Research Infrastructure for Arts and Humanities (DARIAH) and the Library of Congress LC Labs.
“Humanities scholars don’t only care about patterns and frequencies; we also care about what is unique, unusual and strange,” said DARIAH Director Toma Tasovac. “But discovering either — mediocrity or weirdness — in textual corpora is much more difficult if you don’t have access to NLP tools for the particular language variety you’re working on, which, from the outset, puts some scholars — and some languages — at a great disadvantage.”
Participants will work over the course of a year — between June 2021 and May 2022 — and will meet for three intensive workshops where they will learn how to annotate linguistic data and train statistical language models using cutting-edge NLP tools.
They will also learn best practices in project and research data management and join discussions with leaders in the fields of multilingual NLP and digital humanities (DH). Along the way, they will advance their own research projects by creating, applying and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP; a sketch of the annotation-and-training workflow appears below.
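To make that annotation-and-training workflow concrete, here is a minimal sketch using spaCy, one widely used open-source NLP library. The announcement does not name specific tools, so spaCy, the "xx" language code and the toy data below are illustrative assumptions rather than the project's actual setup.

```python
# Minimal sketch of training a part-of-speech tagger for a new
# language with spaCy v3. spaCy, the "xx" code and the toy data
# are assumptions for illustration; the project does not
# prescribe a specific toolkit.
import spacy
from spacy.training import Example

nlp = spacy.blank("xx")   # "xx": spaCy's generic multi-language class,
                          # the usual starting point when no language-
                          # specific pipeline exists
nlp.add_pipe("tagger")    # add a trainable part-of-speech tagger

# Hand-annotated examples: raw text plus gold POS tags. In practice
# these would be sentences in the target language, annotated by the
# participating scholars.
TRAIN_DATA = [
    ("dogs bark loudly", {"tags": ["NOUN", "VERB", "ADV"]}),
    ("cats sleep quietly", {"tags": ["NOUN", "VERB", "ADV"]}),
]

examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() infers the tag set from the examples and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(20):
    losses = {}
    for example in examples:
        nlp.update([example], sgd=optimizer, losses=losses)

# The trained pipeline can now tag unseen text.
doc = nlp("dogs sleep loudly")
print([(token.text, token.tag_) for token in doc])
```

In practice the same loop would run over a much larger hand-annotated corpus, with evaluation on held-out sentences guiding where further annotation effort is most needed.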

“In addition to helping a number of researchers create and adopt NLP tools, we will be creating a body of knowledge and learning resources to share with the wider community via our educational portal DARIAH-Campus,” Tasovac said. “Knowledge is brittle and, especially in this age of information overload, it is essential that we capture it rather than let it get buried under an avalanche of digital noise. As a research infrastructure, we’re delighted that we can work together with Princeton on delivering, preserving and sustaining the outputs of this project.”
NLP has revolutionized our ability to analyze texts at scale. However, of the world’s more than 7,500 languages, major NLP resources support only 85. While large linguistic datasets exist for high-resource languages such as English or German, text mining, topic modeling and other methods of computational text analysis remain unavailable for the vast majority of languages, especially those that are minority, regional or endangered.
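The practical consequence of that gap is easy to demonstrate. In the hypothetical spaCy example below (the library and the languages chosen are assumptions, not details from the announcement), English comes with downloadable trained pipelines, while a language like Yoruba starts from a blank tokenizer with every component still to be built.

```python
# Illustration of the resource gap. spaCy and the languages chosen
# are assumptions made for the sake of example.
import spacy

# English: multiple pre-trained pipelines exist and can be downloaded,
# e.g. via `python -m spacy download en_core_web_sm`.
nlp_en = spacy.load("en_core_web_sm")
print(nlp_en.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'ner', ...]

# Yoruba: no trained pipeline ships with spaCy; spacy.blank() provides
# tokenization rules only, with no tagger, parser or entity recognizer.
nlp_yo = spacy.blank("yo")
print(nlp_yo.pipe_names)  # [] -- every component must be trained from scratch
```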
“Humanities scholars are rightly suspicious of so-called ‘black box’ tools — the kind that work ‘automagically’ but conceal their own methodology,” Tasovac said. “I very much hope that the participants will come out of our workshops not only with useful models for their own work, but with a better understanding of how NLP tools work, what their advantages are, as well as their limitations. I also hope that we’ll inspire more cooperation between humanities and NLP scholars in general. We would all have something to gain from that.”
A Call for Proposals will be published in early November. For more information, visit the project website.