Digital Humanities Research

Provides resources for researching topics, tools, and organizations in the Digital Humanities.

Data Sources for Text Analytics - Audience

This page includes example data sources in the humanities that may be of interest to participants enrolled in the Winter 2021 DH workshop Python for Humanists as well as others looking for textual data to use for DH projects.

UVA Library Data Services

Research Data Services + Sciences: library.virginia.edu/data

Data Discovery and Access: https://library.virginia.edu/data/datasources/

Our licensed data sources of possible interest include the Burney Collection Newspapers: 17th and 18th Century and The Times Literary Supplement Archive files 1902-2011.

Demo Corpora (Text Collections Ready for Use)

This wiki page, maintained by Alan Liu, includes links to demo corpora, which are “sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.”

http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets

HathiTrust

Find text data from public domain works available for bulk download from HathiTrust at https://www.hathitrust.org/datasets.

The HathiTrust Research Center Extracted Features Dataset (2.0) includes metadata and data elements extracted from over 17 million volumes in HathiTrust, including materials under copyright.

Text Creation Partnership (TCP)

The Text Creation Partnership (TCP) has a GitHub repository for its data collections, available at https://github.com/Text-Creation-Partnership.

Data Collections include:

Early English Books Online (EEBO) - Navigations: https://github.com/Text-Creation-Partnership/EEBO-TCP-Collections-Navigations. A project funded by the National Endowment for the Humanities to select, key, and encode EEBO-TCP texts related to the theme of travel and navigation.
Evans Early American Imprints: https://github.com/Text-Creation-Partnership/Evans-TCP
Eighteenth Century Collections Online (ECCO): Over 2,000 texts made available by the ECCO Text Creation Partnership.
https://github.com/Text-Creation-Partnership/ECCO-TCP (in TEI)
https://old.datahub.io/dataset/tcp-ecco-18th-century-texts (in plain text)

Read more about the TCP and its collections, and access raw files at https://textcreationpartnership.org/faq/
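
For readers who want to work with the TEI versions rather than the plain-text ones, the following is a minimal sketch of pulling paragraph text out of a TEI XML document using only the Python standard library. It assumes the file declares the standard TEI namespace (http://www.tei-c.org/ns/1.0) and stores its prose in <p> elements inside <text>/<body>; the sample document and the function name tei_paragraphs are made up for illustration, so check an actual TCP file before relying on this structure.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the standard TEI namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# A tiny invented TEI document standing in for a real TCP file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>A Sample Voyage</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>We set sail at <hi>dawn</hi>.</p>
    <p>The wind was fair.</p>
  </body></text>
</TEI>"""


def tei_paragraphs(xml_string):
    """Return the text of each <p> in the TEI body, with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return []
    paragraphs = []
    for p in body.iter(f"{{{TEI_URI}}}p"):
        # itertext() also collects text nested in child tags such as <hi>.
        raw = "".join(p.itertext())
        paragraphs.append(" ".join(raw.split()))
    return paragraphs


title = ET.fromstring(sample).find(".//tei:titleStmt/tei:title", TEI_NS).text
paragraphs = tei_paragraphs(sample)
```

To use this on a downloaded file, read the file's contents into a string and pass it to tei_paragraphs. Real TCP texts contain many more element types (headings, verse lines, notes), so a project would likely need to extend the traversal.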

British Library

Find the British Library's guide to datasets for content mining at https://www.bl.uk/collection-guides/datasets-for-content-mining#.

Includes:

British Library Digital Collections and Data: https://data.bl.uk/, "a creative 'space' developed by the BL Labs team for researchers to download large 'chunks' of the British Library's openly available data and digital collections so that they can experiment with them and develop new innovative projects."
BL Datasets: https://data.bl.uk/bl_labs_datasets/. As of 08/09/20, there are 150 datasets available from the British Library. Collections include:
- Asian and African department (AAS) Card Catalogues (27 datasets)
- C M Taylor Keylogging Data (8 datasets)
- Digitised printed books (18th-19th century) (28 datasets)
- Digitised Hebrew Manuscripts (22 datasets)
- Ground Truth Transcriptions (3 datasets)
- India Office Medical Archives (3 datasets)
- Italian Academies (2 datasets)
- Judicial Committee of the Privy Council: Linked Appeals Data (1 dataset)
- Linked Open British National Bibliography (3 datasets)
- Maps, plans and topographical views (1 dataset)
- Pelagios Project (7 datasets)
- Quarterly Lists (2 datasets)
- "Single Sheet" thematic collections (8 datasets)
- SherlockNet (7 datasets)
- UK Web Archive (5 datasets)
- UK Doctoral Thesis Metadata from ETHOS (4 datasets)
- 3D representative models (21 datasets)
19th Century Printed Books: https://data.bl.uk/digbks/ includes over 60K volumes published between 1789 and 1900. Subjects range from philosophy and history to literature and poetry.

Library of Congress Data

LC for Robots: https://labs.loc.gov/lc-for-robots/. "Explore the many ways the Library of Congress provides machine-readable access to its digital collections."

APIs:

Loc.gov JSON API
Chronicling America API
American Archive of Public Broadcasting APIs
World Digital Library APIs

Bulk data:

Bulk data for Congress.gov bills, bill status, and bill summaries
MARC records - bibliographic information for most of the Library’s collections. 25 million records are available for exploration in UTF-8, MARC8, and XML formats.
Sample MARC data set and ReadMe file
Chronicling America Bulk OCR Data – text only
Chronicling America Bulk Data – image, metadata, and OCR text batches
Dot Gov Datasets – audio, pdfs, and tabular data from .gov domains
Web Cultures Datasets – memes and gifs from the American Folklife Center's Web Cultures Web Archive

U.S. Government Data

Congress.gov: Bill Status Bulk Data

Govinfo.gov: Bulk Data Repository. Includes multiple datasets:

Congressional Bills
Bill Status
Bill Summaries
Commerce Business Daily
Code of Federal Regulations (Annual Edition)
Electronic Code of Federal Regulations
Federal Register
United States Government Manual
House Rules and Manual
Privacy Act Issuances
Public and Private Laws
Public Papers of the Presidents of the United States
Supreme Court Decisions 1937-1975
Statutes at Large

GovInfo.gov - Featured Content. Browse interesting sources at GovInfo.gov, such as Presidential Inaugural Addresses and a collection of documents in memory of Ruth Bader Ginsburg. Also browse the complete (or at least the born-digital) U.S. federal government document collection by category.

Supreme Court: Oral Argument Transcripts, Opinions of the Court.

Department of Justice: News API.

Data.gov: Listing of APIs. Note: This list seems incomplete.

More Collections of Textual Datasets

Modern English Collection: Public domain texts digitized by the UVA library. Browse titles: https://web.archive.org/web/20010201164600/http://etext.lib.virginia.edu:80/modeng/modeng0.browse.html. Contact Chris Ruotolo if you are interested in downloading texts in this collection.

Documenting the Now: https://catalog.docnow.io/. "The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application."

Project Gutenberg: https://www.gutenberg.org/. Free ebooks in the public domain. Not just fiction. Browse Bookshelf for categories. Some Project Gutenberg texts are available as part of the Natural Language Toolkit (NLTK); see the free online NLTK Book for how to use Python and NLTK to access the included Gutenberg corpus as well as other corpora, including the Brown Corpus, the Reuters Corpus, and the Inaugural Address Corpus. A Python library for interfacing with Project Gutenberg is available at https://pypi.org/project/Gutenberg/, and many online tutorials and posts explain how to work with Project Gutenberg texts.
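
If you download a plain-text file directly from Project Gutenberg rather than going through NLTK, the book text is usually wrapped in license and header material. The sketch below shows one way to trim that material and count words using only the Python standard library. It assumes the file uses the "*** START OF THE PROJECT GUTENBERG EBOOK" and "*** END OF THE PROJECT GUTENBERG EBOOK" marker lines; older files may word these differently. The function names and the sample text are invented for illustration.

```python
import re
from collections import Counter

# Assumption: marker lines follow the common "*** START OF ..." pattern.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END marker lines."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# An invented miniature file standing in for a real download.
sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael. Call me again.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)

body = strip_gutenberg_boilerplate(sample)
counts = word_counts(body)
```

For a real file, read it with open(path, encoding="utf-8") and pass the contents to strip_gutenberg_boilerplate. The tokenizer here is deliberately simple; NLTK offers more careful tokenization.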

Ripper Press Reports Dataset: https://digitalhumanities.wlu.edu/blog/2016/12/12/ripper-dataset/. Created by Brandon Walsh, UVA Scholars’ Lab. "The dataset features the full texts of 2677 newspaper articles between the years of 1844 and 1988 that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are, in fact, contemporary to the murders, a handful of them skew closer to the present as press reports for contemporary crimes look back to the infamous case. The wide variety of sources available here gives a sense of how the coverage of the case differed by region, date, and publication."

Data is Plural: Data is Plural - Structured Archive. A weekly newsletter highlighting interesting datasets. Most are numeric or categorical datasets, but it does list some textual datasets, which are highlighted here:

Airplane confidential. (Text narratives of flight safety, from NASA.)
Congressional Research Service, in bulk.
Drug patents and exclusivity, from the FDA
One million comic book panels.
Supreme Court Transcripts, this time from Oyez.org
Xkcd transcripts
Chyrons
Index Thomisticus
An obviously perfect dataset (about sarcasm)
The State of the State of the States (State of the State addresses given by governors)
Foreign lobbyists
Drama. (Drama Corpora Project - 800 plays in different languages)
Euro-bank speeches
Environmental treaties
EU laws
Coronavirus research papers
Pandemic-era economic policies
Privacy policies
Six million parliamentary speeches (from 9 countries)
Poems by kids
Police violence at the BLM protests
Tech’s BLM statements
The Green Books
New policing bills
House work (House of Representatives Job and Internship Announcements)

Data Repositories

Dataverse: allows scholars to share, publish, and archive their data, as well as find and cite data across all research fields.

UVA's Dataverse: https://dataverse.lib.virginia.edu/
Harvard's Dataverse: https://dataverse.harvard.edu/. Includes data from many institutions.

Wikidata: https://www.wikidata.org/

What is an API?

Python API Tutorial: Getting Started with APIs
This article from Dataquest is helpful for understanding what an API is and how to use a very simple API in Python.
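
As a rough illustration of what calling an API from Python looks like, the sketch below builds a search request for the Loc.gov JSON API listed above and pulls titles out of the response. It uses only the standard library. The endpoint https://www.loc.gov/search/ and the parameters q (query), fo=json (response format), and c (results per page) are assumptions based on the Library of Congress API documentation and should be confirmed there; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the general search endpoint of the Loc.gov JSON API.
BASE_URL = "https://www.loc.gov/search/"


def build_search_url(query, count=5):
    """Build a search URL that asks for JSON instead of an HTML page."""
    params = {"q": query, "fo": "json", "c": count}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON response (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def result_titles(payload):
    """Extract titles, assuming the response holds a 'results' list of items."""
    return [item.get("title", "") for item in payload.get("results", [])]


url = build_search_url("walt whitman", count=3)
```

With a network connection, result_titles(fetch_json(url)) would return the titles of the first few matches. Most APIs ask users to limit how often they send requests, so check the provider's guidelines before collecting data in bulk.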