
Digital Humanities Research

Provides resources for researching topics, tools, and organizations in the Digital Humanities.

This page includes example data sources in the humanities that may be of interest to participants enrolled in the Winter 2021 DH workshop Python for Humanists as well as others looking for textual data to use for DH projects.

UVA Library Data Services

Research Data Services + Sciences: data.library.virginia.edu/

Data Discovery and Access: https://data.library.virginia.edu/datasources/

Our licensed data sources of possible interest include the Burney Collection Newspapers: 17th and 18th Century and The Times Literary Supplement Archive files 1902-2011. 

Demo Corpora (Text Collections Ready for Use)

This wiki page, maintained by Alan Liu, includes links to demo corpora, which are “sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.”  

http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets

HathiTrust


Find text data from public domain works available for bulk download from HathiTrust at https://www.hathitrust.org/datasets.

The HathiTrust Research Center Extracted Features Dataset (2.0) includes metadata and data elements extracted from over 17 million volumes in HathiTrust, including materials under copyright. 

Text Creation Partnership (TCP)

The Text Creation Partnership (TCP) has a GitHub repository for its data collections, available at https://github.com/Text-Creation-Partnership.

 

 

Data Collections include: 

Read more about the TCP and its collections, and access raw files at https://textcreationpartnership.org/faq/ 

British Library

Find the British Library's guide to datasets for content mining at https://www.bl.uk/collection-guides/datasets-for-content-mining#

 

Includes:

  • British Library Digital Collections and Data: https://data.bl.uk/, "a creative 'space' developed by the BL Labs team for researchers to download large 'chunks' of the British Library's openly available data and digital collections so that they can experiment with them and develop new innovative projects."
  • BL Datasets: https://data.bl.uk/bl_labs_datasets/. As of 08/09/20, there are 150 datasets available from the British Library. Collections include:
    • Asian and African department (AAS) Card Catalogues (27 datasets)
    • C M Taylor Keylogging Data (8 datasets)
    • Digitised printed books (18th-19th century) (28 datasets)
    • Digitised Hebrew Manuscripts (22 datasets)
    • Ground Truth Transcriptions (3 datasets)
    • India Office Medical Archives (3 datasets)
    • Italian Academies (2 datasets)
    • Judicial Committee of the Privy Council: Linked Appeals Data (1 dataset)
    • Linked Open British National Bibliography (3 datasets)
    • Maps, plans and topographical views (1 dataset)
    • Pelagios Project (7 datasets)
    • Quarterly Lists (2 datasets)
    • "Single Sheet" thematic collections (8 datasets)
    • SherlockNet (7 datasets)
    • UK Web Archive (5 datasets)
    • UK Doctoral Thesis Metadata from ETHOS (4 datasets)
    • 3D representative models (21 datasets)
  • 19th Century Printed Books: https://data.bl.uk/digbks/ includes over 60K volumes published between 1789 and 1900. Subjects range from philosophy and history to literature and poetry. 

Library of Congress Data


 

LC for Robots: https://labs.loc.gov/lc-for-robots/. "Explore the many ways the Library of Congress provides machine-readable access to its digital collections."

APIs:

  • Loc.gov JSON API
  • Chronicling America API
  • American Archive of Public Broadcasting APIs
  • World Digital Library APIs
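For workshop participants who want to see what machine-readable access can look like, here is a minimal Python sketch that builds a search URL for the loc.gov JSON API and reads item titles from a response. The q and fo=json query parameters follow the loc.gov API documentation; the function names, the sample response, and the "results" and "title" keys are assumptions for illustration, so check them against a real response before relying on them.

```python
import json
from urllib.parse import urlencode

# Search endpoint on loc.gov; adding fo=json asks for JSON instead of HTML.
BASE_URL = "https://www.loc.gov/search/"


def build_search_url(query):
    """Return a loc.gov search URL that requests JSON output."""
    params = {"q": query, "fo": "json"}
    return BASE_URL + "?" + urlencode(params)


def extract_titles(response_text):
    """Pull item titles from a JSON response body.

    Assumes a top-level "results" list whose items carry a "title" key.
    """
    data = json.loads(response_text)
    return [item.get("title", "") for item in data.get("results", [])]


# Hypothetical, hand-written response used only to show the parsing step.
sample_response = (
    '{"results": [{"title": "Example item one"}, '
    '{"title": "Example item two"}]}'
)

url = build_search_url("walt whitman")
titles = extract_titles(sample_response)
print(url)
print(titles)
```

To fetch live data, the URL could be passed to urllib.request.urlopen or a similar HTTP client; the sketch skips that step so it runs without a network connection.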

Bulk data:

  • Bulk data for Congress.gov bills, bill status, and bill summaries
  • MARC records - bibliographic information for most of the Library’s collections. 25 million records are available for exploration in UTF-8, MARC8, and XML formats.
  • Sample MARC data set and ReadMe file
  • Chronicling America Bulk OCR Data – text only
  • Chronicling America Bulk Data – image, metadata, and OCR text batches
  • Dot Gov Datasets – audio, pdfs, and tabular data from .gov domains
  • Web Cultures Datasets – memes and gifs from the American Folklife Center's Web Cultures Web Archive
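The MARC records above are offered in XML among other formats. As a rough sketch of how such a file might be read in Python, the example below extracts the title (MARC field 245, subfield a) from MARCXML using only the standard library. The sample record is invented for illustration and the function name is hypothetical; real bulk files are much larger and may be better handled with a streaming parser.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def get_titles(marcxml_text):
    """Return the 245$a title string from each record in a MARCXML document."""
    root = ET.fromstring(marcxml_text)
    titles = []
    for record in root.iter("{%s}record" % MARC_URI):
        subfield = record.find(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
        )
        if subfield is not None and subfield.text:
            titles.append(subfield.text.strip())
    return titles


# Invented sample record, not taken from the Library's data.
sample = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Leaves of grass /</subfield>
      <subfield code="c">Walt Whitman.</subfield>
    </datafield>
  </record>
</collection>"""

marc_titles = get_titles(sample)
print(marc_titles)
```

The trailing slash in the title is ordinary MARC punctuation, so some cleanup is usually needed before analysis.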

U.S. Government Data

Congress.gov: Bill Status Bulk Data

Govinfo.gov: Bulk Data Repository. Includes multiple datasets:

  • Congressional Bills
  • Bill Status
  • Bill Summaries
  • Commerce Business Daily
  • Code of Federal Regulations (Annual Edition)
  • Electronic Code of Federal Regulations
  • Federal Register
  • United States Government Manual
  • House Rules and Manual
  • Privacy Act Issuances
  • Public and Private Laws
  • Public Papers of the Presidents of the United States
  • Supreme Court Decisions 1937-1975
  • Statutes at Large

GovInfo.gov - Featured Content. Browse interesting sources at GovInfo.gov, such as Presidential Inaugural Addresses and a collection of documents in memory of Ruth Bader Ginsburg. Also browse the complete (at least, born digital) U.S. federal government document collection by category.

Supreme Court: Oral Argument Transcripts, Opinions of the Court.

Department of Justice: News API.

Data.gov: Listing of APIs. Note: This list seems incomplete.

More Collections of Textual Datasets

Modern English Collection: Public domain texts digitized by the UVA library. Browse titles: https://web.archive.org/web/20010201164600/http://etext.lib.virginia.edu:80/modeng/modeng0.browse.html. Contact Chris Ruotolo if you are interested in downloading texts in this collection. 

Documenting the Now: https://catalog.docnow.io/. "The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application."

Project Gutenberg: https://www.gutenberg.org/. Free ebooks in the public domain. Not just fiction. Browse Bookshelf for categories. Some Project Gutenberg texts are available as part of the Natural Language Toolkit (NLTK); see the free online NLTK Book for how to use Python and NLTK to access the included Gutenberg corpus as well as other corpora, including the Brown Corpus, the Reuters Corpus, and the Inaugural Address Corpus. You can find a library to interface with Project Gutenberg using Python at https://pypi.org/project/Gutenberg/ (there are also many online tutorials and posts on how to interact with Project Gutenberg).
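Plain-text files downloaded directly from Project Gutenberg usually wrap the work in license text. The Python sketch below, using only the standard library, trims that wrapper and counts word frequencies. It assumes the "*** START OF ..." and "*** END OF ..." marker lines commonly found in these files; the function names and the sample text are made up for illustration, and the tokenizer is deliberately crude compared with the ones described in the NLTK Book.

```python
import re
from collections import Counter

# Marker lines assumed to surround the body of a Gutenberg plain-text file.
START_MARKER = re.compile(r"^\*\*\* ?START OF .*?\*\*\*\s*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF .*?\*\*\*\s*$", re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample standing in for a downloaded file.
sample = """Header and license text
*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
The whale, the whale! The sea.
*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
Footer and license text"""

body = strip_boilerplate(sample)
counts = word_counts(body)
print(body)
print(counts.most_common(2))
```

If a file lacks the markers, the function falls back to returning the whole text, so it is worth spot-checking the output for leftover license paragraphs.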

Ripper Press Reports Dataset: https://digitalhumanities.wlu.edu/blog/2016/12/12/ripper-dataset/. Created by Brandon Walsh, UVA Scholars’ Lab. "The dataset features the full texts of 2677 newspaper articles between the years of 1844 and 1988 that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are, in fact, contemporary to the murders, a handful of them skew closer to the present as press reports for contemporary crimes look back to the infamous case. The wide variety of sources available here gives a sense of how the coverage of the case differed by region, date, and publication."

Data is Plural: Data is Plural - Structured Archive. A weekly newsletter highlighting interesting datasets. Most are numeric or categorical datasets, but it does list some textual datasets, which are highlighted here:

  • Airplane confidential. (Text narratives of flight safety, from NASA.)
  • Congressional Research Service, in bulk. 
  • Drug patents and exclusivity, from the FDA
  • One million comic book panels.  
  • Supreme Court Transcripts, this time from Oyez.org
  • Xkcd transcripts
  • Chyrons
  • Index Thomisticus
  • An obviously perfect dataset (about sarcasm)
  • The State of the State of the States (State of the State addresses given by governors)
  • Foreign lobbyists
  • Drama.  (Drama Corpora Project - 800 plays in different languages)
  • Euro-bank speeches
  • Environmental treaties
  • EU laws
  • Coronavirus research papers
  • Pandemic-era economic policies
  • Privacy policies
  • Six million parliamentary speeches (from 9 countries)
  • Poems by kids
  • Police violence at the BLM protests
  • Tech’s BLM statements
  • The Green Books
  • New policing bills
  • House work (House of Representatives Job and Internship Announcements)
     

Data Repositories

Dataverse: allows scholars to share, publish, and archive their data, as well as find and cite data across all research fields. 

Wikidata: https://www.wikidata.org/ 

What is an API?