Excel
OpenRefine: http://openrefine.org
Voyant Tools: https://voyant-tools.org/. "A web-based reading and analysis environment for digital texts."
Scripting Languages (R, Python, etc.): Scholars' Lab or Research Data Services can advise.
This page lists example humanities data sources that may interest participants in the Winter 2021 DH workshop "Python for Humanists," as well as others looking for textual data to use in DH projects.
Research Data Services + Sciences: library.virginia.edu/data
Data Discovery and Access: https://library.virginia.edu/data/datasources/
Our licensed data sources of possible interest include the Burney Collection Newspapers: 17th and 18th Century and The Times Literary Supplement Archive files 1902-2011.
This wiki page, maintained by Alan Liu, includes links to demo corpora, which are “sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.”
http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets
Find text data from public domain works available for bulk download from HathiTrust at https://www.hathitrust.org/datasets.
The HathiTrust Research Center Extracted Features Dataset (2.0) includes metadata and data elements extracted from over 17 million volumes in HathiTrust, including materials under copyright.
The Text Creation Partnership (TCP) has a GitHub repository for its data collections at https://github.com/Text-Creation-Partnership.
Data Collections include:
Read more about the TCP and its collections, and access raw files at https://textcreationpartnership.org/faq/
Find the guide for British Library Content for Data mining at https://www.bl.uk/collection-guides/datasets-for-content-mining#.
Includes:
LC for Robots: https://labs.loc.gov/lc-for-robots/. "Explore the many ways the Library of Congress provides machine-readable access to its digital collections."
APIs:
Bulk data:
Congress.gov: Bill Status Bulk Data
Govinfo.gov: Bulk Data Repository. Includes multiple datasets:
GovInfo.gov - Featured Content. Browse interesting sources at GovInfo.gov, such as Presidential Inaugural Addresses and a collection of documents in memory of Ruth Bader Ginsburg. You can also browse the complete U.S. federal government document collection (at least the born-digital portion) by category.
Supreme Court: Oral Argument Transcripts, Opinions of the Court.
Department of Justice: News API.
Data.gov: Listing of APIs. Note: This list seems incomplete.
Modern English Collection: Public domain texts digitized by the UVA library. Browse titles: https://web.archive.org/web/20010201164600/http://etext.lib.virginia.edu:80/modeng/modeng0.browse.html. Contact Chris Ruotolo if you are interested in downloading texts in this collection.
Documenting the Now: https://catalog.docnow.io/. "The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application."
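Before hydrating a dataset, it can help to tidy the list of Tweet IDs. The sketch below is a minimal illustration only: it assumes the IDs arrive as plain text with one ID per line, and the sample IDs, function names, and batch size are hypothetical. It does not perform hydration, which is left to the Hydrator application mentioned above.

```python
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Return unique, purely numeric tweet IDs in their original order."""
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        # Skip blank lines, non-numeric lines, and IDs already seen.
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample input; a real file would be opened with open(path).
raw_lines = ["1001\n", "1002\n", "\n", "1001\n", "not-an-id\n", "1003\n"]
tweet_ids = read_tweet_ids(raw_lines)
id_batches = list(batches(tweet_ids, size=2))  # batch size chosen only for the demo
print(id_batches)  # [['1001', '1002'], ['1003']]
```

IDs are kept as strings so that long identifiers are never altered by numeric conversion.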
Project Gutenberg: https://www.gutenberg.org/. Free ebooks in the public domain, and not just fiction; browse the Bookshelf for categories. Some Project Gutenberg texts are available as part of the Natural Language Toolkit (NLTK). See the free online NLTK Book for how to use Python and NLTK to access the included Gutenberg corpus, as well as other corpora including the Brown Corpus, the Reuters Corpus, and the Inaugural Address Corpus. You can find a Python library for interfacing with Project Gutenberg at https://pypi.org/project/Gutenberg/ (many tutorials and posts online show how to interact with Project Gutenberg).
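Plain-text files downloaded directly from Project Gutenberg wrap the work itself in header and license text, which is usually removed before analysis. The sketch below uses only the Python standard library rather than NLTK or the Gutenberg package. It assumes the file contains the usual "*** START OF ..." and "*** END OF ..." marker lines, whose exact wording has varied over time; the function names and sample text are hypothetical.

```python
import re
from collections import Counter

# Marker lines that surround the body of a Project Gutenberg text file.
# The patterns are deliberately loose because the wording is an assumption.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text.strip()  # markers missing: return the text unchanged


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample standing in for a downloaded file.
SAMPLE = """The Project Gutenberg eBook of A Hypothetical Sample

*** START OF THE PROJECT GUTENBERG EBOOK A HYPOTHETICAL SAMPLE ***

It was the best of times, it was the worst of times.

*** END OF THE PROJECT GUTENBERG EBOOK A HYPOTHETICAL SAMPLE ***
License text would follow here.
"""

body = strip_gutenberg_boilerplate(SAMPLE)
counts = word_counts(body)
print(body)
print(counts["times"])
```

The regular-expression tokenizer is intentionally crude; NLTK's tokenizers handle punctuation and contractions more carefully.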
Ripper Press Reports Dataset: https://digitalhumanities.wlu.edu/blog/2016/12/12/ripper-dataset/. Created by Brandon Walsh, UVA Scholars’ Lab. "The dataset features the full texts of 2677 newspaper articles between the years of 1844 and 1988 that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are, in fact, contemporary to the murders, a handful of them skew closer to the present as press reports for contemporary crimes look back to the infamous case. The wide variety of sources available here gives a sense of how the coverage of the case differed by region, date, and publication."
Data is Plural: Data is Plural - Structured Archive. A weekly newsletter highlighting interesting datasets. Most are numeric or categorical datasets, but it does list some textual datasets, which are highlighted here:
Dataverse: allows scholars to share, publish, and archive their data, as well as find and cite data across all research fields.
Wikidata: https://www.wikidata.org/
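Wikidata can also be queried programmatically through its public SPARQL endpoint. The sketch below is one possible way to build and send such a request with the Python standard library. The example query, the function names, and the User-Agent string are illustrative assumptions, and the network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Example query: works whose author (P50) is Douglas Adams (Q42).
QUERY = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q42 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_query_url(sparql: str) -> str:
    """Encode a SPARQL query as a GET URL that asks for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def run_query(sparql: str) -> list:
    """Send the query and return the list of result bindings (needs network access)."""
    request = Request(
        build_query_url(sparql),
        # Hypothetical identifier; replace it with details of your own project.
        headers={"User-Agent": "dh-workshop-example/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        data = json.load(response)
    return data["results"]["bindings"]


url = build_query_url(QUERY)
print(url)
```

Calling run_query(QUERY) should return a list of dictionaries, one per result row, each keyed by the variable names used in the SELECT clause.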