Data Sources for Text Analytics

Last updated 2021; compiled by Logan Heiman

Research Data Services -- Data Discovery and Access: http://data.library.virginia.edu/datasources/

Collections of Textual Datasets

HathiTrust

Datasets -- text data from public domain works, available for bulk download: https://www.hathitrust.org/datasets
- Featured data sets: https://babel.hathitrust.org/cgi/mb?colltype=featured

Extracted features data set -- metadata and data elements extracted from volumes in HathiTrust, including materials under copyright: https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329

Text Creation Partnership (TCP) https://github.com/Text-Creation-Partnership

GitHub repository for the data collections made available through the Text Creation Partnership (TCP), including EEBO (Early English Books Online) Phase I, Evans Early American Imprints, and ECCO (Eighteenth Century Collections Online).

Early English Books Online - Navigations
https://github.com/Text-Creation-Partnership/EEBO-TCP-Collections-Navigations
A project funded by the National Endowment for the Humanities to select, key, and encode EEBO-TCP texts related to the theme of travel and navigation.

Evans Early American Imprints:
https://github.com/Text-Creation-Partnership/Evans-TCP

Eighteenth Century Collections Online:

https://github.com/Text-Creation-Partnership/ECCO-TCP (in TEI)

https://old.datahub.io/dataset/tcp-ecco-18th-century-texts (in plain text)

Over 2,000 texts made available by the ECCO Text Creation Partnership
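
The TEI versions of the TCP texts are XML, so most text-analysis tools need the markup stripped first. A minimal sketch of one way to do that in Python follows. It assumes the files use the TEI P5 namespace (http://www.tei-c.org/ns/1.0); the sample document and the function name tei_body_text are invented for illustration and are not taken from any TCP file.

```python
import xml.etree.ElementTree as ET

# Assumption: the file declares the TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny made-up sample standing in for a real TCP file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Chapter I</head><p>It was a <hi>dark</hi> night.</p></div>
  </body></text>
</TEI>"""


def tei_body_text(xml_string):
    """Return the text inside <body> as one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields the text of every descendant element in order.
    # Joining with a space keeps headings and paragraphs apart, but it
    # can also split a word that has inline markup in the middle of it.
    return " ".join(" ".join(body.itertext()).split())


plain = tei_body_text(SAMPLE_TEI)
print(plain)
```

To process a downloaded file rather than a string, the same function could be given the contents of the file read with open(path, encoding="utf-8").read(). The header (teiHeader) is skipped on purpose so that catalog metadata does not end up in word counts.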

British Library Content for Data Mining https://www.bl.uk/collection-guides/datasets-for-content-mining#

British Library Digital Collections and Data: https://data.bl.uk/

A creative 'space' developed by the BL Labs team for researchers to download large 'chunks' of the British Library's openly available data and digital collections so that they can experiment with them and develop new innovative projects.

BL Datasets: https://data.bl.uk/bl_labs_datasets/
There are 150 datasets available (as of 08/09/2020) for you to experiment with from the British Library's research repository. Collections include:
- Asian and African department (AAS) Card Catalogues (27 datasets)
- C M Taylor Keylogging Data (8 datasets)
- Digitised printed books (18th-19th century) (28 datasets)
- Digitised Hebrew Manuscripts (22 datasets)
- Ground Truth Transcriptions (3 datasets)
- India Office Medical Archives (3 datasets)
- Italian Academies (2 datasets)
- Judicial Committee of the Privy Council: Linked Appeals Data (1 dataset)
- Linked Open British National Bibliography (3 datasets)
- Maps, plans and topographical views (1 dataset)
- Pelagios Project (7 datasets)
- Quarterly Lists (2 datasets)
- "Single Sheet" thematic collections (8 datasets)
- SherlockNet (7 datasets)
- UK Web Archive (5 datasets)
- UK Doctoral Thesis Metadata from ETHOS (4 datasets)
- 3D representative models (21 datasets)

19th Century Printed Books https://data.bl.uk/digbks/
60K+ digitised volumes (around 25 million pages) published between 1789 and 1900 cover a wide range of subject areas including philosophy, history, poetry and literature.

Modern English Collection:

Public domain texts digitized by the UVA Library

Download texts: https://github.com/cruotolo/modern_english/blob/main/modern_english.zip
Browse titles: https://web.archive.org/web/20010201164600/http://etext.lib.virginia.edu:80/modeng/modeng0.browse.html

Documenting the Now

https://catalog.docnow.io/

“The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application.”

Project Gutenberg

https://www.gutenberg.org/

Free ebooks in the public domain. Not just fiction. Browse Bookshelf for categories.

US Government data

Congress.gov - Bill Status Bulk Data.
Govinfo.gov - Bulk Data Repository - multiple datasets:
- Congressional Bills
- Bill Status
- Bill Summaries
- Commerce Business Daily
- Code of Federal Regulations (Annual Edition)
- Electronic Code of Federal Regulations
- Federal Register
- United States Government Manual
- House Rules and Manual
- Privacy Act Issuances
- Public and Private Laws
- Public Papers of the Presidents of the United States
- Supreme Court Decisions 1937-1975
- Statutes at Large

GovInfo.gov - Featured Content. Browse interesting sources at GovInfo.gov, such as Presidential Inaugural Addresses and a collection of documents in memory of Ruth Bader Ginsburg. Also browse the complete (at least, born-digital and some digitized) US federal government document collection by category.

Supreme Court: Oral Argument Transcripts, Opinions of the Court.

Department of Justice News API.

Data.gov listing of APIs. (This list seems incomplete.)

Library of Congress data

LC for Robots.
APIs:
- Loc.gov JSON API
- Chronicling America API
- American Archive of Public Broadcasting APIs
- World Digital Library APIs
Bulk data:
- Bulk data for Congress.gov bills, bill status, and bill summaries
- MARC records - bibliographic information for most of the Library’s collections. 25 million records are available for exploration in UTF-8, MARC8, and XML formats.
- Sample MARC data set and ReadMe file
- Chronicling America Bulk OCR Data – text only
- Chronicling America Bulk Data – image, metadata, and OCR text batches
- Dot Gov Datasets – audio, pdfs, and tabular data from .gov domains
- Web Cultures Datasets – memes and gifs from the American Folklife Center's Web Cultures Web Archive
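
The Loc.gov JSON API listed above is queried through ordinary loc.gov URLs with a format parameter added. The sketch below only builds such a URL and does not contact the server. The parameter names (q for the query, fo=json for JSON output, sp for the page number) reflect the public API as I understand it and should be checked against the current LC documentation; the function name loc_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Base search endpoint on loc.gov; adding fo=json asks for JSON output.
LOC_SEARCH = "https://www.loc.gov/search/"


def loc_search_url(query, page=1):
    """Build a loc.gov search URL that requests JSON results.

    Assumed parameters: q (search terms), fo (output format),
    sp (results page number).
    """
    params = {"q": query, "fo": "json", "sp": page}
    return LOC_SEARCH + "?" + urlencode(params)


url = loc_search_url("whitechapel murders")
print(url)
```

The resulting URL can be opened in a browser to inspect the response, or fetched with urllib.request.urlopen and parsed with json.load. When paging through many results, pause between requests so as not to overload the service.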

DH Toychest

http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets

Includes demo corpora, which are “sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.”

Ripper Press Reports Dataset

https://digitalhumanities.wlu.edu/blog/2016/12/12/ripper-dataset/

Created by Brandon Walsh (UVA Scholars’ Lab). The dataset features the full texts of 2677 newspaper articles, published between 1844 and 1988, that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are, in fact, contemporary to the murders, a handful of them skew closer to the present as press reports for contemporary crimes look back to the infamous case. The wide variety of sources available here gives a sense of how the coverage of the case differed by region, date, and publication.

Data is Plural

Data is Plural archive. A weekly newsletter highlighting interesting datasets. Most are numeric or categorical datasets, but it does list some textual datasets, which are highlighted here:

Airplane confidential. (Text narratives of flight safety, from NASA.)
Congressional Research Service, in bulk.
Drug patents and exclusivity, from the FDA
One million comic book panels.
Supreme Court Transcripts, this time from Oyez.org
Xkcd transcripts
Chyrons
Index Thomisticus
An obviously perfect dataset (about sarcasm)
The State of the State of the States (State of the State addresses given by governors)
Foreign lobbyists
Drama. (Drama Corpora Project - 800 plays in different languages)
Euro-bank speeches
Environmental treaties
EU laws
Coronavirus research papers
Pandemic-era economic policies
Privacy policies
Six million parliamentary speeches (from 9 countries)
Poems by kids
Police violence at the BLM protests
Tech’s BLM statements
The Green Books
New policing bills
House work (House of Representatives Job and Internship Announcements)

Data Repositories

Dataverse: allows scholars to share, publish, and archive their data, as well as find and cite data across all research fields

https://dataverse.lib.virginia.edu/ (UVA’s Dataverse)

https://dataverse.harvard.edu/ (Harvard’s Dataverse; includes data from many institutions)

Wikidata

Tools for Cleaning and Manipulating Your Data

Excel
OpenRefine: http://openrefine.org/
Voyant Tools: https://voyant-tools.org/

“A web-based reading and analysis environment for digital texts”

Scripting languages (R, Python, etc.) -- Scholars’ Lab or Research Data Services can advise
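
As a rough illustration of the kind of cleaning and counting a scripting language makes possible, here is a minimal Python sketch that lowercases a text, strips punctuation, and counts word frequencies. The sample sentence and the function name word_counts are made up for the example; real projects usually need more careful tokenization (hyphens, accented characters, OCR errors).

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, keep runs of letters and apostrophes, count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


sample = "The quick brown fox. The lazy dog; the END."
counts = word_counts(sample)

# Show the three most frequent words with their counts.
print(counts.most_common(3))
```

The same function could be applied to a plain-text file from any of the collections above by reading it with open(path, encoding="utf-8").read() and passing the result to word_counts.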