Data Sources for Text Analytics

Last updated 2021; compiled by Logan Heiman

 

Research Data Services -- Data Discovery and Access: http://data.library.virginia.edu/datasources/
 

Collections of Textual Datasets

HathiTrust

 

 

Text Creation Partnership (TCP) https://github.com/Text-Creation-Partnership 

GitHub repository for the data collections made available through the Text Creation Partnership (TCP), including EEBO (Early English Books Online) Phase I, Evans Early American Imprints, and ECCO (Eighteenth Century Collections Online).

 

  • Eighteenth Century Collections Online:

https://github.com/Text-Creation-Partnership/ECCO-TCP (in TEI)

https://old.datahub.io/dataset/tcp-ecco-18th-century-texts (in plain text)

Over 2,000 texts made available by the ECCO Text Creation Partnership
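
For text analysis, TEI files usually need to be reduced to plain text first. The following is a minimal Python sketch, assuming the files are well-formed TEI XML with a header followed by a single text element; the sample document and the function names (local_name, tei_body_text) are illustrative and not taken from the TCP repositories. It ignores the XML namespace so that it works whether or not a file declares one.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def tei_body_text(xml_string):
    """Return the text inside the first <text> element, skipping the header."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        if local_name(element.tag) == "text":
            # itertext() yields every nested text node in document order.
            raw = "".join(element.itertext())
            # Collapse the runs of whitespace left behind by the markup.
            return " ".join(raw.split())
    return ""


# Hypothetical miniature TEI document, used only for demonstration.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>An <hi>essay</hi> on man.</p></body></text>
</TEI>"""

print(tei_body_text(sample))
```

Real TCP files contain notes, page breaks, and gap markers that a project may want to handle separately; this sketch simply keeps all text content.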

 

British Library Content for Data Mining https://www.bl.uk/collection-guides/datasets-for-content-mining#

A creative 'space' developed by the BL Labs team for researchers to download large 'chunks' of the British Library's openly available data and digital collections so that they can experiment with them and develop new innovative projects.

  • BL Datasets: https://data.bl.uk/bl_labs_datasets/
    There are 150 datasets available (as of 08/09/2020) for you to experiment with from the British Library's research repository. Collections include:

    • Asian and African department (AAS) Card Catalogues (27 datasets)

    • C M Taylor Keylogging Data (8 datasets)

    • Digitised printed books (18th-19th century) (28 datasets)

    • Digitised Hebrew Manuscripts (22 datasets)

    • Ground Truth Transcriptions (3 datasets)

    • India Office Medical Archives (3 datasets)

    • Italian Academies (2 datasets)

    • Judicial Committee of the Privy Council: Linked Appeals Data (1 dataset)

    • Linked Open British National Bibliography (3 datasets)

    • Maps, plans and topographical views (1 dataset)

    • Pelagios Project (7 datasets)

    • Quarterly Lists (2 datasets)

    • "Single Sheet" thematic collections (8 datasets)

    • SherlockNet (7 datasets)

    • UK Web Archive (5 datasets)

    • UK Doctoral Thesis Metadata from ETHOS (4 datasets)

    • 3D representative models (21 datasets)

 

  • 19th Century Printed Books https://data.bl.uk/digbks/
    60K+ digitised volumes (around 25 million pages) published between 1789 and 1900 cover a wide range of subject areas including philosophy, history, poetry and literature.

     

Modern English Collection: 

Public domain texts digitized by the UVA Library 

 

Documenting the Now

https://catalog.docnow.io/ 

“The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application.”

 

Project Gutenberg

https://www.gutenberg.org/ 

Free ebooks in the public domain.  Not just fiction. Browse Bookshelf for categories. 
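
Plain-text files from Project Gutenberg typically wrap the book in a license header and footer, which can skew word counts. The following is a minimal Python sketch for trimming them, assuming the file uses the common "*** START OF THE (or THIS) PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." marker lines; older files may use other wording, in which case the text is returned unchanged apart from surrounding whitespace. The function name and sample text are illustrative.

```python
import re

# Marker patterns are an assumption based on commonly seen Gutenberg files.
START = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START.search(text)
    begin = start.end() if start else 0
    end = END.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Hypothetical miniature file, used only for demonstration.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)

print(strip_gutenberg_boilerplate(sample))
```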

 

US Government data

  • Congress.gov - Bill Status Bulk Data.

  • Govinfo.gov - Bulk Data Repository - multiple datasets:

    • Congressional Bills

    • Bill Status

    • Bill Summaries

    • Commerce Business Daily

    • Code of Federal Regulations (Annual Edition)

    • Electronic Code of Federal Regulations

    • Federal Register

    • United States Government Manual

    • House Rules and Manual

    • Privacy Act Issuances

    • Public and Private Laws

    • Public Papers of the Presidents of the United States

    • Supreme Court Decisions 1937-1975

    • Statutes at Large

 

  • GovInfo.gov - Featured Content.  Browse interesting sources at GovInfo.gov, such as Presidential Inaugural Addresses and a collection of documents in memory of Ruth Bader Ginsburg. You can also browse the US federal government document collection by category; it covers born-digital documents and some digitized ones.

 

 

 

 

Library of Congress data

  • LC for Robots

  • APIs:

    • Loc.gov JSON API

    • Chronicling America API

    • American Archive of Public Broadcasting APIs

    • World Digital Library APIs

  • Bulk data:

    • Bulk data for Congress.gov bills, bill status, and bill summaries

    • MARC records - bibliographic information for most of the Library’s collections. 25 million records are available for exploration in UTF-8, MARC8, and XML formats.

    • Sample MARC data set and ReadMe file

    • Chronicling America Bulk OCR Data – text only

    • Chronicling America Bulk Data – image, metadata, and OCR text batches

    • Dot Gov Datasets – audio, pdfs, and tabular data from .gov domains

    • Web Cultures Datasets – memes and gifs from the American Folklife Center's Web Cultures Web Archive
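
As an illustration of the Loc.gov JSON API, the following minimal Python sketch builds a search URL and shows how the response could be fetched. It assumes the query parameters q (search terms), fo=json (response format), c (results per page), and sp (page number); check the current API documentation before relying on them. The function names are illustrative, and fetch_json is defined but not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LOC_SEARCH = "https://www.loc.gov/search/"


def loc_search_url(query, page=1, per_page=25):
    """Build a loc.gov search URL that asks for a JSON response."""
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    return LOC_SEARCH + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and parse the body as JSON (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


print(loc_search_url("walt whitman"))
```

When harvesting many pages, pause between requests and follow any rate limits the Library of Congress publishes.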

 

DH Toychest 

http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets

Includes demo corpora, which are “sample or toy collections of texts that are ready-to-go for demonstration purposes or hands-on tutorials--e.g., for teaching text analysis, topic modeling, etc.”

 

Ripper Press Reports Dataset 

https://digitalhumanities.wlu.edu/blog/2016/12/12/ripper-dataset/ 

Created by Brandon Walsh (UVA Scholars’ Lab). The dataset features the full texts of 2,677 newspaper articles, published between 1844 and 1988, that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are contemporary to the murders, a handful skew closer to the present, as press reports on later crimes look back to the infamous case. The wide variety of sources gives a sense of how coverage of the case differed by region, date, and publication.

 

Data is Plural

Data is Plural archive.  A weekly newsletter highlighting interesting datasets. Most are numeric or categorical, but the newsletter also lists some textual datasets, including:

  • Airplane confidential. (Text narratives of flight safety, from NASA.)

  • Congressional Research Service, in bulk. 

  • Drug patents and exclusivity, from the FDA

  • One million comic book panels.  

  • Supreme Court Transcripts, this time from Oyez.org

  • Xkcd transcripts

  • Chyrons

  • Index Thomisticus

  • An obviously perfect dataset (about sarcasm)

  • The State of the State of the States (State of the State addresses given by governors)

  • Foreign lobbyists

  • Drama.  (Drama Corpora Project - 800 plays in different languages)

  • Euro-bank speeches

  • Environmental treaties

  • EU laws

  • Coronavirus research papers

  • Pandemic-era economic policies

  • Privacy policies

  • Six million parliamentary speeches (from 9 countries)

  • Poems by kids

  • Police violence at the BLM protests

  • Tech’s BLM statements

  • The Green Books

  • New policing bills

  • House work (House of Representatives Job and Internship Announcements)

 

Data Repositories

  • Dataverse: allows scholars to share, publish, and archive their data, as well as find and cite data across all research fields

https://dataverse.lib.virginia.edu/ (UVA’s Dataverse)

https://dataverse.harvard.edu/ (Harvard’s Dataverse; includes data from many institutions)

 

Tools for Cleaning and Manipulating Your Data

  • Voyant Tools: “A web-based reading and analysis environment for digital texts”

  • Scripting languages (R, Python, etc.) -- Scholars’ Lab or Research Data Services can advise
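
As a small example of what a scripting language can do with a text dataset, the following Python sketch lowercases a text, splits it into words, and counts the most frequent ones. The tokenization rule (runs of letters and apostrophes) is a deliberately simple assumption, and the function names and sample sentence are illustrative; real projects often use a dedicated tokenizer and a stop-word list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(n)


# Hypothetical sample sentence, used only for demonstration.
sample = "The case, the press, and the public: the Press reports."

print(top_words(sample, 2))  # [('the', 4), ('press', 2)]
```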