
Text & Data Mining Sources

Access text and data mining sources and text analysis tools.

Selected Texts

The resources listed on this page may be text and data mined for academic scholarship or educational purposes.

If you're interested in a source not listed below, contact datahelp@gmu.edu or your subject librarian to request information about text and data mining rights. We will need time to review the license agreement and terms of use, so please plan accordingly. Automated text and data mining of a database in violation of its terms of use also violates the University's Responsible Use of Computing policy. For more details, consult the Statement on Appropriate Use of Electronic Resources.

   
Title Description

HathiTrust

A partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining. After accessing HathiTrust, you will need to create an account.
Lesson on Text Mining in Python through the HTRC Feature Reader.

Gale Digital Collections

These texts are in XML format. Currently, the data are available offline. Contact datahelp@gmu.edu to arrange for access and for more details.

List of Gale data purchased for text mining.
Adam Matthew Digital

Secure online access to an API can be provided. Automated software may also extract data from the main collection website, provided the publisher is informed in advance so that it can monitor server performance.

Contact Us to inquire about access.

Berg Fashion Library

Licensee and authorized users may not carry out any text and data mining without proper consent in writing.

Contact Us to inquire about access.

BioMed Central

Provides an open access full-text corpus developed for text mining research.

Brill Academic Publishers

Includes Brill online content and e-books.
Contact Us to inquire about access.

Caselaw Access Project

The Caselaw Access Project (“CAP”) expands public access to U.S. law. Its goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library. Note that you will need to create an account.

Chronicling America

A database of historical newspapers from 1789 to 1922 provided by the Library of Congress. All newspapers are public domain.

CORE

CORE is the world's largest collection of open access research papers. CORE provides access to content and data, through APIs.

Digital Public Library of America (DPLA)

Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries.

Early English Books Online

EEBO Text Creation Partnership (TCP I and TCP II) and Thomason Tracts.

Contact us about text and data mining through the EEBO subscription.

Elsevier Journals

Subscribers may access the text and data mining service online via an API at http://dev.elsevier.com
Contact Us to inquire about access.

English-Corpora.org

The corpora on this site were developed by Mark Davies, Linguistics Professor at BYU. Consult the list of available sources.

Folger Shakespeare Library

Provides digital copies of Shakespeare's plays, poems, and sonnets.
Everything is available in a number of formats, and can also be read online.

General Index

Provides tables of words and short phrases contained in more than 100 million scientific journal articles, along with accompanying metadata. Users have to download the files and develop their own program(s) to mine the content.

Internet Archive

Instructions for developers and those interested in bulk download and API access.

JSTOR Data for Research

A self-service tool that enables exploration of both scholarly journal literature (more than 7 million journal articles) and a set of primary resources (19th Century British Pamphlets).

LC for Robots

A list of the many ways the Library of Congress provides machine-readable access to its digital collections.

Linguistic Data Consortium (LDC)

An open consortium of universities, companies and government research laboratories that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. Offers access to a catalog of LDC resources. Users are required to create a personal account for access to text corpora. 2014 - 2016 & 2020 - present available for download.

National Center for Biotechnology Information (NCBI)

NCBI has developed text mining tools for the analysis of biomedical scholarly publications and other texts.

New York Times

New York Times Developer Portal allows you to access several APIs for mining New York Times publication data.

Oxford Scholarship Online

Includes: Oxford English Dictionary, Oxford Reference Online, Oxford University Press Scholarship Online.
Contact Us to inquire about access.

The Oxford Text Archive

A collection of electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Sources are usually available in a number of formats, and can also be directly uploaded to Voyant.

PLOS

PLOS Search API gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices.

Project Gutenberg

A large collection of free e-books available for download. Most are copyright free in the United States.

ProQuest TDM Studio

TDM Studio puts the power of text and data mining directly in the researcher’s hands. TDM Studio offers a collection of ProQuest rights-cleared content across disciplines and format types and the ability to upload your own datasets. This content also must be part of Mason Libraries' existing ProQuest subscriptions.

TDM Studio is available to Mason researchers who are familiar with using R or Python for mining and analyzing text. Please note that access is not immediate. Researchers need to allow time to get set up with TDM Studio and to account for any researchers who may be ahead of them in the queue. To get started, contact the Digital Scholarship Center (DiSC) at datahelp@gmu.edu to discuss your project and access options.

See also: List of ProQuest Newspapers purchased for text mining. This is raw XML content purchased separately from TDM Studio.

PubMed

PubMed Central hosts a number of important article datasets and makes their APIs and some code available via public code repositories.

ScienceDirect

ScienceDirect API instructions and guide.

Springer

Access to the Springer API.
Contact Us to inquire about access.

Taylor & Francis

Taylor & Francis Group Journals, Routledge, and CRCNetBASE ebooks.
Contact Us to inquire about access.

UCI Machine Learning Repository

The UCI Machine Learning Repository contains a variety of datasets, including text corpora, which can be used for text analysis.

Web of Science

Clarivate Analytics has made access to the Web of Science (WoS) API available. Create a WoS account by accessing WoS Core Collection from the library website. Next, go to their developer site to register your API. Our subscription includes access to the Starter and the Lite APIs.
Contact Us to inquire about access.
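
Several of the sources above, including the Gale and ProQuest data, are delivered as XML, and others, such as the General Index, expect users to write their own programs. As a possible starting point, the following is a minimal Python sketch, using only the standard library, that strips the markup from an XML document and counts word frequencies. The element names in the sample (article, title, body, p) and the function names are illustrative assumptions, not the schema of any vendor's data; consult the documentation supplied with each dataset for its actual structure.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def extract_text(xml_string: str) -> str:
    """Return the text content of an XML document with all tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element and yields its text in document order.
    chunks = (chunk.strip() for chunk in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample record; real datasets will use their own element names.
sample = """<article><title>Mining the News</title>
<body><p>Text mining turns text into data.</p><p>Data supports research.</p></body></article>"""

counts = word_counts(extract_text(sample))
print(counts.most_common(3))
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring. Very large files may call for a streaming parser instead, and any mining must still comply with the source's license and terms of use.
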
Additional Resources

Contact Us to inquire about access to the following titles.

African Blue Books
American Thoracic Society
Choice Reviews Online
Edinburgh University Press Journals
Emerald Publishing
Gallup Analytics
ICE Virtual Library
Institutional Investor Journal
Journal on the Scholarship of Teaching & Learning
Leadership Online
NK News Pro
Ovid LWW Journals
SAGE Publications
SPIE Digital Library
University of California Press
University of Chicago Press Journals