InfoGuides: Text & Data Mining Sources: Access Collections

Access Collections

The resources listed on this page may be text and data mined for academic scholarship or educational purposes. If you intend to use an Artificial Intelligence (AI) tool with any of these resources, contact datahelp@gmu.edu first so we can review the terms of the license and confirm that this is a permissible activity.

If you're interested in a source not listed below, contact datahelp@gmu.edu or your subject librarian to request information about text and data mining rights. We will need time to review the license agreement and terms of use, so please plan accordingly. Carrying out automated text and data mining on a database in violation of its terms of use also violates the University's Responsible Use of Computing policy. For more details, consult the Statement on Appropriate Use of Electronic Resources.

Title	Description
HathiTrust	A partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining. After accessing HathiTrust, you will need to create an account. Lesson on Text Mining in Python through the HTRC Feature Reader.
Gale Digital Collections	These texts are in XML format. Currently, the data are available offline. Contact datahelp@gmu.edu to arrange for access and for more details. List of Gale data purchased for text mining.
Adam Matthew Digital	Secure online access to an API can be provided. Automated software may also extract data from the main collection website, provided the publisher is informed in advance so that it can monitor server performance. Contact Us to inquire about access.
Berg Fashion Library	Licensee and authorized users may not carry out any text and data mining without proper consent in writing. Contact Us to inquire about access.
BioMed Central	Provides an open access full-text corpus developed for text mining research. Instructions for registering for the BMC API key.
Brill Academic Publishers	Includes Brill online content and e-books. Contact Us to inquire about access.
Caselaw Access Project	The Caselaw Access Project (“CAP”) expands public access to U.S. law. Its goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library. Note that you will need to create an account.
Chronicling America	A database of historical newspapers from 1789 to 1922 provided by the Library of Congress. All newspapers are public domain.
CORE	CORE is the world's largest collection of open access research papers. CORE provides access to content and data, through APIs.
Digital Public Library of America (DPLA)	Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries.
Early English Books Online	EEBO Text Creation Partnership (TCP I and TCP II) and Thomason Tracts. Contact us about text and data mining through the EEBO subscription.
Elsevier Journals	The subscriber may access the text and data mining service online via an API. Contact Us to inquire about access.
English-Corpora.org	The corpora on this site were developed by Mark Davies, Linguistics Professor at BYU. Consult the list of available sources. Contact us to request access to the Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA).
Folger Shakespeare Library	Provides digital copies of Shakespeare's plays, poems, and sonnets. Everything is available in a number of formats, and can also be read online.
General Index	Provides tables of words and short phrases contained in more than 100 million scientific journal articles, along with accompanying metadata. Users have to download the files and develop their own program(s) to mine the content.
Internet Archive	Instructions for developers and those interested in bulk download and API access.
JSTOR Data for Research	A self-service tool that enables exploration of both scholarly journal literature (more than 7 million journal articles) and a set of primary resources (19th Century British Pamphlets).
LC for Robots	A list of the many ways the Library of Congress provides machine-readable access to its digital collections.
Linguistic Data Consortium (LDC)	An open consortium of universities, companies and government research laboratories that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. Offers access to a catalog of LDC resources. Users are required to create a personal account for access to text corpora. 2014 - 2016 & 2020 - present available for download.
National Center for Biotechnology Information (NCBI)	NCBI has developed text mining tools for the analysis of biomedical scholarly publications and other texts.
New York Times	New York Times Developer Portal allows you to access several APIs for mining New York Times publication data.
Oxford Scholarship Online	Includes: Oxford English Dictionary, Oxford Reference Online, Oxford University Press Scholarship Online. Contact Us to inquire about access.
The Oxford Text Archive	A collection of electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Sources are usually available in a number of formats, and can also be directly uploaded to Voyant.
The Pile	The Pile: An 800GB Dataset of Diverse Text for Language Modeling is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources.
PLOS	PLOS Search API gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices.
Project Gutenberg	A large collection of free e-books available for download. Most are copyright free in the United States.
ProQuest TDM Studio	TDM Studio is ProQuest's text data mining platform that enables TDM on licensed ProQuest content. TDM Studio consists of two components: the Workbench is designed for experienced researchers who use their own coding methodologies, and Visualization is designed for users of all levels to quickly spot trends and generate insights. For more information, see the ProQuest TDM Studio infoguide. See also: List of ProQuest Newspapers purchased for text mining. This is raw XML content purchased separately from TDM Studio.
PubMed	PubMed Central hosts a number of important article datasets and makes their APIs and some code available via public code repositories.
ScienceDirect	ScienceDirect API instructions and guide.
Springer	Access to the SpringerNature API Playground. Contact Us to inquire about access to additional Springer APIs.
Taylor & Francis	Taylor & Francis Group Journals, Routledge, and CRCNetBASE ebooks. Contact Us to inquire about access.
UCI Machine Learning Repository	The UCI Machine Learning Repository contains a variety of datasets, including text corpora, which can be used for text analysis.
Web of Science	Clarivate Analytics provides API access to the Web of Science (WoS). Create a WoS account by accessing WoS Core Collection from the library website. Next, go to the Clarivate developer site to register for API access. Our subscription includes access to the Starter API and the Expanded API. Contact Us to inquire about access.
Additional Resources	Contact Us to inquire about access to the following titles: African Blue Books; American Thoracic Society; Choice Reviews Online; Edinburgh University Press Journals; Emerald Publishing; Gallup Analytics; ICE Virtual Library; Institutional Investor Journal; Journal on the Scholarship of Teaching & Learning; Leadership Online; NK News Pro; Ovid LWW Journals; SAGE Publications; SPIE Digital Library; University of California Press; University of Chicago Press Journals
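
Several of the collections above deliver raw XML (for example, the Gale and ProQuest purchases), and others, such as the General Index, expect users to write their own programs to mine downloaded files. As a rough starting point, the following minimal sketch pulls the text out of an XML record and counts word frequencies using only the Python standard library. The sample record and its element names (`article`, `title`, `body`, `p`) are hypothetical and are not taken from any vendor's schema; real files will need their own handling, and any mining must still comply with the license terms described at the top of this page.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample record for illustration only.
# Real vendor XML will use different element names and structure.
SAMPLE_XML = """<article>
  <title>The Daily Record</title>
  <body>
    <p>The mill opened in spring.</p>
    <p>The mill closed in winter.</p>
  </body>
</article>"""


def extract_text(xml_string):
    """Return all text content of an XML document as one string."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element and yields its text, ignoring the tags.
    chunks = (chunk.strip() for chunk in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = word_counts(extract_text(SAMPLE_XML))
print(counts.most_common(3))
```

To work with a downloaded file rather than an inline string, read the file's contents and pass them to `extract_text`. For large collections, consider processing one file at a time and merging the resulting counters, since `Counter` objects can be added together.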