The resources listed on this page may be text and data mined for academic scholarship or educational purposes. If you intend to use an Artificial Intelligence (AI) tool with any of these resources, contact datahelp@gmu.edu first so we can review the terms of the license and ensure that this is a permissible activity.
If you're interested in a source not listed below, contact datahelp@gmu.edu or your subject librarian to request information about text and data mining rights. We will need time to review the license agreement and terms of use, so please plan accordingly. Carrying out automated text and data mining on a database in violation of its terms of use is also a violation of the University's Responsible Use of Computing policy. For more details, consult the Statement on Appropriate Use of Electronic Resources.
Title | Description |
---|---|
HathiTrust | A partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining. After accessing HathiTrust, you will need to create an account. Lesson on Text Mining in Python through the HTRC Feature Reader. |
Gale Digital Collections | These texts are in XML format. Currently, the data are available offline. Contact datahelp@gmu.edu to arrange for access and for more details. List of Gale data purchased for text mining. |
Adam Matthew Digital | Secure online access to an API can be provided. Data can also be extracted from the main collection website by automated software, provided the publisher is informed in advance so that it can monitor server performance. Contact Us to inquire about access. |
Berg Fashion Library | Licensee and authorized users may not carry out any text and data mining without proper consent in writing. Contact Us to inquire about access. |
BioMed Central | Provides an open access full-text corpus developed for text mining research. |
Brill Academic Publishers | Includes Brill online content and e-books. Contact Us to inquire about access. |
Caselaw Access Project | The Caselaw Access Project (“CAP”) expands public access to U.S. law. Its goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library. Note that you will need to create an account. |
Chronicling America | A database of historical newspapers from 1789 to 1922 provided by the Library of Congress. All newspapers are public domain. |
CORE | CORE is the world's largest collection of open access research papers. CORE provides access to content and data through APIs. |
Digital Public Library of America (DPLA) | Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries. |
EEBO Text Creation Partnership (TCP I and TCP II) and Thomason Tracts | Contact Us about text and data mining through the EEBO subscription. |
Elsevier Journals | The subscriber may access the text and data mining service online via an API at http://dev.elsevier.com. Contact Us to inquire about access. |
English-Corpora.org | The corpora on this site were developed by Mark Davies, Linguistics Professor at BYU. Consult the list of available sources. |
Folger Shakespeare Library | Provides digital copies of Shakespeare's plays, poems, and sonnets. |
General Index | Provides tables of words and short phrases contained in more than 100 million scientific journal articles, along with accompanying metadata. Users have to download the files and develop their own program(s) to mine the content. |
Internet Archive | Instructions for developers and those interested in bulk download and API access. |
JSTOR Data for Research | A self-service tool that enables exploration of both scholarly journal literature (more than 7 million journal articles) and a set of primary resources (19th Century British Pamphlets). |
LC for Robots | A list of the many ways the Library of Congress provides machine-readable access to its digital collections. |
Linguistic Data Consortium (LDC) | An open consortium of universities, companies, and government research laboratories that creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes. Offers access to a catalog of LDC resources. Users are required to create a personal account for access to text corpora. LDC data from 2014 to 2016 and from 2020 to the present are available for download. |
National Center for Biotechnology Information (NCBI) | NCBI has developed text mining tools for the analysis of biomedical scholarly publications and other texts. |
New York Times | New York Times Developer Portal allows you to access several APIs for mining New York Times publication data. |
Oxford Scholarship Online | Includes: Oxford English Dictionary, Oxford Reference Online, Oxford University Press Scholarship Online. Contact Us to inquire about access. |
The Oxford Text Archive | A collection of electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Sources are usually available in a number of formats, and can also be directly uploaded to Voyant. |
PLOS | PLOS Search API gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices. |
Project Gutenberg | A large collection of free e-books available for download. Most are copyright free in the United States. |
ProQuest TDM Studio | TDM Studio is ProQuest's text data mining platform that enables TDM on licensed ProQuest content. TDM Studio consists of two components: the Workbench is designed for experienced researchers who use their own coding methodologies, and Visualization is designed for users of all levels to quickly spot trends and generate insights. For more information, see the ProQuest TDM Studio infoguide. |
PubMed | PubMed Central hosts a number of important article datasets and makes their APIs and some code available via public code repositories. |
ScienceDirect | ScienceDirect API instructions and guide. |
Springer | Access to the Springer API. Contact Us to inquire about access. |
Taylor & Francis | Taylor & Francis Group Journals, Routledge, and CRCNetBASE ebooks. Contact Us to inquire about access. |
UCI Machine Learning Repository | The UCI Machine Learning Repository contains a variety of datasets, including text corpora, which can be used for text analysis. |
Web of Science | Clarivate Analytics has made access to the Web of Science (WoS) API available. Create a WoS account by accessing WoS Core Collection from the library website. Next, go to their developer site to register your API. Our subscription includes access to the Starter API. Contact Us to inquire about access. |
Additional Resources | Contact Us to inquire about access to the following titles: African Blue Books. |
Copyright © George Mason University