The resources listed on this page may be text and data mined for academic scholarship or educational purposes.
If you're interested in a source not listed below, contact datahelp@gmu.edu or your subject librarian to request information about text and data mining rights. We will need time to review the license agreement and terms of use, so please plan accordingly. Carrying out automated text and data mining on a database in violation of its terms of use is also a violation of the University's Responsible Use of Computing policy. For more details, consult the Statement on Appropriate Use of Electronic Resources.
Title | Description |
---|---|
HathiTrust | A partnership of academic and research institutions offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining. After accessing HathiTrust, you will need to create an account. See also the lesson on Text Mining in Python through the HTRC Feature Reader. |
Gale Digital Collections | These texts are in XML format. Currently, the data are available offline. Contact datahelp@gmu.edu to arrange for access and for more details. List of Gale data purchased for text mining. |
Adam Matthew Digital | Secure online access to an API can be provided. Automated software may also extract data from the main collection website, provided the publisher is informed in advance so that it can monitor server performance. Contact Us to inquire about access. |
Berg Fashion Library | Licensee and authorized users may not carry out any text and data mining without proper consent in writing. Contact Us to inquire about access. |
BioMed Central | Provides an open access full-text corpus developed for text mining research. |
Brill Academic Publishers | Includes Brill online content and e-books. Contact Us to inquire about access. |
Caselaw Access Project | The Caselaw Access Project (“CAP”) expands public access to U.S. law. Its goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library. Note that you will need to create an account. |
Chronicling America | A database of historical newspapers from 1789 to 1922 provided by the Library of Congress. All newspapers are public domain. |
CORE | CORE is the world's largest collection of open access research papers. It provides access to content and data through APIs. |
Digital Public Library of America (DPLA) | Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries. |
EEBO | Text Creation Partnership (TCP I and TCP II) and Thomason Tracts. Contact Us about text and data mining through the EEBO subscription. |
Elsevier Journals | The subscriber may access the text and data mining service online via an API at http://dev.elsevier.com. Contact Us to inquire about access. |
English-Corpora.org | The corpora on this site were developed by Mark Davies, Linguistics Professor at BYU. Consult the list of available sources. |
Folger Shakespeare Library | Provides digital copies of Shakespeare's plays, poems, and sonnets. |
General Index | Provides tables of words and short phrases contained in more than 100 million scientific journal articles, along with accompanying metadata. Users have to download the files and develop their own program(s) to mine the content. |
Internet Archive | Instructions for developers and those interested in bulk download and API access. |
JSTOR Data for Research | A self-service tool that enables exploration of both scholarly journal literature (more than 7 million journal articles) and a set of primary resources (19th Century British Pamphlets). |
LC for Robots | A list of the many ways the Library of Congress provides machine-readable access to its digital collections. |
Linguistic Data Consortium (LDC) | An open consortium of universities, companies and government research laboratories that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. Offers access to a catalog of LDC resources. Users are required to create a personal account for access to text corpora. 2014 - 2016 & 2020 - present available for download. |
National Center for Biotechnology Information (NCBI) | NCBI has developed text mining tools for the analysis of biomedical scholarly publications and other texts. |
New York Times | New York Times Developer Portal allows you to access several APIs for mining New York Times publication data. |
Oxford Scholarship Online | Includes: Oxford English Dictionary, Oxford Reference Online, Oxford University Press Scholarship Online. Contact Us to inquire about access. |
The Oxford Text Archive | A collection of electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Sources are usually available in a number of formats, and can also be directly uploaded to Voyant. |
PLOS | PLOS Search API gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices. |
Project Gutenberg | A large collection of free e-books available for download. Most are copyright free in the United States. |
ProQuest TDM Studio | TDM Studio puts the power of text and data mining directly in the researcher’s hands. TDM Studio offers a collection of ProQuest rights-cleared content across disciplines and format types and the ability to upload your own datasets. This content also must be part of Mason Libraries' existing ProQuest subscriptions. TDM Studio is available to Mason researchers who are familiar with using R or Python for mining and analyzing text. Please note that access is not immediate. Researchers need to allow time to get set up with TDM Studio and to account for any researchers who may be ahead of them in the queue. To get started, contact Data & Digital Scholarship Services (DDSS) at datahelp@gmu.edu to discuss your project and access options. See also: List of ProQuest Newspapers purchased for text mining. This is raw XML content purchased separately from TDM Studio. |
PubMed | PubMed Central hosts a number of important article datasets and makes their APIs and some code available via public code repositories. |
ScienceDirect | ScienceDirect API instructions and guide. |
Springer | Access to the Springer API. Contact Us to inquire about access. |
Taylor & Francis | Taylor & Francis Group Journals, Routledge, and CRCNetBASE ebooks. Contact Us to inquire about access. |
UCI Machine Learning Repository | The UCI Machine Learning Repository contains a variety of datasets, including text corpora, which can be used for text analysis. |
Web of Science | Clarivate Analytics has made access to the Web of Science (WoS) API available. Create a WoS account by accessing WoS Core Collection from the library website. Next, go to their developer site to register your API. Our subscription includes access to the Starter API. Contact Us to inquire about access. |
Additional Resources | Contact Us to inquire about access to the following titles: African Blue Books |
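
Several of the sources above deliver raw XML (for example, Gale Digital Collections and the ProQuest newspapers), and the General Index expects users to write their own programs. As a rough illustration only, the following Python sketch shows one way to pull plain text out of an XML document and count word frequencies using the standard library. The sample document, its element names (`document`, `title`, `body`, `p`), and the helper names `extract_text` and `word_counts` are hypothetical; actual vendor files follow their own schemas, so the parsing step would need to be adapted.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; real vendor XML uses a different schema.
SAMPLE_XML = """<document>
  <title>Sample Article</title>
  <body>
    <p>The library provides data. The data are in XML.</p>
    <p>Contact the library for access.</p>
  </body>
</document>"""


def extract_text(xml_string):
    """Return all text content of the XML document as one normalized string."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element and yields its text in document order.
    raw = " ".join(root.itertext())
    # Collapse runs of whitespace left over from the XML indentation.
    return " ".join(raw.split())


def word_counts(text):
    """Count lowercase alphabetic tokens in the given text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


text = extract_text(SAMPLE_XML)
counts = word_counts(text)
print(counts.most_common(3))
```

For a real project, the same two steps (extract text, then tokenize and count) would typically run over many files, and the tokenizer would likely be replaced with one suited to the language and period of the texts.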
Copyright © George Mason University