University Libraries

Text & Data Mining

Access text and data mining sources and text analysis tools.

Text Mining Explained

Text mining and analysis are used to identify major trends across a large number of documents. Text mining is performed by using software or a programming language (e.g., Python) to analyze a corpus of text in order to identify key trends, such as patterns of word usage or changes in vocabulary over time. One example is counting the number of times a specific word appears in all of Shakespeare's plays. Instead of a person reading through all of the plays and counting by hand, a computer can perform the same task in a fraction of the time.
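The word-counting example above can be sketched in a few lines of Python. This is a minimal illustration, not a full workflow: the function name count_word and the tiny two-entry corpus are assumptions made for the example, and a real project would load the complete text of each play from files.

```python
import re

def count_word(texts, word):
    """Count case-insensitive, whole-word occurrences of `word` in each text."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return {title: len(pattern.findall(body)) for title, body in texts.items()}

# Placeholder corpus (an assumption): one short line per play stands in
# for the full text you would normally read from disk.
corpus = {
    "Hamlet": "To be, or not to be, that is the question.",
    "Macbeth": "Tomorrow, and tomorrow, and tomorrow.",
}

counts = count_word(corpus, "to")   # per-play counts
total = sum(counts.values())        # count across the whole corpus
print(counts, total)
```

Matching on word boundaries means that "to" is not counted inside "Tomorrow", which is usually what a researcher wants when tracking a specific word.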

Text analysis examples from the HathiTrust Research Center (HTRC)

Google Books Ngram Viewer

Plan, Clean, & Process

Plan. The first thing to do before starting any text mining project is to plan what your final product will be. The final product is typically framed by a research question that lends itself to a particular tool or visualization.

Clean. Cleaning and parsing the text before uploading it to any tool will help to streamline the process.

  • Note that sections of publications such as forewords, acknowledgements, and bibliographies can skew the results.
  • Other things to remove prior to mining include page numbers, tables of contents, and captions. Some researchers also remove stop words (very common words such as "the" and "of") just before uploading the text.
  • Be aware that capitalized and non-capitalized words are counted separately, and spelling errors can distort the results.
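Some of the cleaning steps above can be sketched in Python. This is a simplified illustration under stated assumptions: the function name clean_text, the very short stop-word list, and the rule for spotting page-number lines (a line that is only a number, optionally preceded by "Page") are all hypothetical choices, and the tokenizer keeps only unaccented English letters and apostrophes.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def clean_text(raw, remove_stop_words=False):
    """Drop page-number lines, lowercase the text, and split it into words."""
    kept_lines = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip lines that contain only a page number, e.g. "12" or "Page 12".
        if re.fullmatch(r"(page\s+)?\d+", stripped, flags=re.IGNORECASE):
            continue
        kept_lines.append(stripped)
    # Lowercasing makes "Whale" and "whale" count as the same word.
    text = " ".join(kept_lines).lower()
    tokens = re.findall(r"[a-z']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

sample = "The Whale\n12\nThe whale is in the sea.\nPage 13\nA WHALE!"
tokens = clean_text(sample, remove_stop_words=True)
print(tokens)
```

Sections such as forewords, bibliographies, and captions usually have to be removed by hand or with rules specific to each source, because their layout varies from one publication to another.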

Process. Once you have your corpus ready for analysis, it is time to run it through the tool of your choice.

  • Some tools found in this guide are easier to use than others, but the easier tools generally offer less flexibility.
  • As a general rule, web-based tools are easier to use than downloadable software. The main drawback of online tools is that they tend to limit the size of the corpus, and some even require users to make their corpus public.

Tip: If you are unsure of what you want your final output to be, choose a tool that provides multiple outputs at one time.

Voyant is one tool that allows you to see some of the options available. You can choose which direction to go from there.

Related Guides

Limited Access Data Sets lists data sets that must be used in the Digital Scholarship Center or that have another restriction. To use these data sources, please contact us first. We have limited staff and we want to be sure that someone is available to help you. Use of these data sources is restricted to current George Mason University students, staff, and faculty.

Citing Data provides detailed information and examples on how you should cite data in your research.

Data Visualization provides guidelines and best practices for visualizing data with a wide range of tools.

Find Data for Analysis: Data for Practice & Projects links to resources that include text corpora among their list of data sources.

Research Data Management Basics explains best practices for managing your data and keeping it organized.

Software for Digital Scholarship: see R, NVivo, and QDAMiner.