
Text Analysis Tools

A companion to our Text and Data Mining Sources infoguide, this guide walks you through how to use several text analysis tools.

Where to Start

1. Plan

The first thing to do before starting any text mining project is to envision your final product or project. Think about your research question(s) and what visualization(s) or methods would work best to explore that question. If you are unsure of what you want your final output to be, choose a tool that provides multiple outputs at one time. Voyant is one tool that allows you to see many visualizations at once. 

2. Clean

Cleaning and parsing the text before uploading to any tool will help to streamline the process. Keep the following in mind:

  • Sections of publications such as forewords, acknowledgements, and bibliographies can skew the results. 
  • You should remove page numbers, tables of contents, and captions.
  • Some researchers remove stop words prior to uploading the text. Some tools, like Voyant, include a stop words list and will remove them for you. 
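  • Be aware that many tools count capitalized and non-capitalized forms of a word separately (for example, "Whale" and "whale"), and that spelling errors can split one word into several entries, which can make the results misleading. 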
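
If you are comfortable with a little scripting, the cleaning steps above can be sketched in Python. This is only a minimal illustration, not a complete cleaning workflow: the stop word list is a tiny made-up sample, the page-number pattern assumes page numbers sit on their own line (such as "12" or "Page 12"), and the tokenizer keeps only unaccented English letters and apostrophes. The function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
# Real projects usually use a much longer list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean_text(raw: str) -> list[str]:
    """Return lowercase word tokens with page-number lines and stop words removed."""
    kept_lines = []
    for line in raw.splitlines():
        # Assumption: a page number appears alone on its line, e.g. "12" or "Page 12".
        if re.fullmatch(r"\s*(page\s+)?\d+\s*", line, flags=re.IGNORECASE):
            continue
        kept_lines.append(line)

    # Lowercase so that "Whale" and "whale" are counted as the same word.
    text = " ".join(kept_lines).lower()

    # Keep runs of ASCII letters and apostrophes; accented letters would be dropped.
    tokens = re.findall(r"[a-z']+", text)
    return [token for token in tokens if token not in STOP_WORDS]


def word_counts(raw: str) -> Counter:
    """Count how often each cleaned token appears."""
    return Counter(clean_text(raw))
```

For example, `word_counts(open("chapter1.txt", encoding="utf-8").read()).most_common(20)` would list the twenty most frequent remaining words in a file named `chapter1.txt` (a placeholder file name). Removing whole sections such as bibliographies is usually easier to do by hand before running a script like this.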

3. Process

Once you have your corpus ready for analysis, you can run it through the tool of your choice. As a general rule, web-based tools are easier to use than downloadable software. The main drawbacks of online tools are that they can limit the size of the corpus being analyzed, and some require users to make their corpus public. 

A good tutorial to review before starting your project is Text Analysis with the HathiTrust Research Center. These slides provide an introduction to text analysis and the research methods and workflows it encompasses. 

OCRing Documents

As part of preparing for your text analysis project, you might need to use optical character recognition (OCR) software on the documents that make up your corpus. OCR is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. OCR allows printed texts to be electronically edited, searched, and stored, and is often used in machine processes like text analysis. 

The DiSC Lab has access to ABBYY FineReader, an OCR application that converts image documents (photos, scans, PDFs) into editable file formats (Word, Excel, RTF, HTML, searchable PDF, CSV, and txt). To use ABBYY FineReader, email datahelp@gmu.edu to set up an appointment. View our step-by-step guide on how to use ABBYY FineReader. 

You can also use the free software Tesseract to OCR your documents. Tesseract does not include a GUI, so you should be comfortable working on the command line to use it effectively. 
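
If you have many page images to process, one option is to call Tesseract from a short Python script. The sketch below assumes Tesseract is already installed and available on your PATH, that your input is an image file such as a PNG or TIFF, and that English (`eng`) is the language you want. The function names and file names are hypothetical. It uses Tesseract's basic command form, `tesseract <image> <output base> -l <language>`, which writes the recognized text to `<output base>.txt`.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the command: tesseract <image> <output base> -l <language>."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_image(image_path, output_base, language="eng"):
    """Run Tesseract on one image and return the path of the resulting text file."""
    # Assumption: the tesseract executable is installed and on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract was not found on the PATH.")

    # check=True raises an error if Tesseract exits with a failure code.
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,
    )

    # Tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt")


# Example use (file names are placeholders):
# text_file = ocr_image("page_001.png", "page_001")
# print(text_file.read_text(encoding="utf-8"))
```

OCR output usually still needs the cleaning described in step 2, since recognition errors show up as misspellings in your results.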

Resources