InfoGuides: Text Analysis Tools: Where to Start

Where to Start

1. Plan

Before starting any text mining project, envision your final product. Think about your research question(s) and which visualization(s) or methods would work best to explore them. If you are unsure of what you want your final output to be, choose a tool that provides multiple outputs at one time. Voyant is one tool that allows you to see many visualizations at once.

2. Clean

Cleaning and parsing the text before uploading to any tool will help to streamline the process. Keep the following in mind:

Sections of publications such as forewords, acknowledgements, and bibliographies can skew the results.
You should remove page numbers, tables of contents, and captions.
Some researchers remove stop words prior to uploading the text. Some tools, like Voyant, include a stop words list and will remove them for you.
Be aware that capitalized and non-capitalized words are counted separately, and spelling errors can distort the results.
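
The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not a recommended pipeline: the stop word list is a tiny hypothetical one, the function name clean_text is made up for this example, and the sketch assumes that any line containing only digits is a page number and that the text uses only unaccented English letters.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean_text(raw: str) -> list[str]:
    """Return lowercase word tokens with page-number lines and stop words removed."""
    kept_lines = []
    for line in raw.splitlines():
        # Assumption: a line holding nothing but digits is a page number.
        if re.fullmatch(r"\s*\d+\s*", line):
            continue
        kept_lines.append(line)
    # Lowercasing makes "Whale" and "whale" count as the same word.
    text = " ".join(kept_lines).lower()
    # Assumption: only the letters a-z matter; accented characters are dropped.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


sample = "The Whale and the whale\n12\nThe sea is wide"
counts = Counter(clean_text(sample))
print(counts.most_common(3))
```

If your tool already removes stop words for you, as Voyant does, you could skip that step and keep only the lowercasing and page-number removal.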

3. Process

Once you have your corpus ready for analysis, you can run it through the tool of your choice. As a general rule, web-based tools are easier to use than downloadable software. The main drawbacks of online tools are that they can limit the size of the corpus being analyzed, and some force users to make their corpus public.

A good tutorial to review before starting your project is Text Analysis with the HathiTrust Research Center. The modules provide an introduction to text analysis and the research methods and workflows it encompasses.

OCRing Documents

As part of preparing for your text analysis project, you might need to use optical character recognition (OCR) software on the documents that make up your corpus. OCR is the electronic conversion of typed, handwritten, or printed text into machine-encoded text. OCR allows printed texts to be electronically edited, searched, and stored, and is often used in machine processes like text analysis.

The DDSS Lab has access to ABBYY FineReader, an OCR application that converts image documents (photos, scans, PDFs) into editable file formats (Word, Excel, RTF, HTML, searchable PDF, CSV, and txt). To use ABBYY FineReader, email datahelp@gmu.edu to set up an appointment.

You can also use the free software Tesseract to OCR your documents. You should have experience using the command line in order to use Tesseract effectively, as it does not include a GUI.
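
As a rough sketch of what working with Tesseract from a script can look like, the Python below builds the basic command (tesseract, then the image file, then an output base name, then -l and a language code) and runs it over a folder of images. The function names build_tesseract_command and ocr_folder, the folder layout, and the choice of PNG files are assumptions made for this example. It also assumes Tesseract is already installed and available on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> list[str]:
    """Build the command: tesseract IMAGE OUTPUTBASE -l LANG.

    Tesseract writes the recognized text to OUTPUTBASE.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_folder(folder: str, lang: str = "eng") -> None:
    """Run Tesseract on every PNG in a folder (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract was not found on PATH; install it first.")
    for image in sorted(Path(folder).glob("*.png")):
        # page1.png produces page1.txt next to the image.
        output_base = str(image.with_suffix(""))
        cmd = build_tesseract_command(str(image), output_base, lang)
        subprocess.run(cmd, check=True)


# Show the command that would be run for a single (hypothetical) scan.
print(" ".join(build_tesseract_command("scan.png", "scan")))
```

Calling ocr_folder("scans") would then write one text file per image, which you can clean before uploading to an analysis tool.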

Resources

Illinois Library, Introduction to OCR and Searchable PDFs
Moritz Mähr, Working with Batches of PDF Files
NYU Libraries Scholarly Communications and Information Policy Department, ABBYY FineReader Tutorial
Laura Turner O'Hara, Cleaning OCR'ed Text with Regular Expressions
