1. Plan
The first thing to do before starting any text mining project is to envision your final product or project. Think about your research question(s) and what visualization(s) or methods would work best to explore that question. If you are unsure of what you want your final output to be, choose a tool that provides multiple outputs at one time. Voyant is one tool that allows you to see many visualizations at once.
2. Clean
Cleaning and parsing the text before uploading to any tool will help to streamline the process. Keep the following in mind:
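As a rough illustration of what cleaning can involve, the Python sketch below applies a few common steps to a plain-text document: normalizing Unicode, rejoining words hyphenated across line breaks, lowercasing, removing punctuation, and collapsing whitespace. The function name `clean_text` and the choice of steps are assumptions made for this example, not a prescribed workflow. Which steps are appropriate depends on your research question.

```python
import re
import unicodedata


def clean_text(raw):
    """Return a simplified copy of `raw` for word-level analysis.

    The steps here are illustrative assumptions; adjust or drop
    any of them to suit your own corpus and research question.
    """
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break (common in scans).
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Lowercase so "Text" and "text" are counted as the same word.
    text = text.lower()
    # Replace punctuation (anything that is not a letter, digit,
    # underscore, or whitespace) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("The  Quick,\nbrown FOX!"))
```

Steps such as lowercasing and punctuation removal discard information, so keep an untouched copy of the original files alongside the cleaned versions.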
3. Process
Once you have your corpus ready for analysis, you can run it through the tool of your choice. As a general rule, web-based tools are easier to use than downloadable software. The main drawback of online tools is that they can limit the size of the corpus being analyzed, and some require users to make their corpus public.
A good tutorial to review before starting your project is Text Analysis with the HathiTrust Research Center. These slides provide an introduction to text analysis and the research methods and workflows it encompasses.
As part of preparing for your text analysis project, you might need to use optical character recognition (OCR) software on the documents that make up your corpus. OCR is the electronic conversion of typed, handwritten, or printed text into machine-encoded text. It allows printed texts to be electronically edited, searched, and stored, and is often used to prepare documents for machine processes like text analysis.
The DDSS Lab has access to ABBYY FineReader, an OCR application that converts image documents (photos, scans, PDFs) into editable file formats (Word, Excel, RTF, HTML, searchable PDF, CSV, and txt). To use ABBYY FineReader, email datahelp@gmu.edu to set up an appointment. View our step-by-step guide on how to use ABBYY FineReader.
You can also use the free software Tesseract to OCR your documents. Because Tesseract does not include a GUI, you should be comfortable working on the command line in order to use it effectively.
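For a sense of what running Tesseract looks like, the sketch below wraps its basic command-line form (`tesseract <image> <output base> -l <language>`) in a small Python script. It assumes Tesseract is already installed and on your PATH. The function names and the file names `scan.png` and `ocr_output` are hypothetical and chosen only for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for a basic Tesseract run.

    Tesseract appends ".txt" to `output_base` when it writes its result.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_image(image_path, output_dir, language="eng"):
    """Run Tesseract on one image and return the path of the text file."""
    # Assumption: the `tesseract` executable is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract was not found on this system.")
    image_path = Path(image_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Name the output after the image, e.g. scan.png -> scan.txt
    output_base = output_dir / image_path.stem
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Hypothetical file and folder names, used only for illustration.
    print(ocr_image("scan.png", "ocr_output"))
```

OCR output usually contains recognition errors, so review a sample of the resulting text before analyzing it.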
Copyright © George Mason University