George Mason University | University Libraries

QUANTitative Analysis & Statistics

For those taking statistics and [quantitative] data analysis courses and/or doing a data analysis project.

When I have a consultation with someone, this is the information I want to know. Each greatly affects the future of the project and what needs to be done. So, it is important that you consider these issues ahead of time yourself.

1. Project

What are the characteristics of the data file that will be needed to answer the question(s) you have, given the complexity of the analysis you plan to use?

  • What type of analysis are you doing? Depending on requirements related to your field and the type of project you are completing (e.g., dissertation), analyses could include: a) Descriptive Statistics, b) Bivariate Inferential Statistics, c) Standard multivariate analyses such as ANOVA and Multiple Regression (Linear or Logistic), or d) other Modeling (including GLM, Mixed Models, or SEM). This will influence how much data is needed, which software is necessary, and the criteria the data must meet.
  • What are your research questions? It is useful to be able to conceptualize the ultimate data table you will need and even create it with some example data. The research questions should define a) the unit of analysis, b) the population (observations), and c) the constructs (variables).
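As an illustration of mocking up the ultimate data table, here is a minimal Python sketch. The research question, the variable names (`student_id`, `class_year`, `study_hours`, `gpa`), the values, and the helper `describe_table` are all invented for this example; substitute your own unit of analysis and constructs.

```python
# Hypothetical research question: "Is weekly study time related to GPA
# among undergraduate students?"
# Unit of analysis: one student per row. Constructs: study_hours and gpa.
example_rows = [
    {"student_id": 1, "class_year": "Sophomore", "study_hours": 12.0, "gpa": 3.4},
    {"student_id": 2, "class_year": "Senior", "study_hours": 5.5, "gpa": 2.9},
    {"student_id": 3, "class_year": "Freshman", "study_hours": 20.0, "gpa": 3.8},
]


def describe_table(rows, id_column):
    """Summarize a mock table: row count, variable names, and ID uniqueness."""
    ids = [row[id_column] for row in rows]
    return {
        "n_observations": len(rows),
        "variables": [name for name in rows[0] if name != id_column],
        # One row per unit of analysis means no ID appears twice.
        "one_row_per_unit": len(ids) == len(set(ids)),
    }


summary = describe_table(example_rows, "student_id")
print(summary)
```

Even a three-row mock table like this makes it easier to see whether each research question maps onto a column you will actually have.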

2. Observations

Is the data you have capable of being the data you need?

  • What is the unit of observation? Is the unit of observation a person, a place, or something else? Is it the same as the unit of analysis? If you have multiple data tables/files, is the unit of observation the same in each, and how will the rows match up? You can aggregate your data to make the units bigger, but not the opposite. Are there repeated measures, multi-level, time series, or panel data? If so, you might need to reshape the data or use special analyses.
  • How [well] does the data reflect the target population? How much of the data is actually from your target population? Both primary and secondary data may have missing data or extra observations that are not relevant, but it is important to end up with a sufficient number of observations. Also, what type of sampling method was used, if any? Were snowball techniques or complex sampling methods used to select the sample? Do you even want to generalize from your sample to the population, and if so, how much work will it be?
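To show what aggregating to a bigger unit can look like, here is a minimal Python sketch that collapses person-level rows into one record per state. The data, the column names, and the helper `aggregate_mean` are made up for illustration, and the mean is only one of several possible summaries.

```python
from collections import defaultdict
from statistics import mean

# Made-up person-level rows (unit of observation: a person).
people = [
    {"person_id": 1, "state": "VA", "income": 52000},
    {"person_id": 2, "state": "VA", "income": 61000},
    {"person_id": 3, "state": "MD", "income": 58000},
]


def aggregate_mean(rows, group_key, value_key):
    """Collapse rows to one record per group, keeping group size and mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"n": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }


# New unit of observation: a state.
by_state = aggregate_mean(people, "state", "income")
print(by_state)
```

The reverse is not possible: once only the state-level means are stored, the individual incomes cannot be recovered from them.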

3. Data Files

What data importing and cleaning steps will you need to do to even get started? Do you have enough time and knowledge to work with your data?

  • Where did the data come from? The data source gives a good idea of the data quality and how much time and effort will be needed to process it. Large, reputable data collectors will have clean data with lots of documentation. Smaller sources might need more checking for data errors and may not have all the information necessary. If you have collected your own data, you may not need to narrow down the variables needed, but there are various cleaning steps you will need to take depending on the mode of collection (e.g., mail vs online), population (e.g., paid vs volunteer), and other factors.
  • What format is the data in? Is it on paper, in Excel, a file in SPSS format, or something else? Is there just one file/table, or multiple? The data format affects how quickly you will be able to use the data in your statistical software. Recent versions of all major software packages can directly open each other's files (and Qualtrics can export to SPSS format). But conversions always require checking. Older or smaller data may even come in a text format, like CSV or fixed format, that is more difficult to open and does not already have labels.

Frequency Tables

Frequency Tables - What to look for

Unless you know a variable is continuous or open-ended text, generate a frequency table. Note that the question label alone is not enough to tell (e.g., age could be measured in any number of ways). Look for the following:

1. Overall Length and Values ► Right Track?

Is a frequency table appropriate? What do you see in the possible values overall?  

  • ≈10 values or fewer? A frequency table is a good start; go to step 2.
  • Otherwise...
    • Mostly text? Check the values and/or documentation to determine whether this represents open-ended text (if so, see instructions for that variable type). Otherwise, it may be a large grouping variable (like Occupation, State, or Country). You may simply want to report the most frequent groups. But if it makes sense to reduce the number of categories for analysis, then do so and re-evaluate. If not, this should have been anticipated and you should already have a plan for it.
    • Mostly numbers? First determine whether this might represent an ID number, especially if there is a consistent frequency for each value. ID numbers are important to keep, and illuminate your unit of analysis, but are not analyzed. Otherwise, it is likely a continuous variable and you should use a histogram to explore it further.
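The step 1 checks can be roughed out in Python as follows. This is only a sketch: the function name `suggest_first_look`, the hard cutoff of 10 distinct values, and the "more than half numeric" rule are assumptions made for the example, and the result is a prompt for a closer look rather than a decision.

```python
from collections import Counter


def suggest_first_look(values, max_categories=10):
    """Suggest how to start exploring one variable, following step 1."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    if len(counts) <= max_categories:
        return "frequency table"
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) > len(present) / 2:
        # Every value occurring equally often hints at an ID number.
        if len(set(counts.values())) == 1:
            return "possible ID number"
        return "histogram"
    return "check for open-ended text or a large grouping variable"


# Made-up example variables.
answers = ["yes", "no", "yes", "no", "yes"]
record_ids = list(range(1, 51))
heights = [20.5, 21.0, 21.0, 22.3, 23.1, 24.8, 25.0, 26.7, 27.2, 28.9, 30.1, 31.4]
states = ["VA", "MD", "DC", "NY", "NJ", "PA", "OH", "TX", "CA", "WA", "OR", "FL"]

for name, column in [("answers", answers), ("record_ids", record_ids),
                     ("heights", heights), ("states", states)]:
    print(name, "->", suggest_first_look(column))
```

A continuous variable in which every value happens to be unique would also be flagged as a possible ID number, so always confirm against the documentation.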

2. Possible Values ► Measurement Level and Data Prep

  • Look at each possible value, the first column in the table. Ensure you know what it means, checking the documentation or questionnaire if not. Discrepancies could indicate an error, or lack of necessary labels or formatting (e.g., dates). 
    • Dates, Times, Latitude/Longitude? Look or ask for guidance on these special types. 
    • If you see a number, does it represent a count, a continuous measurement, or an unlabeled group?
    • If you see text, is it stored as shown, or is it a label for an underlying numerical value?
  • Look at all the values together.
    • Is there an ordering? If so, is every value part of that order and in the right "place"?
    • Do any values represent missing values or non-answers? These may be labeled as such, or may use codes such as 9, 99, etc. or -1, -9, etc.
    • Do any of the values suggest that another variable might have relevant information to integrate (e.g., "other", "specify", or the lack of all possible values)?
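Once the documentation confirms which values stand for missing data or non-answers, they can be set aside before analysis. The following Python sketch assumes a hypothetical 1-5 agreement item whose codebook defines 9 as "No answer" and -1 as "Skipped"; the helper `recode_missing` and the data are invented for illustration.

```python
def recode_missing(values, missing_codes):
    """Replace documented missing-value codes with None."""
    return [None if v in missing_codes else v for v in values]


# Hypothetical item: 1-5 are real answers, 9 = "No answer", -1 = "Skipped".
raw = [1, 4, 9, 3, -1, 5, 2, 9]
clean = recode_missing(raw, missing_codes={9, -1})
valid = [v for v in clean if v is not None]

print(clean)
print(len(valid), "valid of", len(raw))
```

Only recode values the codebook identifies as missing: on a 0-10 scale, for example, a 9 would be a legitimate answer.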

3. Frequencies ► Analysis Potential

  • Look at percentages of cases in each category (of valid cases, if provided). If there is not an overall order and any are 80%+ (especially 90%+), the variable may not be easy to analyze, so get advice if you really want to include it. If any are < 5%, consider whether it would make sense to combine that value with others. For statistical testing there would typically be 2 or 3 groups, each with 20%+ of the data. However, 4 or 5 groups, each with 10%+ of the data, is okay when necessary.
  • Look for the total valid values. If your table does not show any missing values, does the total match the number of records? If you see labels that represent missing values (or 9/99/-1), ensure that the output separates them from the valid values; if it does not, the software may not be recognizing them as missing.
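The percentage checks in step 3 can be sketched in Python as below. The functions `frequency_table` and `flag_categories`, the example data, and the use of 9 as the missing code are assumptions for illustration; the 80% and 5% thresholds are the rules of thumb given above, and the 80% one is meant for variables without an overall order.

```python
from collections import Counter


def frequency_table(values, missing_codes=()):
    """Return (rows, n_valid, n_missing) with counts and valid percentages."""
    valid = [v for v in values if v is not None and v not in missing_codes]
    counts = Counter(valid)
    rows = [
        {"value": value, "count": count, "valid_pct": 100 * count / len(valid)}
        for value, count in sorted(counts.items())
    ]
    return rows, len(valid), len(values) - len(valid)


def flag_categories(rows, dominant=80, sparse=5):
    """Flag categories holding 80%+ or under 5% of the valid cases."""
    flags = {}
    for row in rows:
        if row["valid_pct"] >= dominant:
            flags[row["value"]] = "dominant"
        elif row["valid_pct"] < sparse:
            flags[row["value"]] = "sparse"
    return flags


# Made-up coded responses: 1, 2, 3 are groups and 9 means "No answer".
responses = [1] * 34 + [2] * 5 + [3] * 1 + [9] * 10
rows, n_valid, n_missing = frequency_table(responses, missing_codes={9})
flags = flag_categories(rows)

for row in rows:
    print(row)
print("valid:", n_valid, "missing:", n_missing, "flags:", flags)
```

Checking that the valid and missing counts add up to the number of records is a quick way to confirm the missing codes were separated out.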

Decision Time: Can you analyze it?

  • Does the variable address the appropriate construct for your research question?
  • Is there enough valid data for the intended analysis?
  • Will you treat it as continuous, categorical, non-parametric, or as fixed factors?

 

Tasks

Tasks to perform with a new dataset

  1. Read - Represent the data accurately in your statistical software
  2. Clean/Tidy - Create a single tidy and well-formatted dataset with no errors or invalid data
  3. Prepare/Add - Calculate, categorize variables (e.g., factor analysis) or values, combine variables or values, and create indexes
  4. Refine/Remove - Keep only the necessary variables and observations, and impute missing values where appropriate
  5. Visualize - Create graphs and/or tables to check and report on distributions, first univariate then bivariate
  6. Analyze - Perform statistical testing or modeling (see another list)