InfoGuides: QUANTitative Analysis & Statistics: FAQs

What Data Format Should I Download?

Many sources and software offer downloads in multiple formats. For example, Qualtrics can export in SPSS (.sav) format.
Choose a statistical software format when possible to include labels and other metadata.
Major statistical software packages can import data from a wide variety of file formats.
Files with a .zip or .tar.gz extension are compressed archives and must be decompressed before the data can be imported.
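Most operating systems can decompress these files with a double-click, but the step can also be scripted. The following is a minimal Python sketch using only the standard library; the helper name `unpack` and the file names are made up for the demonstration, and the archive is created on the fly so the example can run anywhere.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def unpack(archive_path, dest_dir):
    """Extract a .zip or .tar.gz archive and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # shutil infers the archive format from the file extension
    shutil.unpack_archive(str(archive_path), str(dest))
    return sorted(p for p in dest.rglob("*") if p.is_file())

# Demonstration with a throwaway archive (hypothetical file names)
work = Path(tempfile.mkdtemp())
demo_zip = work / "survey_download.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("survey.csv", "id,age\n1,34\n")
    zf.writestr("codebook.txt", "age: age in years\n")

extracted = unpack(demo_zip, work / "unzipped")
extracted_names = [p.name for p in extracted]
print(extracted_names)
```

Only unpack archives from sources you trust, and look for a codebook or setup file alongside the data once the archive is open.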

What File Format to Choose?

Your statistical software -- Reduces possible errors
Any statistical software -- Includes labels and other metadata
- SPSS (.sav), Stata (.dta), SAS (.sas7bdat), R(.rdata), Jamovi (.omv), etc.
Fixed Format with a Setup File **
- Setup Script: SPSS (.sps), Stata (.do and .dct), SAS (.sas), R(.R)
- Data File: Text or ASCII file (.dat or .txt) -- will not have variable names at the top
Spreadsheet format
- "Delimited" files: Comma Separated Values (.csv) or Tab Separated Values (.tsv or .tab)
- Excel or the equivalent (e.g., .xls, .xlsx)
- May be labeled ASCII or given as .dat or .txt (but without a Setup Script file, unlike the Fixed Format option above)
Other formats:
- Structured Text: JSON (.json), XML (.xml)
- Many more!

** Requires a Setup Script, Data file, and access to the designated software. Look for instructions in the Script file explaining how to specify the location of the datafile and "run" the script. For assistance, talk to your local Data Librarian.
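To give a sense of what a setup script does with a fixed-format data file, here is a small Python sketch that cuts each line into variables by column position. The layout (variable names and positions) and the sample lines are invented for illustration; real positions come from the setup script or codebook, which often count columns starting at 1 rather than 0.

```python
# Hypothetical layout: (variable name, start, end), 0-based and end-exclusive
LAYOUT = [("id", 0, 4), ("age", 4, 6), ("state", 6, 8)]

def parse_fixed_width(lines, layout):
    """Slice each line into named fields according to the layout."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        rows.append({name: line[start:end].strip() for name, start, end in layout})
    return rows

# Made-up sample lines; note there is no header row in this kind of file
sample_lines = ["000134NY\n", "000229CA\n"]
records = parse_fixed_width(sample_lines, LAYOUT)
print(records)
```

Every value comes back as text here. A real setup script would also assign types, variable labels, and missing-value codes, which is why using the provided script in its designated software is usually the safer route.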

Common Import Problems

Problems can appear when you first open a data file, especially if its format differs from your software's native format. Check for these common issues.

String fields may get truncated (lose text) if the software guesses the field size incorrectly, or look garbled if the wrong character encoding is used (e.g., ASCII vs UTF-8). Check fields that may have lots of text (e.g., open-ended answers) for complete content. If there is a problem, you will have to re-import, specifying the correct settings.
Variable Names may be changed to be valid. Spaces will be removed or converted to underscores. Alternatively, the name could be changed to something generic, like V1 or X. Specify names during import, or rename afterwards.
Special Field Types
- Date fields may not be parsed or interpreted correctly. Check missing values and spot-check the year against the original source.
- Numeric fields with non-number characters (including $, %, &, or .) may be imported as strings, or may cause the entire value to be set as missing.
- If necessary, import these fields as strings first, then use the appropriate software functions to convert them to the proper type. When collecting primary data, set up the fields properly from the start to avoid these issues.
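The "import as strings, then convert" approach for numeric fields might look like the following Python sketch. The function name `to_number` and the list of symbols it strips are assumptions for this example; adjust them to what actually appears in your data.

```python
def to_number(text):
    """Convert a string such as '$1,250.50' or '45%' to a float.

    Returns None when the value is empty or cannot be read as a number,
    so unparseable entries stay visible as missing instead of crashing.
    """
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    if cleaned == "":
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None

# Made-up raw values as they might arrive from an import
raw_values = ["$1,250.50", "45%", " 300 ", "n/a", ""]
converted = [to_number(v) for v in raw_values]
print(converted)
```

After converting, compare the count of missing values before and after so that you know how many entries could not be parsed, and inspect those entries by hand.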

Additional Import Problems for delimited (e.g., CSV) or other text files:

Variables not separated properly into columns. This may only affect a few observations, but leads to obviously incorrect values in some variables. One observation might also be split across multiple rows. Check frequency tables for excess missing values or unexplainable values. In most cases, you will have to re-import with the proper settings to fix this.
Observations might disappear due to file corruption, or extra rows (with mostly missing values) could be included due to headers, notes, or summaries. Confirm that the number of observations matches expectations. If there are extra rows, they can be deleted. If the file is corrupted, you may have to re-obtain it.
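One quick way to spot both problems before importing is to count the fields in every row and compare them against the header. The Python sketch below does this with the standard `csv` module; the function name and the sample text are invented for the example.

```python
import csv
import io

def field_count_report(text, delimiter=","):
    """Return (expected field count, number of data rows, list of bad rows).

    Each bad row is reported as (line number in the file, fields found).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    n_rows = 0
    bad_rows = []
    for row in reader:
        n_rows += 1
        if len(row) != expected:
            bad_rows.append((reader.line_num, len(row)))
    return expected, n_rows, bad_rows

# Made-up file contents: one unquoted extra comma and one trailing note line
sample_csv = (
    "id,name,score\n"
    "1,Ann,90\n"
    "2,Bob,Jr.,85\n"
    '3,"Lee, Kim",70\n'
    "Source: example notes\n"
)
expected, n_rows, bad_rows = field_count_report(sample_csv)
print(expected, n_rows, bad_rows)
```

In this sample the quoted value "Lee, Kim" is read correctly as one field, while the unquoted "Bob,Jr." row and the trailing note line are flagged for review.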

What Do I Look For In A Frequency Table?

Unless you know a variable is continuous or open-ended text, generate a frequency table or a bar chart. The question label alone does not tell you the variable type (e.g., age could be measured in any number of ways). Here is what to look for:

1. Overall Length and Values ► Right Track?

Is a frequency table appropriate? What do you see in the possible values overall?

≈10 values or fewer? A frequency table or bar chart is a good start. Go to step 2.
Otherwise, look at some of the values. Are they...
- Mostly text? Check the values and/or documentation to determine whether this represents open-ended text (if so, see instructions for that variable type). Otherwise it may be a large grouping variable (like Occupation, State, or Country). You may simply want to report the most frequent groups. But, if it makes sense to reduce the number of categories for analysis, then do so and re-evaluate. Otherwise, this should have been anticipated and you would have a plan for it.
- Mostly numbers? First determine whether this might represent an ID number, especially if there is a consistent frequency for each value. ID numbers are important to keep, and illuminate your unit of analysis, but are not analyzed. Otherwise, it is likely a continuous variable and you should use a histogram to explore it further.
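The first-look decisions in this step can be roughed out in code. The Python sketch below is only a heuristic under stated assumptions: the cutoff of 10 categories, the 50% "mostly numbers" threshold, and the function names are choices made for this example, and the result is a prompt for what to check next, not a verdict.

```python
from collections import Counter

def looks_numeric(value):
    """True if the value can be read as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def first_look(values, max_categories=10):
    """Suggest a next step based on the distinct values of one variable."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        return "frequency table"
    numeric_share = sum(looks_numeric(v) for v in counts) / len(counts)
    if numeric_share < 0.5:
        return "text: open-ended or large grouping variable"
    if len(set(counts.values())) == 1:
        # every value occurs equally often, as ID numbers typically do
        return "possible ID number"
    return "continuous: use a histogram"

# Made-up example variables
answers = ["yes", "no", "yes", "no"]
ids = [str(1000 + i) for i in range(50)]
ages = [str(18 + i % 30) for i in range(100)]
comments = [f"comment {i}" for i in range(15)]

for name, column in [("answers", answers), ("ids", ids),
                     ("ages", ages), ("comments", comments)]:
    print(name, "->", first_look(column))
```

Always confirm the suggestion against the documentation; for instance, a continuous measurement in a small file can also have every value occur exactly once.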

2. Possible Values ► Measurement Level and Data Prep

Look at each possible value, the first column in the table or the labels for the bars. Ensure you know what it means, checking the documentation or questionnaire if not. Discrepancies could indicate an error, or lack of necessary labels or formatting (e.g., dates).
- Dates, Times, Latitude/Longitude? Look or ask for guidance on these special types.
- If you see a number, does it represent a count, a continuous measurement, or an unlabeled group?
- If you see text, is it stored as shown, or is it a label for an underlying numerical value?
Look at all the values together.
- Is there an ordering? If so, is every value part of that order and in the right "place"?
- Do any values represent missings or non-answers? These may be labeled as such and/or use the values 9, 99, etc. or -1, -9, etc.
- Do any of the values suggest that another variable might have relevant information to integrate (e.g., "other", "specify", or the lack of all possible values)?

3. Frequencies ► Analysis Potential

Look at percentages of cases in each category (of valid cases, if provided). Statistical tests are easiest to interpret with 2 or 3 groups, each holding at least 20% of the data.
- If you have an ordinal variable, it is easiest to have 3 to 5 groups in which the frequencies are normally or uniformly distributed.
- If you have a nominal variable (no ordering), combine groups where possible. If you cannot have 5 or fewer groups, each with at least 5% of observations, or if any of those groups have fewer than 10 observations, consult with an experienced researcher.
Look for the total valid values. If your table does not show any missing values, does the total match the number of records? If you see labels that represent missing values (or 9/99/-1), confirm that the output lists them separately from the valid values. If it does not, the software is not treating them as missing.
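The checks in this step can be sketched as a small frequency table that sets missing codes aside and reports valid percentages. The missing codes below are hypothetical; take the real ones from your codebook, since a value like 9 is a legitimate answer on some scales.

```python
from collections import Counter

# Hypothetical missing-value codes; confirm against the codebook
MISSING_CODES = {"", "9", "99", "-1", "-9"}

def frequency_table(values, missing_codes=MISSING_CODES):
    """Return ({value: (count, valid percent)}, n valid, n missing)."""
    valid = [v for v in values if v not in missing_codes]
    counts = Counter(valid)
    n_valid = len(valid)
    table = {value: (n, 100 * n / n_valid) for value, n in counts.most_common()}
    return table, n_valid, len(values) - n_valid

# Made-up responses coded as text
responses = ["1", "2", "1", "9", "1", "2", "99", ""]
table, n_valid, n_missing = frequency_table(responses)

for value, (n, pct) in table.items():
    print(f"{value}: n={n}, valid %={pct:.1f}")
print("valid:", n_valid, "missing:", n_missing, "total:", len(responses))
```

The valid and missing counts should add up to the number of records. If they do not, or if a missing code shows up among the valid categories, revisit the import settings or the list of codes.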

Decision Time: Can you analyze it?

Does the variable address the appropriate construct for your research question?
Is there enough valid data for the intended analysis?
Will you treat it as continuous, categorical, non-parametric, or as fixed factors?

What Statistical Test Should I Do?

Statistical Tests: Choosing (~10 min) and more Practice Examples (~10 min) Dr Nic's Maths and Stats
- One-Sample, Independent, and Paired tests of means and proportions; Chi-square, Regression
Bivariate Analysis (~5 min) MMU Q-Step
- p-values, chi-square, t-test, ANOVA, post-hoc tests, correlation.
- Alternative: Statistics in 10 minutes (~10 min) Global Health with Greg Martin.
Statkat - Chooser, table, and practice exercises for all common analyses up to logistic regression

What Do You Need to Know? (for a Consultation)

When I have a consultation with someone, this is the information I want to know. Each greatly affects the future of the project and what needs to be done. So, it is important that you consider these issues ahead of time yourself.

1. Project

What are the characteristics of the data file that will be needed to answer the question(s) you have, given the complexity of the analysis you plan to use?

What type of analysis are you doing?

Depending on requirements related to your field and the type of project you are completing (e.g., dissertation), analyses could include: a) Descriptive Statistics, b) Bivariate Inferential Statistics, c) Standard multivariate analyses such as ANOVA and Multiple Regression (Linear or Logistic), or d) other Modeling (including GLM, Mixed Models, or SEM). This will influence how much data is needed, necessary software, and the criteria the data must meet.

What are your research questions?

It is useful to be able to conceptualize the ultimate data table you will need and even create it with some example data. The research questions should define a) the unit of analysis, b) the population (observations), and c) the constructs (variables).

If you do not have research questions, that usually means you are expected to come up with a research question you can answer with the data you have. If so, consider which of the topics addressed by the data interest you. You will want to choose two topics and look for relationships between the values or group differences.

2. Observations

Is the data you have capable of being the data you need?

What is the unit of observation?

Is the unit of observation a person, a place, or something else? Is it the same as the unit of analysis? Find the variable(s) that contain unique identifiers and determine their meaning. If you have multiple data tables/files, is the unit of observation the same, and how will the rows match up? You can aggregate your data to make the units bigger, but you cannot do the opposite. Are there repeated measures, multi-level, time series, or panel data? If so, you might need to reshape the data or use special analyses.
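To make the point about aggregation concrete, the Python sketch below rolls person-level records up to the state level. The records, variable names, and function name are invented for the example.

```python
from collections import defaultdict

# Made-up person-level data: one row per person
people = [
    {"person_id": 1, "state": "NY", "income": 50000},
    {"person_id": 2, "state": "NY", "income": 70000},
    {"person_id": 3, "state": "CA", "income": 60000},
]

def aggregate_mean(rows, group_key, value_key):
    """Collapse rows to one record per group with a count and a mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"n": len(vals), "mean": sum(vals) / len(vals)}
        for group, vals in groups.items()
    }

by_state = aggregate_mean(people, "state", "income")
print(by_state)
```

Going from people to states is straightforward, but the state-level table no longer contains individual incomes, so person-level data cannot be recovered from it.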

How [well] does the data reflect the target population?

How much of the data is actually from your target population? Both primary and secondary data may have missing data or extra observations that are not relevant, but it is important to end up with a sufficient number of observations. Also, what type of sampling method was used, if any? Were snowball techniques or complex sampling methods used to select the sample? Do you even want to generalize from your sample to the population, and if so, how much work will that be?

3. Data Files

What data importing and cleaning steps will you need to do to even get started? Do you have enough time and knowledge to work with your data?

Where did the data come from?

The data source gives a good idea of the data quality and how much time and effort will be needed to process it. Large, reputable data collectors will usually have clean data with lots of documentation. Smaller sources might need more checking for data errors and may not have all the information necessary. If you have collected your own data, you may not need to narrow down the variables needed, but there are various cleaning steps you will need to take depending on the mode of collection (e.g., mail vs online), population (e.g., paid vs volunteer), and other factors.

What format is the data in?

Is it on paper, in Excel, in an SPSS file, or something else? Is there just one file/table, or multiple? The data format affects how quickly you will be able to use the data in your statistical software. Recent versions of all major software packages can directly open files from each other (and Qualtrics can export to SPSS format), but conversions always require checking. Older or smaller data sets may come in a text format, such as CSV or fixed format, that is more difficult to open and does not include labels.

How Can I Get More Help?

Still need help? Ask the Community

These are also great sources to search for others who have the same issues. Definitely read some questions and answers before posting yourself to understand what details you need to include. It takes time to ask a good question that will receive a good answer.

General Statistics

How to ask a "good" question on CrossValidated?
Cross Validated - Q&A for Statistics
Reddit / AskStatistics - Not for homework help.

Software-Focused

Stack Overflow - For programmers, has tags for each software
Stata: StataList
SAS: SAS Support Community
R or Python: Data Science Learning Community Slack group
R: Posit Community - RStudio, tidyverse, Shiny, and Quarto/RMarkdown