InfoGuides: QUANTitative Analysis & Statistics: Planning

When I have a consultation with someone, these are the things I want to know. Each one greatly affects the future of the project and what needs to be done, so it is important that you consider these issues yourself ahead of time.

1. Project

What are the characteristics of the data file that will be needed to answer the question(s) you have, given the complexity of the analysis you plan to use?

What type of analysis are you doing? Depending on requirements related to your field and the type of project you are completing (e.g., dissertation), analyses could include: a) Descriptive Statistics, b) Bivariate Inferential Statistics, c) Standard multivariate analyses such as ANOVA and Multiple Regression (Linear or Logistic), or d) other Modeling (including GLM, Mixed Models, or SEM). This will influence how much data is needed, necessary software, and the criteria the data must meet.
What are your research questions? It is useful to be able to conceptualize the ultimate data table you will need and even create it with some example data. The research questions should define a) the unit of analysis, b) the population (observations), and c) the constructs (variables).
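As a minimal sketch of mocking up the ultimate data table, the Python snippet below builds a tiny example table for a made-up research question. The question, the variable names (`student_id`, `hours_studied`, `exam_score`), and the values are all invented for illustration; the point is only that each row is one unit of analysis and each column is one construct.

```python
# Hypothetical research question: do hours studied predict exam score
# among first-year students? Unit of analysis: one student per row.
# All names and values below are invented example data.
example_rows = [
    {"student_id": 1, "hours_studied": 5.0, "exam_score": 72},
    {"student_id": 2, "hours_studied": 9.5, "exam_score": 88},
    {"student_id": 3, "hours_studied": 2.0, "exam_score": 61},
]

# The columns are the constructs (variables) the question requires.
columns = list(example_rows[0].keys())


def format_table(rows, columns):
    """Return the rows as a plain-text table, one line per observation."""
    header = "  ".join(f"{c:>13}" for c in columns)
    body = ["  ".join(f"{str(r[c]):>13}" for c in columns) for r in rows]
    return "\n".join([header] + body)


print(format_table(example_rows, columns))
```

If you cannot fill in even a few plausible rows like this, the research question probably does not yet pin down the unit of analysis or the variables.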

2. Observations

Is the data you have capable of being the data you need?

What is the unit of observation? Is the unit of observation a person, a place, or something else? Is it the same as the unit of analysis? If you have multiple data tables/files, is the unit of observation the same in each, and how will the rows match up? You can aggregate your data to make the units bigger (e.g., people into households), but you cannot split aggregated data back into smaller units. Are there repeated measures, multi-level, time series, or panel data? If so, you might need to reshape the data or use special analyses.
How well does the data reflect the target population? How much of the data is actually from your target population? Both primary and secondary data may have missing data or extra observations that are not relevant, but it is important to end up with a sufficient number of observations. Also, what type of sampling method was used, if any? Were snowball techniques or complex sampling methods used to select the sample? Do you even want to generalize from your sample to the population, and if so, how much work will that be?
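The two ideas above, keeping only observations from the target population and aggregating to a bigger unit, can be sketched in a few lines of Python. The records, the county codes, and the "adults aged 18 or older" target population are assumptions made up for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented person-level records (unit of observation: person).
people = [
    {"person_id": 1, "county": "A", "age": 34, "income": 42000},
    {"person_id": 2, "county": "A", "age": 17, "income": 0},
    {"person_id": 3, "county": "B", "age": 51, "income": 58000},
    {"person_id": 4, "county": "B", "age": 29, "income": 36000},
    {"person_id": 5, "county": "A", "age": 45, "income": 50000},
]

# Keep only the assumed target population: adults aged 18 or older.
adults = [p for p in people if p["age"] >= 18]
print(f"{len(adults)} of {len(people)} observations are in the target population")

# Aggregate up to the county level (unit of analysis: county).
incomes_by_county = defaultdict(list)
for p in adults:
    incomes_by_county[p["county"]].append(p["income"])

county_table = [
    {"county": county, "n_adults": len(incomes), "mean_income": mean(incomes)}
    for county, incomes in sorted(incomes_by_county.items())
]
print(county_table)
```

Note that the county table cannot be turned back into person-level rows, which is why it matters to check the unit of observation before any aggregated file is all you have.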

3. Data Files

What data importing and cleaning steps will you need to do to even get started? Do you have enough time and knowledge to work with your data?

Where did the data come from? The data source gives a good idea of the data quality and how much time and effort will be needed to process it. Large, reputable data collectors will usually have clean data with lots of documentation. Smaller sources might need more checking for data errors and may not have all the information necessary. If you have collected your own data, you may not need to narrow down the variables needed, but there are various cleaning steps you will need to take depending on the mode of collection (e.g., mail vs online), population (e.g., paid vs volunteer), and other factors.
What format is the data in? Is it on paper, in Excel, a file in SPSS format, or something else? Is there just one file/table, or multiple? The data format affects how quickly you will be able to use the data in your statistical software. Recent versions of most major software packages can directly open each other's files (and Qualtrics can export to SPSS format), but conversions always require checking. Older or smaller datasets may even come in a text format, such as CSV or fixed-width format, that is more difficult to open and does not already have labels.
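To illustrate why fixed-width text files take more work, here is a minimal Python sketch that reads such lines using a codebook. The column positions, variable names, value labels, and data lines are all hypothetical; a real file would come with its own codebook that you would transcribe instead.

```python
# Hypothetical codebook: variable name -> (start, end) character positions,
# 0-based with an exclusive end, as Python slices expect.
codebook = {
    "id": (0, 3),
    "sex": (3, 4),
    "age": (4, 6),
}
# Hypothetical value labels, which a text file does not carry itself.
sex_labels = {"1": "male", "2": "female"}

# Invented fixed-width data lines: no delimiters and no header row.
raw_lines = [
    "001134",
    "002229",
    "003151",
]


def parse_fixed_width(line, codebook):
    """Slice one line into named string fields using the codebook."""
    return {name: line[start:end].strip() for name, (start, end) in codebook.items()}


records = []
for line in raw_lines:
    rec = parse_fixed_width(line, codebook)
    rec["age"] = int(rec["age"])                       # numbers arrive as text
    rec["sex"] = sex_labels.get(rec["sex"], "unknown")  # apply value labels
    records.append(rec)

for rec in records:
    print(rec)
```

Printing a few parsed records and comparing them against the documentation is a simple version of the checking that any conversion requires.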

Tasks to perform with a new dataset

Read - Represent the data accurately in your statistical software
Clean/Tidy - Create a single tidy and well-formatted dataset with no errors or invalid data
Prepare/Add - Calculate, categorize variables (e.g., factor analysis) or values, combine variables or values, and create indexes
Refine/Remove - Keep only the necessary variables and observations, and impute missing values where appropriate
Visualize - Create graphs and/or tables to check and report on distributions, first univariate then bivariate
Analyze - Perform statistical testing or modeling (see another list)
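The task list above can be sketched end to end on a toy dataset using only the Python standard library. Everything here is an assumption for illustration: the survey data, the variable names, the use of -99 as a missing-value code, and the choice to drop incomplete cases rather than impute them. In practice you would do these steps in your statistical software of choice.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Invented survey export; -99 is an assumed missing-value code.
raw_csv = """id,age,q1,q2
1,25,4,5
2,-99,3,4
3,41,5,5
3,41,5,5
4,37,2,1
"""

# Read: csv yields strings, so convert every field to an integer explicitly.
rows = [{k: int(v) for k, v in r.items()} for r in csv.DictReader(io.StringIO(raw_csv))]

# Clean/Tidy: drop exact duplicate rows and recode invalid ages as missing.
seen, clean = set(), []
for r in rows:
    key = tuple(r.items())
    if key in seen:
        continue
    seen.add(key)
    if r["age"] < 0:
        r["age"] = None
    clean.append(r)

# Prepare/Add: build a simple index (mean of q1 and q2) and an age category.
for r in clean:
    r["index"] = (r["q1"] + r["q2"]) / 2
    if r["age"] is None:
        r["age_group"] = None
    else:
        r["age_group"] = "under_35" if r["age"] < 35 else "35_plus"

# Refine/Remove: keep only the needed variables and complete cases
# (dropping instead of imputing is a simplification for this sketch).
analysis = [
    {"id": r["id"], "age_group": r["age_group"], "index": r["index"]}
    for r in clean
    if r["age"] is not None
]

# Visualize: a text frequency table stands in for a univariate bar chart.
counts = Counter(r["age_group"] for r in analysis)
for group, n in sorted(counts.items()):
    print(f"{group:<10} {'#' * n} ({n})")

# Analyze: a descriptive statistic as the simplest possible analysis.
overall_mean = mean(r["index"] for r in analysis)
print(f"mean index = {overall_mean:.2f}")
```

Keeping each step as a separate, commented block makes it easier to rerun the whole sequence when an error is found in an earlier step.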