When I consult with someone, this is the information I want to know. Each item greatly affects the future of the project and what needs to be done, so it is important to consider these issues yourself ahead of time.
1. Project
What are the characteristics of the data file that will be needed to answer the question(s) you have, given the complexity of the analysis you plan to use?
- What type of analysis are you doing? Depending on the requirements of your field and the type of project you are completing, analyses could include: a) Descriptive Statistics, b) Bivariate Inferential Statistics, c) Multiple Regression (Linear or Logistic), or d) other Modeling (including Mixed or SEM). This will influence how much data is needed, as well as how strict the criteria are that the data must meet.
- What are your research questions? It is useful to be able to conceptualize the ultimate data table you will need and even create it with some example data. The research questions should define a) the unit of analysis, b) the population (observations), and c) the constructs (variables).
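As a rough illustration of mocking up the ultimate data table, here is a minimal Python sketch. The research question, the column names (`student_id`, `study_hours`, `exam_score`), the values, and the helper `check_table` are all hypothetical, chosen only to show one row per unit of analysis and one column per construct.

```python
# Hypothetical research question (an assumption for illustration):
# "Among first-year students, is weekly study time related to exam score?"
UNIT_OF_ANALYSIS = "student"                # one row per student
POPULATION = "first-year students"          # who the rows represent
CONSTRUCTS = ["study_hours", "exam_score"]  # one column per construct

# Made-up example rows; none of these values are real data.
mock_table = [
    {"student_id": 1, "study_hours": 5.0, "exam_score": 71},
    {"student_id": 2, "study_hours": 9.5, "exam_score": 84},
    {"student_id": 3, "study_hours": 2.0, "exam_score": 63},
]


def check_table(rows, id_column, constructs):
    """Return True if every row has a unique id and a value for every construct."""
    ids = [row[id_column] for row in rows]
    has_unique_ids = len(ids) == len(set(ids))
    has_all_columns = all(c in row for row in rows for c in constructs)
    return has_unique_ids and has_all_columns


print(check_table(mock_table, "student_id", CONSTRUCTS))
```

If you cannot fill in a table like this by hand, the research question probably does not yet pin down the unit of analysis or the constructs.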
2. Observations
Is the data you have capable of being the data you need?
- What is the unit of observation? Is the unit of observation a person, a place, or something else? Is it the same as the unit of analysis? If you have multiple data tables/files, is the unit of observation the same in each, and how will the observations match up? You can aggregate your data to make the units bigger, but you cannot do the opposite. It is also important to know whether you have repeated measures, multi-level, time series, or panel data, as you might need to reshape the data or use special analyses.
- How well does the data reflect the target population? How much of the data is actually from your target population, and how representative of that population is it? Both primary and secondary data may have missing data or extra observations that are not relevant, but it is important to end up with a sufficient number of observations.
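The two points above, dropping observations outside the target population and aggregating to a bigger unit, can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`county`, `age`, `income`), and the "adults aged 18 or older" target population are assumptions made up for the example.

```python
from collections import defaultdict

# Made-up person-level records; field names and values are hypothetical.
people = [
    {"person_id": 1, "county": "A", "age": 34, "income": 40000},
    {"person_id": 2, "county": "A", "age": 17, "income": 0},
    {"person_id": 3, "county": "B", "age": 52, "income": 60000},
    {"person_id": 4, "county": "B", "age": 45, "income": 50000},
]

# Keep only the (assumed) target population: adults aged 18 or older.
adults = [p for p in people if p["age"] >= 18]

# Aggregate from the person level up to the county level.
incomes_by_county = defaultdict(list)
for p in adults:
    incomes_by_county[p["county"]].append(p["income"])

county_table = {
    county: {"n_adults": len(values), "mean_income": sum(values) / len(values)}
    for county, values in incomes_by_county.items()
}

print(len(people), "records,", len(adults), "in the target population")
print(county_table)
```

Note that the county table cannot be turned back into person-level rows, which is why the unit of observation needs to be at least as fine as the unit of analysis.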
3. Data Files
What data importing and cleaning steps will you need to do to even get started?
- Where did the data come from? The data source gives a good idea of the data quality and how much time and effort will be needed to process it. Large, reputable data collectors will usually have clean data with lots of documentation. Data from smaller sources might need more checking for errors and may not come with all the necessary information. If you have collected your own data, there are various cleaning steps you will need to take depending on the mode of collection (e.g., mail vs online), population (e.g., paid vs volunteer), and other factors.
- What format is the data in? Is it on paper, in Excel, in an SPSS file, or something else? The data format affects how quickly you will be able to use the data in your statistical software. Recent versions of most major statistical packages can open each other's files directly (and Qualtrics can export to SPSS format). But older or smaller data sets may come in a text format, such as CSV or fixed format, which is more difficult to open and does not already have labels.
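To show why text formats take more work, here is a small Python sketch that reads CSV text and a fixed-format line using only the standard library. The file contents, the codebook in `value_labels`, and the column positions in `layout` are all made up for illustration; in practice they would come from the data documentation.

```python
import csv
import io

# Stand-in for a CSV file on disk; the contents are made up.
raw_csv = io.StringIO("id,sex,score\n1,1,71\n2,2,84\n3,1,63\n")

# Hypothetical codebook: CSV stores codes only, so labels are supplied by hand.
value_labels = {"sex": {"1": "Female", "2": "Male"}}

rows = []
for row in csv.DictReader(raw_csv):
    row["score"] = int(row["score"])  # CSV values arrive as text
    row["sex_label"] = value_labels["sex"][row["sex"]]
    rows.append(row)

# Fixed-format text has no delimiters; column positions come from the codebook.
fixed_line = "0012 84"
layout = {"id": (0, 3), "sex": (3, 4), "score": (5, 7)}  # assumed positions
record = {name: fixed_line[start:end] for name, (start, end) in layout.items()}

print(rows[0])
print(record)
```

A file saved in a statistical package's own format would normally carry these variable types and labels with it, which is the time savings described above.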