1.2 Data Context
For any data set, always ask yourself a few questions that establish vital context about the data.
Who is in the data set? What is the observational unit? How did they end up in the data set? Were they selected randomly, or were they simply in a particular location at a particular time?
What is being measured or recorded on each unit? What are the characteristics, features, or variables that were collected?
Where were they collected? In one location? Multiple locations?
When were the data collected? At one point in time? Over time? If data quality can degrade over time (e.g., lab specimens), the timing is a concern.
How were they collected? What instruments and methods were used for measurement? What questions were asked and how? Online survey? By phone? In person?
Why were they collected? For profit? For academic research? Are there conflicts of interest?
Who collected this data? An agency, a consortium of researchers, an individual researcher?
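One way to make these questions routine is to record the answers alongside the data and check which ones are still open. The following is a minimal sketch in Python, not a standard tool: the `DataContext` class, its field names, and the example answers are all hypothetical and exist only to illustrate the checklist above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataContext:
    """Hypothetical record of the data context questions for one data set."""
    who: Optional[str] = None           # observational units and how they were selected
    what: Optional[str] = None          # variables measured on each unit
    where: Optional[str] = None         # location(s) of collection
    when: Optional[str] = None          # time point or period of collection
    how: Optional[str] = None           # instruments, methods, survey mode
    why: Optional[str] = None           # purpose and possible conflicts of interest
    collected_by: Optional[str] = None  # agency, consortium, or individual

    def unanswered(self) -> List[str]:
        """Return the names of the questions that have no recorded answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up answers for an imaginary survey; two questions are left open.
context = DataContext(
    who="Adults who volunteered at a single clinic (not a random sample)",
    what="Age, blood pressure, self-reported exercise",
    where="One clinic",
    when="A single visit per person",
    how="In-person interview plus a blood pressure cuff",
)

missing = context.unanswered()
print("Still unanswered:", missing)
```

Any question that remains unanswered is a prompt to go back to the documentation or the data provider before analyzing the data.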
Thinking about the data context informs how we analyze the data, what conclusions we can draw, and whether we can generalize those conclusions to a larger population.
Many of these data context questions also point to general threats to data quality, which typically arise through sampling, information bias, and study design.