PPIRS International Data Workshop - Session 1

This is a research guide designed to complement the presentation given by Jennifer Huck to PPIRS on Oct. 22, 2020.

Data versus Statistics

What is the difference between data and statistics?

  • Raw data are the output of measurement or observation.
  • Raw data can be analyzed using statistical software.
  • In the social sciences, a dataset most commonly has variables (columns in a spreadsheet) and observations (rows).


  • Statistics are the output of analyzing raw data.  This might be a table of counts or a summary of the data (e.g., a mean).  Graphs also frequently display statistical output.
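
To make the distinction concrete, here is a minimal sketch in Python. The survey records, variable names, and values are all made up for illustration. Each dictionary is one observation (a row), each key is a variable (a column), and the statistics are the summaries computed from them.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw data: each dict is one observation (row),
# and each key is a variable (column). Values are invented.
raw_data = [
    {"respondent_id": 1, "age": 34, "voted": "yes"},
    {"respondent_id": 2, "age": 51, "voted": "no"},
    {"respondent_id": 3, "age": 27, "voted": "yes"},
    {"respondent_id": 4, "age": 44, "voted": "yes"},
]

# Statistics: summaries produced by analyzing the raw data.
mean_age = mean(row["age"] for row in raw_data)          # a single summary value
vote_counts = Counter(row["voted"] for row in raw_data)  # a table of counts

print("Mean age:", mean_age)
print("Vote counts:", dict(vote_counts))
```

The list of records is the data; the mean and the table of counts are the statistics.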

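That is a black-and-white description of the difference between data and statistics.  In reality there is a lot of gray area between the two, and where the line falls mostly depends on what the researcher is trying to do.  For example, a table of county-level counts is statistical output for one researcher, but it can be the raw data for another researcher who analyzes it further.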

FYI – I am mostly focusing on quantitative data in today's workshop. 

Different types of data

I really like the types of data defined by Keller, Lancaster, & Shipp (2017):

"Designed Data are generated in the pursuit of scientific discovery. Designed data include statistically designed data collections, such as data generated from: surveys, experimental designs, registries, and intentional observational collections.
Administrative Data, also referred to as “business practice” data, are collected for the administration of an organization, program, or service processes. These data provide an opportunity for gathering information that exists due to normal economic and social activity. Examples of administrative data include Internal Revenue Service data for individuals and businesses, Social Security earnings records, patent and trademark databases, Medicare and Medicaid health utilization data, banking and other financial data, industrial production processes, such as tracking supply chains end-to-end (Pires et al. 2017), taxi trip data, and local data generated from 911 calls and Emergency Management Services (EMS) responses, property assessment and tax data, and data from health and human services, parks and recreation, libraries, and environmental services (e.g., trash and recycling, water and utilities, projects and planning, transportation, and building permits).
Opportunity Data are data generated on an ongoing basis as society moves through its daily paces. Opportunity data are derived from a variety of sources such as GPS systems and embedded sensors, social media exchanges, mobile and wearable devices, and Internet entries. Captured through a variety of methods including direct flows, Internet searches, web crawling and scraping, these data may exist in a variety of electronic and physical modalities.
Procedural Data are data derived from policies, procedures, and legal requirements; they are the rules and regulations that govern and shape our lives. These policies and procedures affect our work, our personal lives, and society. Examples of procedural data include compensation policies, the Affordable Care Act, the Department of Defense policy “Don't Ask, Don't Tell,” and Supreme Court rulings."

Sallie Keller, Vicki Lancaster & Stephanie Shipp (2017) Building Capacity for Data-Driven Governance: Creating a New Foundation for Democracy. Statistics and Public Policy, 4:1, 1-11, DOI: 10.1080/2330443X.2017.1374897.

Characteristics of Social Science Data

These are questions that you need to ask your patron in advance.  All or only some might apply.

  • Time period – what years, for example 2000-present
  • Frequency – how frequently the data were collected, for example annually or daily
  • Geography – for example, a nation, or all counties in a state
  • Unit of analysis – the subject of the study.  For example:
    • A survey asking about people’s political participation, plus other information about each individual such as demographics and values → the unit of analysis is individuals
    • The number of votes presidential candidates received in each county in an election → the unit of analysis is counties
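
The characteristics above can be sketched in a few lines of Python. This is only an illustration: the records, county names, and the helper `voters_by_county` are hypothetical. The input has one row per individual. Filtering on a year (time period) and grouping by county (geography) produces a new dataset whose unit of analysis is the county.

```python
from collections import defaultdict

# Hypothetical survey records: the unit of analysis is the individual.
individuals = [
    {"person_id": 1, "county": "Albemarle", "year": 2016, "voted": True},
    {"person_id": 2, "county": "Albemarle", "year": 2016, "voted": False},
    {"person_id": 3, "county": "Fairfax", "year": 2016, "voted": True},
    {"person_id": 4, "county": "Fairfax", "year": 2020, "voted": True},
    {"person_id": 5, "county": "Fairfax", "year": 2020, "voted": True},
]


def voters_by_county(records, year):
    """Count respondents who voted, per county, for one year.

    The result has one entry per county, so the unit of analysis
    changes from individuals to counties.
    """
    counts = defaultdict(int)
    for record in records:
        if record["year"] == year and record["voted"]:
            counts[record["county"]] += 1
    return dict(counts)


county_2016 = voters_by_county(individuals, 2016)
county_2020 = voters_by_county(individuals, 2020)
print(county_2016)
print(county_2020)
```

Knowing the unit of analysis in advance matters because you can usually aggregate individuals up to counties, as here, but you cannot recover individuals from county totals.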

Reviewing Datasets

What are the kinds of things I look for when I’m reviewing a new dataset?

  • Access – are the data open and readily available?  Open but requiring a data use agreement?  Or are they proprietary, likely requiring a fee plus a signed data use agreement and/or license?
  • Format – what format is the dataset in?  Is it already in a format that will enable easy import into statistical software?
  • Documentation – what kind of methods documents are available?  This is especially important when looking at secondary datasets, especially surveys.  The documentation should answer questions like: how did the researchers collect the data?  What do the variables represent?  How do the researchers treat categorical variables, or missing data?
    • ALWAYS read the documentation.
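
As a rough illustration of why format and documentation matter, here is a minimal Python sketch of a first-pass check on a new file. Everything in it is hypothetical: the CSV contents, the variable names, and especially the missing-value codes. In practice those codes come from the dataset's codebook, which is one reason to read the documentation before analyzing anything.

```python
import csv
import io

# Hypothetical CSV extract. A real file would be opened with
# open(path, newline="") instead of io.StringIO.
sample_csv = io.StringIO(
    "id,age,party\n"
    "1,34,D\n"
    "2,-99,R\n"
    "3,27,\n"
    "4,44,I\n"
)

# Assumption: the (hypothetical) codebook says blank cells and -99
# both mean "missing". Always confirm this in the documentation.
MISSING_CODES = {"", "-99"}


def summarize_missing(file_obj, missing_codes):
    """Return the number of rows and a per-variable count of missing values."""
    reader = csv.DictReader(file_obj)
    rows = list(reader)
    missing = {name: 0 for name in reader.fieldnames}
    for row in rows:
        for name in reader.fieldnames:
            value = (row.get(name) or "").strip()
            if value in missing_codes:
                missing[name] += 1
    return len(rows), missing


n_rows, missing = summarize_missing(sample_csv, MISSING_CODES)
print("Observations:", n_rows)
print("Missing values per variable:", missing)
```

Without the codebook, the -99 in the age column would look like a valid number and would distort any mean computed from it.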

Learn More