2017 – Monday

Know Your Data Quality

Message of the Day 

Data quality is the degree to which data meets the purposes and requirements of its use. Depending on the uses, good quality data may refer to complete, accurate, credible, consistent or “good enough” data.

Things to consider

What is data quality and how can we distinguish between good and bad data? How are the issues of data quality being addressed in various disciplines?

  • The most straightforward definition of data quality is the quality of the content (values) in one’s dataset. For example, if a dataset contains the names and addresses of customers, all names and addresses have to be recorded (data is complete), they have to correspond to the actual names and addresses (data is accurate), and all records have to be up to date (data is current).
  • The most common characteristics of data quality are completeness, validity, consistency, timeliness and accuracy. Additionally, data has to be useful (fit for purpose), documented, and reproducible / verifiable.
  • At least four activities affect the quality of data: modeling the world (deciding what to collect and how), collecting or generating data, storage / access, and formatting / transformation.
  • Assessing data quality requires disciplinary knowledge and is time-consuming.
  • Data quality issues include how to measure quality, how to track the lineage of data (provenance), when data is “good enough”, what happens when data is mixed and triangulated (especially high-quality and low-quality data), and how to use crowdsourcing for quality.
  • Data quality is the responsibility of both data providers and data curators: data providers ensure the quality of their individual datasets, while curators help the community with consistency, coverage and metadata.
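
The customer example above (names and addresses that should be complete and current) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`name`, `address`, `updated`) and the one-year threshold for "current" are all made up for the example. Accuracy is left out on purpose, since it can only be checked against the actual names and addresses, not from the dataset alone.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "Ada Park", "address": "12 Elm St", "updated": date(2017, 1, 15)},
    {"name": "Ben Ruiz", "address": "", "updated": date(2016, 11, 2)},
    {"name": "", "address": "7 Oak Ave", "updated": date(2012, 6, 30)},
]

# Assumption: these are the fields that must be filled in for a complete record.
REQUIRED = ("name", "address")


def is_complete(record):
    """A record is complete when every required field has a non-blank value."""
    return all(str(record.get(field) or "").strip() for field in REQUIRED)


def is_current(record, as_of, max_age_days):
    """A record is current when it was updated within max_age_days of as_of."""
    return (as_of - record["updated"]).days <= max_age_days


def quality_report(rows, as_of, max_age_days=365):
    """Return the share of complete and of current records (0.0 to 1.0)."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "currency": 0.0}
    complete = sum(is_complete(r) for r in rows)
    current = sum(is_current(r, as_of, max_age_days) for r in rows)
    return {"completeness": complete / total, "currency": current / total}


report = quality_report(records, as_of=date(2017, 2, 13))
print(report)
```

With these sample records, one of the three is complete and two of the three are current. What counts as "good enough" for either share depends on the intended use of the data.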

“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
― Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values

Stories 

Resources

Activities

  • Show your most recent dataset (or part of it) to a colleague and ask for their opinion of its quality (exchanging datasets with a colleague makes this activity more fun).
  • Use criteria for good data (e.g., completeness, accuracy, fitness for use, documentation) to assess where your data stands.
  • Discuss your approaches to data collection and the measures you took / could take to ensure the integrity and completeness of your data.
  • Discuss steps to address missing or incomplete data in the context of your research. Does it matter? How much does missing data affect the validity, reliability or trustworthiness of your conclusions?
  • Check out the Calling Bullshit Syllabus (e.g., Food Stamp Fraud or the Musician Mortality Case Study). What can we learn about data quality from these stories?
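
For the missing-data discussion, a possible first step is to measure how much is missing in each field before deciding whether it matters. The sketch below is a minimal example, assuming the data is a list of dictionaries and that `None` or a blank string means "missing"; the survey rows and the helper name `missing_rates` are hypothetical.

```python
def missing_rates(rows, fields):
    """Return, for each field, the fraction of rows where it is missing or blank."""
    rates = {}
    for field in fields:
        missing = sum(
            1
            for row in rows
            if row.get(field) is None or str(row.get(field)).strip() == ""
        )
        rates[field] = missing / len(rows) if rows else 0.0
    return rates


# Hypothetical survey responses; None and "" stand for missing answers.
survey = [
    {"age": 34, "income": 52000, "region": "North"},
    {"age": None, "income": 48000, "region": "South"},
    {"age": 29, "income": None, "region": ""},
    {"age": 41, "income": None, "region": "East"},
]

rates = missing_rates(survey, ["age", "income", "region"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A per-field rate like this only says how much is missing, not why. Whether the gaps threaten your conclusions still depends on the research context, which is the point of the discussion above.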