Data Quality Tests
When someone says "we validated the data," what does that mean? Data quality tests turn validation from opinion into repeatable, auditable checks that anyone can verify.
- Data quality tests = automated checks that run every time data flows through your pipeline
- Completeness, consistency, outliers—each dimension has specific rules
- 99.7% accuracy means 3 errors per 1,000 records; you decide if that's acceptable
- Tests should fail loudly so bad data never reaches decisions
- The goal: defensible claims. If you can't prove it, you can't claim it.
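The "fail loudly" idea above can be sketched as a gate that raises instead of letting a bad batch continue downstream. This is a minimal illustration, not a real library: the `DataQualityError` exception, `validate_batch` helper, and `has_sku` check are all hypothetical names.

```python
class DataQualityError(Exception):
    """Raised when a batch fails validation, halting the pipeline step."""


def validate_batch(records, checks):
    """Run every check against every record; raise on any violation."""
    failures = []
    for i, record in enumerate(records):
        for check in checks:
            if not check(record):
                failures.append((i, check.__name__))
    if failures:
        # Fail loudly: halt the pipeline instead of letting bad data
        # reach reports or decisions.
        raise DataQualityError(f"{len(failures)} check failure(s): {failures[:5]}")
    return records


def has_sku(record):
    """Hypothetical completeness check: the record must carry a SKU."""
    return bool(record.get("sku"))


# A clean batch passes through unchanged; a batch with a missing SKU
# raises DataQualityError before anything downstream sees it.
clean = validate_batch([{"sku": "A-1"}, {"sku": "A-2"}], [has_sku])
```

The point of raising rather than returning a status flag is that a caller cannot accidentally ignore the failure.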
Real-world example
A supplier sends you 75 product sheets with environmental claims.
You need to validate recycled content percentages, carbon footprints, and material types. Manually checking would take weeks. One error could mean greenwashing accusations.
- Completeness tests: Does every SKU have required fields? Missing = fail.
- Consistency tests: Do units match (kg vs lb)? Do percentages sum correctly?
- Outlier tests: Is recycled content >100%? Is the carbon footprint negative without an explanation?
- Cross-reference tests: Do material codes match your reference database?
- With 105+ tests, most issues get caught before the data reaches anyone.
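The four test families above can be sketched as small predicate functions over one product record. Everything here is an assumption for illustration: the field names (`sku`, `recycled_pct`, `carbon_kg`, `material_code`, `unit`, `composition`) and the stand-in reference set are not a real supplier schema.

```python
# Illustrative schema only; a real pipeline would load these from config
# and a reference database.
REQUIRED_FIELDS = {"sku", "recycled_pct", "carbon_kg", "material_code"}
KNOWN_MATERIALS = {"PET", "HDPE", "ALU"}  # stand-in reference database


def completeness(record):
    """Every required field must be present."""
    return REQUIRED_FIELDS <= record.keys()


def consistency(record):
    """Units must be metric; composition percentages must sum to ~100."""
    total = sum(record.get("composition", {}).values())
    return record.get("unit") == "kg" and abs(total - 100) < 0.01


def outliers(record):
    """Recycled content in 0-100; negative carbon needs an explanation."""
    pct_ok = 0 <= record.get("recycled_pct", -1) <= 100
    carbon_ok = record.get("carbon_kg", 0) >= 0 or bool(record.get("carbon_note"))
    return pct_ok and carbon_ok


def cross_reference(record):
    """Material code must exist in the reference set."""
    return record.get("material_code") in KNOWN_MATERIALS


def run_tests(record):
    """Run all four families and report pass/fail per test."""
    tests = [completeness, consistency, outliers, cross_reference]
    return {t.__name__: t(record) for t in tests}


good = {
    "sku": "A-1", "recycled_pct": 30, "carbon_kg": 2.5,
    "material_code": "PET", "unit": "kg",
    "composition": {"virgin": 70, "recycled": 30},
}
report = run_tests(good)  # every test returns True for this record
```

A record with `recycled_pct` of 120 would fail only the outlier test, which is exactly the granularity you want when routing records to a fix workflow.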
Automated tests turn "we checked it" into "here are the 105 checks that ran."
Data quality tests are automated checks that run on your data pipeline—every time data moves, transforms, or lands somewhere. They answer: Does this data meet the rules we agreed on?
- Completeness: Are required fields present? No blanks where blanks are not allowed.
- Consistency: Do units, formats, and references align across sources?
- Validity: Do values fall within expected ranges? (Percentages stay within 0–100; future dates are blocked.)
- Uniqueness: Are there duplicate records that shouldn't exist?
- Cross-source checks: Does this match reference data or external sources?
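The validity and uniqueness dimensions can be sketched as follows. The field names (`recycled_pct`, `reported_on`, `sku`) are illustrative assumptions, not a fixed schema.

```python
from datetime import date


def validity_errors(record):
    """Range checks: percentages in 0-100, report dates not in the future."""
    errors = []
    pct = record.get("recycled_pct")
    if pct is not None and not (0 <= pct <= 100):
        errors.append("recycled_pct out of range")
    reported = record.get("reported_on")
    if reported is not None and reported > date.today():
        errors.append("reported_on is in the future")
    return errors


def duplicate_skus(records):
    """Uniqueness check: flag any SKU that appears more than once."""
    seen, dupes = set(), set()
    for r in records:
        sku = r.get("sku")
        if sku in seen:
            dupes.add(sku)
        seen.add(sku)
    return dupes
```

Returning a list of error strings (rather than a bare boolean) keeps the reason for each failure attached to the record, which pays off when someone has to fix the data.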
Manual checks scale poorly and leave no audit trail. Automated tests run on every batch, every refresh, every integration. If a supplier changes their format or sends bad data, you find out immediately—before it contaminates reports or decisions.
Automated also means repeatable. You can show an auditor exactly which tests ran, when, and what passed or failed. That defensibility matters for compliance and greenwashing risk.
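An audit trail of this kind can be produced by having the test runner itself emit a structured log entry per test: which test, when it ran, and what failed. A minimal sketch, with hypothetical names throughout:

```python
from datetime import datetime, timezone


def audited_run(batch_id, records, checks):
    """Run checks over a batch and return an audit log:
    one entry per test, with a timestamp and the records that failed."""
    log = []
    for check in checks:
        failed = [r.get("sku") for r in records if not check(r)]
        log.append({
            "batch": batch_id,
            "test": check.__name__,
            "ran_at": datetime.now(timezone.utc).isoformat(),
            "passed": not failed,
            "failed_skus": failed,
        })
    return log
```

Persist these entries (to a table or an append-only log) and you can show an auditor exactly which tests ran on which batch, when, and with what result.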
When someone says "99.7% accuracy," they mean: of the records that went through validation, 99.7% passed all tests. The remaining 0.3% were flagged, quarantined, or corrected.
- Ask: Accuracy of what? (Input data? Output? After human review?)
- Ask: What's the cost of the 0.3%? (Low: typos. High: wrong environmental claims.)
- The right accuracy target depends on your risk tolerance and compliance needs.
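The arithmetic behind an accuracy claim is just a partition of the batch: records that pass every test versus records that get quarantined. A sketch (the `split_by_validation` name is illustrative):

```python
def split_by_validation(records, checks):
    """Partition records into passed vs quarantined; report the pass rate."""
    passed, quarantined = [], []
    for r in records:
        (passed if all(c(r) for c in checks) else quarantined).append(r)
    rate = len(passed) / len(records) if records else 1.0
    return passed, quarantined, rate


# 997 clean records and 3 records missing a SKU: a 99.7% pass rate,
# with the 3 failures quarantined rather than silently dropped.
recs = [{"sku": str(i)} for i in range(997)] + [{} for _ in range(3)]
passed, quarantined, rate = split_by_validation(recs, [lambda r: "sku" in r])
```

Note that the rate alone says nothing about the cost of the 0.3%; the quarantined records are what you inspect to answer that question.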
Common pitfalls
- Trusting data without knowing which tests ran
- Writing tests that always pass (too loose) or always fail (too strict)
- Treating "automated" as "set and forget"—tests need maintenance as rules evolve
- Hiding failures instead of surfacing them for fix workflows
See it in action
Need a data quality framework for your pipeline?