Data Pipelines & Transforms

When someone says "we have a data pipeline," they mean: automated steps that take raw inputs and produce clean, structured outputs. No more copy-paste from Excel into a dashboard.

  • Pipeline = extract (get data) → transform (clean, reshape, join) → load (into a usable form)
  • Transforms are the rules: standardize units, normalize names, fill gaps, validate ranges
  • Why separate transform from load? So you can rerun transforms without re-extracting
  • dbt, SQL, Python—different tools, same idea: codified transformation logic
  • The goal: raw data in, decision-ready data out. Repeatable. Auditable.
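The three stages above can be sketched in a few lines of Python. The CSV sample, the function names, and the `pct <= 1` heuristic are illustrative assumptions, not from any particular tool:

```python
import csv
import io

# Raw input as a supplier might send it: mixed percentage scales.
RAW = "sku,recycled_pct\nA-100,0.5\nB-200,50\n"

def extract(raw: str) -> list[dict]:
    """Get raw rows into the system as-is."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[dict]:
    """Apply cleaning rules: bring both scales onto 0-100.
    Assumption: values <= 1 are fractions (exactly 1 is ambiguous)."""
    out = []
    for row in rows:
        pct = float(row["recycled_pct"])
        out.append({"sku": row["sku"],
                    "recycled_pct": pct * 100 if pct <= 1 else pct})
    return out

def load(rows: list[dict], table: list) -> None:
    """Write clean rows to the destination (a list stands in for a table)."""
    table.extend(rows)

table: list[dict] = []
load(transform(extract(RAW)), table)
```

Because each stage is a separate function, you can rerun `transform` on already-extracted data without touching the source, which is exactly the point made above.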

Real-world example

Supplier sends 10 Excel files with different formats

Each file has "recycled %" in a different column. One uses decimals (0.5), another percentages (50). Dates are DD/MM, MM-DD, and "Q1 2024."

  • Extract: Pull files from email, Drive, or API. Get raw bytes into the system.
  • Transform: Map "recycled %" from whichever column it's in. Normalize 0.5 and 50 to same format. Parse all date variants to ISO. Join with SKU reference table.
  • Load: Write clean rows to a table (or datastore) that analytics and dashboards read from.
  • Next week: New files arrive. Pipeline runs again. Same transforms. Same output shape. No manual cleanup.
  • The pipeline is the contract: "Give us messy input; we give you clean output."
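The date normalization in the transform step is the fiddly part in practice. A sketch of a parser for the three variants mentioned above, with the exact input formats assumed (DD/MM/YYYY, MM-DD-YYYY, and "Q1 2024" mapped to the quarter's first day):

```python
import re
from datetime import datetime

def parse_date(value: str) -> str:
    """Normalize the date variants from the example to ISO (YYYY-MM-DD)."""
    # "Q1 2024" style: map a quarter to its first day.
    q = re.fullmatch(r"Q([1-4])\s+(\d{4})", value.strip())
    if q:
        quarter, year = int(q.group(1)), q.group(2)
        return f"{year}-{(quarter - 1) * 3 + 1:02d}-01"
    # Try each concrete format in turn.
    for fmt in ("%d/%m/%Y", "%m-%d-%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Raising on an unrecognized format, rather than guessing, is what keeps the pipeline's contract honest: bad input fails loudly instead of becoming a silently wrong date.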

Pipelines turn "someone sent a spreadsheet" into "we have queryable, validated data."

A pipeline is a sequence of steps that moves data from source(s) to destination(s), with transformations in between. Extract (get it), Transform (clean and reshape it), Load (put it where it's needed). Sometimes ELT: load raw first, transform in place.

Raw data is inconsistent. Transforms apply rules: standardize units, normalize names, validate ranges, join with reference tables. The output becomes decision-ready—same structure, same semantics, every run.
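A range-validation rule like the one above might look like this; the rule table and field names are hypothetical:

```python
def validate(rows: list[dict], rules: dict) -> list[tuple]:
    """Check each row against (lo, hi) range rules; collect violations
    instead of silently passing bad data downstream."""
    errors = []
    for i, row in enumerate(rows):
        for field, (lo, hi) in rules.items():
            v = row.get(field)
            if v is None or not (lo <= v <= hi):
                errors.append((i, field, v))
    return errors

# Hypothetical rule: a recycled percentage must be between 0 and 100.
rules = {"recycled_pct": (0.0, 100.0)}
rows = [{"recycled_pct": 50.0}, {"recycled_pct": 150.0}]
```

Here `validate(rows, rules)` flags the second row, and the pipeline can decide whether to quarantine it or halt the run.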

Transforms should be codified (SQL, dbt, Python). That way they're repeatable, versioned, and testable. Manual fixes don't scale.
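Because codified transforms are plain functions, they can be unit tested like any other code. A minimal example; the name-normalization rule is invented for illustration:

```python
def normalize_name(raw: str) -> str:
    """Illustrative rule: trim, collapse whitespace, title-case supplier names."""
    return " ".join(raw.split()).title()

# Tests pin the rule down; any change to the logic that breaks the
# expected output is caught before it reaches production data.
def test_normalize_name():
    assert normalize_name("  acme   CORP ") == "Acme Corp"
    assert normalize_name("Acme Corp") == "Acme Corp"  # idempotent

test_normalize_name()
```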

Signs you need a pipeline:

  • Multiple sources that need to be combined
  • Data that arrives regularly (daily, weekly) and needs the same treatment
  • Transform logic complex enough that manual work is error-prone
  • You need an audit trail: what ran, when, what changed
Common pitfalls:

  • Baking transforms into reports instead of a dedicated layer—hard to reuse and test
  • No failure handling—pipeline breaks, bad data flows through
  • Transforms that are opaque—no one knows why a value became X
  • Treating the pipeline as one-off scripts instead of maintained infrastructure
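A lightweight guard against failure handling and opacity is to wrap each step so it logs what ran and when, and fails fast instead of letting bad data flow through. A sketch; the wrapper and step names are assumptions, not a specific framework:

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(name, fn, data):
    """Run one pipeline step with a minimal audit trail."""
    started = datetime.now(timezone.utc).isoformat()
    try:
        result = fn(data)
        log.info("step=%s started=%s rows_out=%d", name, started, len(result))
        return result
    except Exception:
        log.exception("step=%s started=%s FAILED", name, started)
        raise  # stop the run; do not pass bad data downstream
```

The log lines answer "what ran, when, and what changed" after the fact, and the re-raise ensures a broken step halts the run rather than producing silently wrong output.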
