Data Pipelines & Transforms
When someone says 'we have a data pipeline,' they mean: automated steps that take raw inputs and produce clean, structured outputs. No more copy-paste from Excel into a dashboard.
- Pipeline = extract (get data) → transform (clean, reshape, join) → load (into a usable form)
- Transforms are the rules: standardize units, normalize names, fill gaps, validate ranges
- Why keep extract separate from transform? So you can rerun transforms on the stored raw data without re-extracting from the source
- dbt, SQL, or Python: different tools, same idea of codified transformation logic (a minimal sketch follows this list)
- The goal: raw data in, decision-ready data out. Repeatable. Auditable.
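A minimal sketch of that flow in plain Python, using hypothetical file paths, table names, and columns; a real pipeline adds scheduling, logging, and failure handling on top:

```python
# Minimal extract -> transform -> load sketch. The source file, the
# "inventory" table, and the column names are illustrative assumptions.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Get raw rows out of the source file; no cleanup yet."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Apply codified rules: drop unusable rows, trim names, coerce numbers."""
    clean = []
    for row in rows:
        if not row.get("sku"):
            continue  # rule: a row without a SKU is unusable
        clean.append({"sku": row["sku"].strip().upper(),
                      "qty": int(row.get("qty") or 0)})
    return clean

def load(rows: list[dict], db_path: str = "warehouse.db") -> None:
    """Write clean rows where dashboards and analytics can query them."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS inventory (sku TEXT, qty INTEGER)")
    con.executemany("INSERT INTO inventory VALUES (:sku, :qty)", rows)
    con.commit()
    con.close()

def run(path: str) -> None:
    load(transform(extract(path)))
```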
Real-world example
A supplier sends 10 Excel files, each in a different format
Each file has "recycled %" in a different column. One uses decimals (0.5), another percentages (50). Dates are DD/MM, MM-DD, and "Q1 2024."
- Extract: Pull files from email, Drive, or API. Get raw bytes into the system.
- Transform: Map "recycled %" from whichever column it's in. Normalize 0.5 and 50 to the same scale. Parse all date variants to ISO. Join with the SKU reference table (see the sketch after this list).
- Load: Write clean rows to a table (or datastore) that analytics and dashboards read from.
- Next week: New files arrive. Pipeline runs again. Same transforms. Same output shape. No manual cleanup.
- The pipeline is the contract: "Give us messy input; we give you clean output."
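A sketch of what that transform step could look like in Python. The column aliases, the quarter-to-date convention, and the exact date formats are assumptions about these files:

```python
# Transform rules for the supplier example. RECYCLED_ALIASES and the date
# formats are illustrative; adjust them to what the real files contain.
from datetime import datetime

RECYCLED_ALIASES = {"recycled %", "recycled_pct", "pct recycled"}

def find_recycled_value(row: dict) -> float | None:
    """Map "recycled %" from whichever column it lives in."""
    for key, value in row.items():
        if key.strip().lower() in RECYCLED_ALIASES and value not in (None, ""):
            return float(value)
    return None

def normalize_pct(value: float) -> float:
    """Bring 0.5 and 50 onto the same 0-100 scale."""
    return value * 100 if value <= 1 else value

def parse_report_date(raw: str) -> str:
    """Parse DD/MM, MM-DD, and "Q1 2024" variants to an ISO date."""
    raw = raw.strip()
    if raw.upper().startswith("Q"):              # e.g. "Q1 2024" -> 2024-01-01
        quarter, year = int(raw[1]), raw.split()[1]
        return f"{year}-{(quarter - 1) * 3 + 1:02d}-01"
    for fmt in ("%d/%m/%Y", "%m-%d-%Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")
```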
Pipelines turn "someone sent a spreadsheet" into "we have queryable, validated data."
A pipeline is a sequence of steps that moves data from source(s) to destination(s), with transformations in between. Extract (get it), Transform (clean and reshape it), Load (put it where it's needed). Sometimes ELT: load raw first, transform in place.
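A small ELT sketch, assuming a SQLite warehouse and illustrative table names: raw rows land first, untouched, and a SQL statement then builds the clean table in place.

```python
# ELT variant: load raw rows as-is, then transform inside the warehouse.
# "raw_supplier" and "supplier_clean" are hypothetical table names.
import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS raw_supplier (sku TEXT, recycled_pct TEXT)")
# ...raw files get inserted into raw_supplier here, with no cleanup...
con.execute("DROP TABLE IF EXISTS supplier_clean")
con.execute("""
    CREATE TABLE supplier_clean AS
    SELECT UPPER(TRIM(sku)) AS sku,
           CASE WHEN CAST(recycled_pct AS REAL) <= 1
                THEN CAST(recycled_pct AS REAL) * 100
                ELSE CAST(recycled_pct AS REAL)
           END AS recycled_pct
    FROM raw_supplier
""")
con.commit()
con.close()
```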
Raw data is inconsistent. Transforms apply rules: standardize units, normalize names, validate ranges, join with reference tables. The output becomes decision-ready—same structure, same semantics, every run.
Transforms should be codified (SQL, dbt, Python). That way they're repeatable, versioned, and testable. Manual fixes don't scale.
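A minimal sketch of what codified and testable means in practice, assuming an illustrative rule and pytest: the rule is a plain function that lives in version control, and the tests pin down its behavior.

```python
# A transform rule as a plain, testable function. The 0-100 range and the
# function name are illustrative assumptions, not a fixed standard.
import pytest

def validate_recycled_pct(value: float) -> float:
    """Rule: recycled content must land between 0 and 100 percent."""
    if not 0 <= value <= 100:
        raise ValueError(f"recycled_pct out of range: {value}")
    return value

def test_accepts_valid_percentage():
    assert validate_recycled_pct(37.5) == 37.5

def test_rejects_out_of_range_value():
    with pytest.raises(ValueError):
        validate_recycled_pct(250)  # e.g. a decimal that was scaled twice
```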
When you need a pipeline
- Multiple sources that need to be combined
- Data that arrives regularly (daily, weekly) and needs the same treatment
- Transform logic complex enough that manual work is error-prone
- You need an audit trail: what ran, when, what changed (see the sketch below)
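As a sketch of that audit-trail point, assuming a simple run-log table (names and layout are illustrative): every run records what ran, when, and how many rows went in and out.

```python
# Minimal audit trail: one row per pipeline run. Table name and columns
# are illustrative assumptions.
import sqlite3
from datetime import datetime, timezone

def record_run(db_path: str, pipeline: str, rows_in: int, rows_out: int) -> None:
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS pipeline_runs (
            pipeline TEXT, ran_at TEXT, rows_in INTEGER, rows_out INTEGER
        )
    """)
    con.execute(
        "INSERT INTO pipeline_runs VALUES (?, ?, ?, ?)",
        (pipeline, datetime.now(timezone.utc).isoformat(), rows_in, rows_out),
    )
    con.commit()
    con.close()

# Called at the end of each run, e.g.:
# record_run("warehouse.db", "supplier_recycled", rows_in=1042, rows_out=1017)
```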
Common mistakes
- Baking transforms into reports instead of a dedicated transformation layer, which makes them hard to reuse and test
- No failure handling: the pipeline breaks and bad data flows through (see the sketch after this list)
- Transforms that are opaque—no one knows why a value became X
- Treating the pipeline as one-off scripts instead of maintained infrastructure
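On the failure-handling point, a sketch of a simple validation gate (the 5% threshold and field names are illustrative assumptions): bad rows are set aside instead of loaded, and the run aborts loudly if too much of the batch fails.

```python
# Basic failure handling: quarantine invalid rows, abort on a bad batch.
# The fields checked and the 5% threshold are illustrative assumptions.
def is_valid(row: dict) -> bool:
    """A usable row has a SKU and a recycled % inside the 0-100 range."""
    try:
        return bool(row.get("sku")) and 0 <= float(row["recycled_pct"]) <= 100
    except (KeyError, TypeError, ValueError):
        return False

def guard(rows: list[dict], max_bad_fraction: float = 0.05) -> list[dict]:
    """Return only valid rows; fail the run if too much of the batch is bad."""
    valid = [r for r in rows if is_valid(r)]
    bad = len(rows) - len(valid)
    if rows and bad / len(rows) > max_bad_fraction:
        raise RuntimeError(f"{bad} of {len(rows)} rows failed validation; aborting load")
    return valid
```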
Need a data pipeline for your analytics?