Project Providence
Decentralized, idempotent data provenance and cleaning engine for enterprise AI training data — cryptographic mutation lineage, executor-abstracted human or agent labor, and statistical quality control.
Independent AI-infrastructure thesis. Not affiliated with any single employer.
Overview
Providence treats AI training data preparation as a gig-economy factory conveyor with audit-grade provenance. Complex labeling, cleaning, and evaluation pipelines decompose into atomic, state-free micro-tasks (f(D_in, S) = D_out) that can be routed interchangeably to human workers or AI agents. Every mutation is hash-chained into an immutable ledger, so every downstream decision — and every model trained on the data — has a traceable lineage back to the operation, the executor, and the spec that produced it.
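As a rough illustration of the two ideas above, the sketch below pairs a pure micro-task of the form f(D_in, S) = D_out with a hash-chained ledger entry per mutation. It is a minimal in-memory sketch, not the project's implementation: `normalize_label`, `Ledger`, `Entry`, the field names, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

GENESIS = "0" * 64  # assumed sentinel for the first entry's prev_hash


def digest(obj) -> str:
    """SHA-256 over canonical JSON (sorted keys, no extra whitespace)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def normalize_label(d_in: dict, spec: dict) -> dict:
    """Hypothetical micro-task: a pure function of (D_in, S), no hidden state."""
    label = d_in["label"].strip().lower()
    if label not in spec["allowed_labels"]:
        label = spec["fallback"]
    return {**d_in, "label": label}


@dataclass(frozen=True)
class Entry:
    prev_hash: str    # hash of the previous entry (the chain link)
    op_id: str        # which operation ran
    spec_hash: str    # which spec version it ran under
    executor_id: str  # human worker or agent; the ledger does not care which
    input_hash: str
    output_hash: str
    entry_hash: str   # hash over all fields above


class Ledger:
    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def append(self, op_id, spec, executor_id, d_in, d_out) -> Entry:
        prev = self.entries[-1].entry_hash if self.entries else GENESIS
        body = {
            "prev_hash": prev,
            "op_id": op_id,
            "spec_hash": digest(spec),
            "executor_id": executor_id,
            "input_hash": digest(d_in),
            "output_hash": digest(d_out),
        }
        entry = Entry(**body, entry_hash=digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check each link back to GENESIS."""
        prev = GENESIS
        for e in self.entries:
            body = asdict(e)
            stored = body.pop("entry_hash")
            if e.prev_hash != prev or digest(body) != stored:
                return False
            prev = stored
        return True


if __name__ == "__main__":
    spec = {"allowed_labels": ["cat", "dog"], "fallback": "unknown"}
    d_in = {"id": 1, "label": " Cat "}
    d_out = normalize_label(d_in, spec)
    ledger = Ledger()
    ledger.append("normalize_label@1", spec, "worker-7", d_in, d_out)
    print("chain valid:", ledger.verify())
    # Rewriting history (here, the executor) breaks verification.
    ledger.entries[0] = replace(ledger.entries[0], executor_id="worker-9")
    print("after tampering:", ledger.verify())
```

Because each entry's hash covers the previous entry's hash, altering any historical field invalidates that entry and every entry after it. Applying `normalize_label` to its own output returns the same record, which is the idempotency property the micro-task model relies on.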
The Opportunity
The data labeling layer is crowded, but provenance is still treated as a logging concern, not infrastructure. Enterprise AI teams need decision-level provenance (who decided what, under which spec, and how confident was the consensus), a portable spec/operation registry that survives vendor swaps, and a quality control regime that holds under both human and agent execution. Compliance regimes (EU AI Act, sectoral privacy law) are about to make this non-optional.
Approach
Wedge: decision-level provenance for labeling and evaluation pipelines, plus a versioned, portable spec/operation registry. Reference architecture: idempotent micro-task primitives; an executor-abstraction layer; a hash-chained mutation ledger, in Postgres first with an optional external anchor later; stochastic consensus via Gold/Poison/Greenfield task distribution and Blind Duplicate Routing; dynamic worker rating with exponential decay; and trip-wires plus expert triage for ambiguous cases. The design is anchored in Conner's production AI/ML work (KineticCRM 151+ tests, EcoMetrics 105+ dbt gates) and Mercor-class data annotation context.
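One way to read "dynamic worker rating with exponential decay" is as a weighted accuracy in which each past outcome's weight halves every fixed interval. The sketch below assumes that reading; the `Outcome` type, the 30-day half-life, and the neutral prior are hypothetical choices, not parameters stated by the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    age_days: float  # how long ago the task was checked
    correct: bool    # whether the executor's answer matched the reference


def decayed_rating(
    outcomes: list[Outcome],
    half_life_days: float = 30.0,  # assumed: weight halves every 30 days
    prior: float = 0.5,            # assumed: neutral rating with no history
    prior_weight: float = 1.0,     # assumed: prior counts as one fresh outcome
) -> float:
    """Exponentially decayed accuracy in [0, 1], smoothed toward a prior."""
    weighted_hits = prior * prior_weight
    total_weight = prior_weight
    for o in outcomes:
        weight = 0.5 ** (o.age_days / half_life_days)
        weighted_hits += weight * (1.0 if o.correct else 0.0)
        total_weight += weight
    return weighted_hits / total_weight


if __name__ == "__main__":
    recent_miss = [Outcome(0, False), Outcome(30, True)]
    recent_hit = [Outcome(0, True), Outcome(30, False)]
    print(decayed_rating(recent_miss), decayed_rating(recent_hit))
```

Under these assumptions, a recent mistake costs more than an old one, so the rating tracks current performance for human workers and agents alike, and the prior keeps a new executor from being rated on a single task.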
Focus Areas
- Idempotent micro-task decomposition and operation registry
- Cryptographic mutation lineage and client-facing provenance trees
- Stochastic consensus and statistical QC (gold/poison/greenfield, blind duplicates)
- Executor-abstracted labor: human and agent on the same rails
- Compliance-grade audit trails for AI training data