AI Infrastructure & Provenance | positioning

Project Providence

Decentralized, idempotent data provenance and cleaning engine for enterprise AI training data — cryptographic mutation lineage, executor-abstracted human or agent labor, and statistical quality control.

Independent AI-infrastructure thesis. Not affiliated with any single employer.

Overview

Providence treats AI training data preparation as a gig-economy factory conveyor with audit-grade provenance. Complex labeling, cleaning, and evaluation pipelines decompose into atomic, stateless micro-tasks of the form f(D_in, S) = D_out, which can be routed interchangeably to human workers or AI agents. Every mutation is hash-chained into an immutable ledger, so every downstream decision, and every model trained on the data, can be traced back to the operation, the executor, and the spec that produced it.
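A minimal sketch of the micro-task shape f(D_in, S) = D_out described above, in Python. It reads S as the task spec, which is an assumption, and every name here (canonical_hash, run_micro_task, normalize_label, the in-memory cache) is hypothetical and chosen for illustration only. The task ID is derived from the operation name, the input, and the spec, so re-submitting the same work resolves to the same result no matter which executor picks it up.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable


def canonical_hash(obj: Any) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TaskResult:
    task_id: str   # derived from (operation, D_in, S), so retries collide on purpose
    executor: str  # hypothetical ID of the human worker or agent that ran the task
    d_out: Any


def run_micro_task(
    operation: str,
    fn: Callable[[Any, dict], Any],
    d_in: Any,
    spec: dict,
    executor: str,
    cache: dict,
) -> TaskResult:
    """Run fn(d_in, spec) at most once per (operation, d_in, spec)."""
    task_id = canonical_hash({"op": operation, "d_in": d_in, "spec": spec})
    if task_id in cache:
        # Idempotent replay: return the recorded result instead of re-executing.
        return cache[task_id]
    result = TaskResult(task_id=task_id, executor=executor, d_out=fn(d_in, spec))
    cache[task_id] = result
    return result


def normalize_label(d_in: dict, spec: dict) -> dict:
    """Example operation: a pure function of its input and spec, with no hidden state."""
    label = d_in["label"].strip()
    if spec.get("lowercase", False):
        label = label.lower()
    return {"text": d_in["text"], "label": label}


# Usage: the same task submitted twice, by two different executors.
cache: dict = {}
spec = {"version": "1.0", "lowercase": True}
row = {"text": "great product", "label": "  Positive "}
first = run_micro_task("normalize_label", normalize_label, row, spec, "agent-7", cache)
second = run_micro_task("normalize_label", normalize_label, row, spec, "human-42", cache)
```

Here the second call returns the result recorded by the first, which is one way to make retries safe. A production registry would more plausibly key on a versioned operation ID than on a bare function name.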

The Opportunity

The data labeling layer is crowded, but provenance is still treated as a logging concern, not infrastructure. Enterprise AI teams need decision-level provenance (who decided what, under which spec, and how confident was the consensus), a portable spec/operation registry that survives vendor swaps, and a quality control regime that holds under both human and agent execution. Compliance regimes (EU AI Act, sectoral privacy law) are about to make this non-optional.

Approach

Wedge: decision-level provenance for labeling and evaluation pipelines, plus a versioned, portable spec/operation registry.

Reference architecture: idempotent micro-task primitives; an executor-abstraction layer; a hash-chained mutation ledger, in Postgres first with an optional external anchor later; stochastic consensus via Gold/Poison/Greenfield task distribution and Blind Duplicate Routing; dynamic worker rating with exponential decay; and trip-wires and expert triage for ambiguous cases.

Anchored in Conner's production AI/ML work (KineticCRM 151+ tests, EcoMetrics 105+ dbt gates) and Mercor-class data annotation context.
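The hash-chained mutation ledger can be illustrated with a short Python sketch. This is an in-memory stand-in, not the Postgres design itself; the entry fields (operation, spec_version, executor, input_hash, output_hash) and the all-zero genesis value are assumptions made for the example. Each entry commits to the hash of the previous one, so editing any earlier record breaks verification from that point on.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, body: dict) -> str:
    """Hash an entry body together with the hash of the entry before it."""
    payload = json.dumps(
        {"prev": prev_hash, "body": body}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Ledger:
    entries: list = field(default_factory=list)

    def append(self, operation: str, spec_version: str, executor: str,
               input_hash: str, output_hash: str) -> dict:
        """Record one mutation and link it to the current chain head."""
        body = {
            "operation": operation,
            "spec_version": spec_version,
            "executor": executor,
            "input_hash": input_hash,
            "output_hash": output_hash,
        }
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry = {"prev": prev, "body": body, "hash": entry_hash(prev, body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every link; any edited body or broken pointer fails the check."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != entry_hash(prev, entry["body"]):
                return False
            prev = entry["hash"]
        return True


# Usage: two chained mutations, then a tampering attempt on the first one.
ledger = Ledger()
ledger.append("dedupe", "1.0", "agent-7", "in-a", "out-a")
ledger.append("label", "1.2", "human-42", "out-a", "out-b")
ok_before = ledger.verify()

ledger.entries[0]["body"]["executor"] = "someone-else"
ok_after = ledger.verify()
```

In a relational store the same idea would likely map to an append-only table with prev and hash columns. Note that a hash chain makes tampering detectable rather than impossible, which is where an external anchor could add assurance.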

Focus Areas

• Idempotent micro-task decomposition and operation registry
• Cryptographic mutation lineage and client-facing provenance trees
• Stochastic consensus and statistical QC (gold/poison/greenfield, blind duplicates)
• Executor-abstracted labor: human and agent on the same rails
• Compliance-grade audit trails for AI training data
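As a rough illustration of the statistical QC focus area, the Python sketch below combines two of the ideas named above: a worker rating that decays exponentially with the age of each observation, and a rating-weighted consensus over blind duplicates. It assumes gold tasks are tasks with known answers, and it does not model poison or greenfield tasks. The half-life, the 0.66 agreement threshold, the 0.5 default rating for unknown workers, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def decayed_accuracy(outcomes: list, now: float, half_life: float) -> Optional[float]:
    """Accuracy on gold tasks, where each outcome's weight halves every half_life.

    outcomes is a list of (timestamp, correct) pairs. Returns None with no history.
    """
    weighted_correct = 0.0
    total_weight = 0.0
    for timestamp, correct in outcomes:
        weight = 0.5 ** ((now - timestamp) / half_life)
        weighted_correct += weight * (1.0 if correct else 0.0)
        total_weight += weight
    return weighted_correct / total_weight if total_weight else None


def weighted_consensus(votes: dict, ratings: dict, threshold: float = 0.66):
    """Combine blind-duplicate votes, weighting each worker by their rating.

    Returns (label, share). The label is None when the leading answer's share
    of the total weight is below the threshold, signalling expert triage.
    """
    totals = defaultdict(float)
    for worker, label in votes.items():
        totals[label] += ratings.get(worker, 0.5)  # assumed prior for unrated workers
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    label, weight = max(totals.items(), key=lambda item: item[1])
    share = weight / total
    return (label if share >= threshold else None), share


# Usage: an old miss counts for less than a recent hit.
rating = decayed_accuracy([(0.0, False), (10.0, True)], now=10.0, half_life=10.0)

ratings = {"a": 0.9, "b": 0.8, "c": 0.6}
agreed, agreed_share = weighted_consensus({"a": "cat", "b": "cat", "c": "dog"}, ratings)
split, split_share = weighted_consensus({"a": "cat", "c": "dog"}, ratings)
```

With these inputs the three-way vote clears the threshold, while the two-way split does not and would be escalated. The threshold and half-life are the kind of parameters that would need tuning against real agreement data.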

Technologies

Python, FastAPI, PostgreSQL, Hash-chain ledger, LLMs & agentic workers, Evaluation harnesses
