Skip to main content

Epistemic Validation

This is the heart of Subak. No single verification signal tells you whether a system does what it's supposed to: a unit test checks a unit in isolation, an e2e test walks one path, a dependency scanner checks known-vulnerable surfaces, a type checker checks local properties. Each is necessary, partial, and blind to the others — and a stack of independently-green tools still doesn't compose into confidence about the system as a whole. Subak answers the harder, more useful question — "what holds, why does it hold, and what kind of evidence does that rest on?" — by epistemically aggregating all of those independent signals into one signed proof certificate that you (or a CI gate, or a third party) can verify without re-running anything.

A pile of green checkmarks is not a proof of anything. Confidence is something you have to compose.

For the command and flag specifics, see the CLI Reference; for the motivation in full, see Why Subak.

What subak validate proves

Running subak validate establishes five things about your artifact set:

  1. Schema conformance — every JSON artifact validates against its schema.
  2. Cross-reference integrity — every slug reference resolves to an artifact that exists.
  3. Orphan detection — no unreferenced personas, pain points, user goals, or product goals.
  4. Structural integrity — features have flows, flow steps are sequential, demo screenshots exist on disk, declared page routes match the app's router, declared CLI commands match the binary.
  5. Requirement coverage — application capabilities are captured as EARS requirements.

A passing run means, in the CLI's own words, "all artifacts pass schema validation, all cross-references resolve, and the artifacts directory is cohesive per the embedded validation rules." But the interesting part isn't the verdict — it's the artifact the verdict comes packaged in.

Checks versus proofs: the reframe

A check produces a verdict. A proof produces a certificate — a structured, machine-checkable artifact that encodes why the verdict holds: which inputs were examined, which rules were applied, and which prior results each rule depended on. A certificate can be verified by a third party without re-running anything.

That difference cashes out in four ways:

  • Certificates. The output is auditable, not opaque. You can trace any verdict to its evidence.
  • Provenance. Each node records the exact content hashes of the artifacts it read — which is what makes a verdict invalidatable when those inputs change.
  • Obligation generation. A proof system can be inverted: given partial artifacts and the constraint set, it can enumerate what must be created for a rule to pass — turning a failure into a to-do list.
  • Incremental re-checking. Because the tree records which results depend on which artifact hashes, a single edit re-evaluates only the affected nodes.

The proof tree

The certificate is a tree, and the tree is how Subak solves a problem no single tool solves: epistemically aggregating many independent verification signals into one justified claim about the whole system. No individual signal — a unit test, a type check, a schema validation, a dependency scan — yields confidence in the system on its own; each is partial and blind to the others. The proof tree composes them. The root is a single claim about the whole project; its descendants decompose that claim down to the mechanical evidence at the leaves, so that the green at the top genuinely aggregates the green underneath it rather than sitting beside it.

The tree distinguishes two kinds of node:

  • Conclusion nodes (the non-leaves) are authored catalog entries that make a claim. Each carries human-readable what / why / prevents text. A conclusion is atomic if its children are raw evidence, or composite if its children are other conclusions.
  • Primitive nodes (the leaves) carry one piece of evidence from one source — a file:line citation, a test count plus seed and counterexample, an LLM rationale. Primitives carry evidence, never claims.

The root claim is project.is_validated:

what: The project's declared intent is realized by an architecture whose code conforms to its artifacts and whose behavior matches its declared invariants. why: A project may compile and run while silently breaking what it was specified to do. prevents: Production bugs caused by code drifting from spec without anyone noticing.

This two-layer structure lets the same tree serve three audiences from three altitudes. A non-technical reader stops at the composite conclusions and reads business-language claims. An engineer drills into the atomic conclusions for the technical claim. A reviewer drills all the way into the primitives to inspect the mechanical evidence.

A worked example, rendered as a tree, makes the shape concrete:

project.is_validated ✓ [S,R,Q]
├─ architecture.is_sound ✓ [S,R] (section, composite)
│ └─ booking.is_correct (aggregate) ✓ [S,R] (per-artifact composite)
│ ├─ booking.no_mutation_on_reject ✓ [S] (atomic conclusion)
│ │ ├─ source.codespec ✓ (booking.rs:42) ← primitive leaf
│ │ └─ source.proptest ✓ (1000 cases, seed 0x4a7b1c, 0 shrinks)
│ └─ booking.invariants_hold ✓ [S,R]
│ └─ source.unit_test ✓ (12 passed)
├─ requirements.are_complete ✓ [S]
│ └─ ...
└─ features.are_implemented ✓ [S,R,Q]
└─ checkout.flow_is_coherent ✓ [Q]
└─ source.llm ✓ ("flow tells a complete, plausible story")

Notice the tags in brackets — that is the second half of the model.

S / R / Q: what kind of evidence is this?

Every conclusion carries a reasoning tag set, drawn from three letters:

  • S — statically derivable. Fed by primitive sources schema, prolog, codespec, and tooling. The strongest kind of evidence: derivable from the schema, the rules over the artifact graph, the code-to-artifact match, and language-level guarantees, without running anything.
  • R — runtime evidence. Fed by unit_test and proptest. The claim is evidenced by tests that actually ran, not proven from structure.
  • Q — qualitative review. Fed by llm. The claim rests on an LLM-as-judge reading the artifacts and forming an opinion.

There are seven primitive proof-source kinds feeding these tags — the seven independent signals the tree aggregates: schema (the JSON Schema validator), prolog (rules over the artifact graph), codespec (the "code matches the artifact" check — produced by the code generator on day one and an artifact-aware linter plugin on day N), tooling (language-guarantee facts), unit_test, proptest (property tests with shrinking), and llm. Each is partial on its own; the conclusions above them are where they get composed into claims about the system.

Children infect parents

The crucial property: a composite's tag set is the union of its children's tags. Tags propagate up the tree, so the root's tags expose the entire evidence surface of the verdict at a glance:

  • project.is_validated ✓ [S] means the entire verdict is pure static proof.
  • project.is_validated ✓ [S, R, Q] means some branch somewhere rests on runtime evidence and some branch somewhere rests on LLM judgment.

Tags are diagnostic, not categorical — they don't grade the verdict, they tell you what kind of knowledge it rests on. A green [S] root is a far stronger green than a green [S, R, Q] root, and the system refuses to hide that distinction.

:::warning The Q tier is judgment, not proof A conclusion tagged Q rests on an LLM review. That review can be wrong. Subak's discipline is to make that explicit — to surface the Q tag honestly so a human can weigh it — not to launder a judgment into something that looks like a proof. Treat a Q-only branch as an informed opinion, not a guarantee. :::

Justification discipline

Reinforcing the honesty: any atomic conclusion whose reasoning is not exactly {S} must carry a written justification explaining why it isn't purely statically derivable — a decidability gap, a language-guarantee gap, or a soundness gap. The discipline is to push back: a justification that reads "easier to test" is wrong, and the right response is to ask whether more could be declared in the schema to shift the claim up to a static source. The system is biased toward upgrading evidence into proof over time, rather than settling for the weakest source that happens to work.

Visual interface verification

Structural integrity confirms that your demo screenshots exist on disk — a path-existence check that says a file is present, nothing more. Layered on top of that is a deeper, vision-based step: a vision-capable LLM reviews the demo recordings of your running software to confirm the interface actually looks the way it should, rather than merely confirming a screenshot file is there. Existence is structural; "does the interface look right" is visual verification.

This visual check is gated on the GEMINI_API_KEY environment variable. When the key is present, the step runs and contributes its judgment to the proof tree as qualitative (Q) evidence — an opinion the system surfaces honestly, never a guarantee, as with any Q-tier review. When the key is absent, only the structural existence check applies.

note

Visual interface verification activates when GEMINI_API_KEY is set. It is part of a component still rolling out; without the key, validation falls back to confirming demo screenshots exist on disk and skips the deeper "does the interface look right" review.

Memoization: only recompute what changed

Validation memoizes every primitive verdict and replays it on the next run, so a single edit doesn't re-prove the whole project. The cache lives in the committed subak.proof.json and each verdict is keyed by:

ruleID + clauseHash + InputHashes + toolchainKey
  • ruleID — which rule.
  • clauseHash — the rule's own logic; change the rule body and its cached verdicts invalidate.
  • InputHashes — the content hashes of every artifact (and code file) the rule read. Touch one of those files and only the rules that read it recompute.
  • toolchainKey — the relevant tool versions.

Q-tier (LLM) rules additionally fold in the SHA-256 of the effective reviewAgent configuration from subak.json. So bumping the LLM model invalidates every cached Q-tier verdict, while bumping a single artifact still only invalidates the rules whose declared inputs mention that file.

The net effect: re-validating after one small edit is near-instant in the common case, which is what makes it practical to validate after every step rather than only at the end. The --cache-stats flag reports hits, misses, and per-rule recompute times — see the CLI Reference for the exact output and flags.

Validate is read-only

One property is load-bearing for the agentic story: subak validate does not write source code. Authoring schema-validates as you go; a separate, explicit step generates code; and validation only proves conclusions — it never overwrites a file. Generation-as-a-side-effect-of-validation would be a footgun: an agent fills in a stub body, validation overwrites it, work is lost. Day-N drift between code and artifacts is caught by checking, not by overwriting. That is what makes the validator a safe oracle: an agent can call it as often as it likes and its ground truth never mutates the work.

The certificate, signed and verifiable

Every run — pass or fail — produces a certificate with the same structure. A failed project gets a certificate that records what failed and why, in exactly the form a passing one would. The full certificate is content-addressed and stored out of the repo; the committed subak.proof.json is a lightweight index — {certificateHash, commitSHA, generatedAt, summary} — that gives code review a clear signal that a commit's proof state changed.

The certificate is signed, and the signing key plus the certificate schema are public, even though the rule implementations, the artifact content (hashes only), and the LLM prompts are not. This is the TLS model: a certificate authority's domain-validation process is private, but the certificate and its verification are public. Anyone can confirm the signature, confirm the hash matches the nodes, and confirm specific rules passed against specific artifact hashes. (The CLI's verify path checks exactly this — structural validity is independent of pass/fail.)

Honest limits

Subak is precise about what it does and does not prove. Keep these caveats in mind:

  • Most shipped nodes are instance proofs, not universal proofs. An instance proof says "this model was checked." A universal proof says "no valid model can violate this rule" — and those are solver-backed and largely still ahead of the shipped system. Don't read a green root as "universally proven everywhere"; read it as "this project, at this revision, checked out — and here is the evidence surface."

  • The Q (LLM) tier is judgment, not proof. As above: an LLM review is an opinion the system is honest about, not a guarantee.

  • Incremental re-evaluation is partial. The memoization framework is in place and the common single-edit case is fast, but full hot-path wiring of "re-evaluate only stale nodes" everywhere is still being completed. Treat fast incremental re-validation as the design intent that is largely — not yet universally — realized.

  • Runtime tiers depend on tests existing. R-tagged conclusions are only as strong as the unit_test / proptest evidence behind them; a claim with no tests to run can't be evidenced at runtime.

These limits are the point, not an embarrassment. The reasoning tags and justification discipline exist precisely so the certificate never overstates itself: it tells you, honestly, whether a verdict rests on proof, on evidence, or on judgment — and lets you calibrate your confidence accordingly.

Where to go next