On-Device Summaries: CI Evals Without Fake Confidence

Schema-valid summaries that still fail users

The first two articles in this series solved separate problems. The first bound the isolated summary call to a typed NoteSummary. The second kept a refinement flow from drowning in retained transcript. Both are necessary. Neither tells me whether the summary helped the note screen.

The uncomfortable case is simple. The note-summary screen receives a pasted design-review note. Inside that note, the product team made one clear decision: ship the compact note card in the next beta, and defer the expanded card layout until later.

The on-device call returns a NoteSummary with a present headline, exactly three bullets, and a present actionItem. Every shape-level check from the schema-validity gate passes.

Then the user reads it.

The actionItem names the expanded card layout instead, which the note explicitly deferred. The bullets sound plausible enough to survive a quick scan, but they partly miss the decision that mattered. SwiftUI can bind the value. The screen can render it. The user still got the wrong work item.

That is not a failure of the schema gate. It did the job it was designed to do: confirm that the app received a bindable NoteSummary shape. It was never a semantic usefulness check.

Apple's evaluation guidance makes that boundary important: some checks are zero-tolerance pass/fail measurements, while response quality can require more nuanced criteria. Structural validity belongs on the crisp structural side. Treating it as eval coverage is the reasonable wrong turn.

Structural success is necessary. It is not sufficient. The CI question is smaller than summary quality: under the current model, fixture set, and rule set, does this known note shape still pass the screen-specific check? That is the evidence boundary worth building.

Note fixtures as the eval boundary

Apple's Foundation Models evaluation guidance points away from one-off prompts and toward curated datasets.
The related prompt design and safety guidance goes one step further: run those cases through the feature end to end with automation. For this note-summary screen, that means fixtures, not random prompts.

A useful fixture set is hand-picked around shapes the screen actually sees: a design-review note with a clear decision, an ambiguous note where the action could land either way, a long pasted note with a buried action, and a note with no implied action item. I would not call that representative of all notes. It only supports a narrower claim: these known note shapes still pass the checks this screen depends on.

Each fixture exists because it can expose one failure mode. The clear-decision design-review note exists because the screen can produce a schema-valid NoteSummary while still putting the wrong decision into actionItem. The fixture set is the boundary of the claim: these note shapes, these known risks, this screen.

The fixture should carry the note and the metadata the rule needs later. It should not carry the expected NoteSummary. Once the fixture stores the whole expected output, the gate starts drifting toward golden-output diffing: useful generated text can phrase the same decision several ways, so exact-match failures mix real regressions with harmless wording changes.

```swift
// NoteSummaryEvalFixtures.swift
// One row consumed by the parameterized eval gate.
// The fixture pairs the input note with rule-relevant metadata only.
// It deliberately does NOT store an expected NoteSummary value, because
// exact-match diffing of generated content is the wrong evaluation lens.
//
// Invariant: clearDecisionSet contains only clear-decision fixtures
// and requires at least one requiredDecisionTerm.
struct NoteSummaryFixture {
    let id: String                       // stable label for CI output, e.g. "design-review-clear-decision-01"
    let note: String                     // the input note the screen would receive
    let hasClearDecision: Bool           // consumed by actionItemReflectsDecision
    let requiredDecisionTerms: [String]  // stable decision phrases the action item must include
    let impliesActionItem: Bool          // hint flag for additional rules; not asserted by the hero rule
}

enum NoteSummaryFixtures {
    static let clearDecisionSet: [NoteSummaryFixture] = [
        NoteSummaryFixture(
            id: "design-review-clear-decision-01",
            note: """
            Design review notes:
            Decision: ship the compact note card in the next beta.
            Action: Maya will prepare the compact note card beta checklist before Thursday.
            Deferred: the expanded card layout waits until after feedback.
            """,
            hasClearDecision: true,
            requiredDecisionTerms: ["compact note card"],
            impliesActionItem: true
        )
    ]
}
```

The fixture says what kind of note this is and which stable decision phrase matters. That keeps fixture authoring separate from scoring.

A fixture set is only half of it. Each fixture needs a rule narrow enough that CI can fail on it cleanly, without a human deciding what failure means each run.

Boolean usefulness rules for one screen

The fixture set is the data; the question now is what each fixture lets me assert about the NoteSummary. Apple's evaluation guidance for Foundation Models prompts distinguishes pass/fail measurements with zero tolerance from more nuanced quality criteria. A boolean rule belongs in the first category only when the criterion is crisp enough to fail with zero tolerance.

On this screen, I write a usefulness rule as a plain function: it takes a fixture and the NoteSummary produced for it, and returns Bool. The function is named for one specific failure mode, not a generic quality dimension. The hero rule for this screen: for fixtures whose note states a clear decision, the generated actionItem reflects that decision.

That shape matters more than it looks.
A rule earns its place only when the criterion is crisp enough for the test code to answer yes or no on its own. "Is this a good summary" is not a boolean rule; the function would have to encode taste. "Does the actionItem name the decision in this fixture" is. The fixture carries the required decision terms; the function checks them. Ambiguity moves to fixture authoring, where it is reviewed deliberately when the fixture is added or changed, not rediscovered on every CI run.

Criteria that don't survive that test (fluency, tone, completeness) stay out of this gate. Dressing them up as booleans only hides fuzziness inside an assertion that fails inconsistently. The rule set per screen stays small for the same reason: narrow rules give CI something concrete to fail on; mostly-fuzzy checks fail in ways that need human triage, so CI becomes a noise floor instead of a regression signal.

```swift
// NoteSummaryEvalRules.swift
import Foundation  // components(separatedBy:) and CharacterSet come from Foundation

// Lightweight normalization applied to generated text before term matching.
// Fixture terms should be authored as stable, multi-word phrases.
private func normalizeForEval(_ text: String) -> String {
    text.lowercased()
        .components(separatedBy: .punctuationCharacters).joined(separator: " ")
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: " ")
}

// Screen-specific rule: when the note states a clear decision, does the
// generated actionItem include the required decision phrase?
// Returns a binary verdict for one named failure mode on the note-summary screen.
// Not a quality score for NoteSummary in general.
func actionItemReflectsDecision(
    fixture: NoteSummaryFixture,
    summary: NoteSummary
) -> Bool {
    guard fixture.hasClearDecision,
          !fixture.requiredDecisionTerms.isEmpty,
          let actionItem = summary.actionItem else {
        return false
    }
    let haystack = normalizeForEval(actionItem)
    return fixture.requiredDecisionTerms.allSatisfy { term in
        let needle = normalizeForEval(term)
        return !needle.isEmpty && haystack.contains(needle)
    }
}
```

The
function reads summary.actionItem as data. This rule pairs only with clear-decision fixtures; the guard fails closed on bad metadata. The allSatisfy substring check holds up only when requiredDecisionTerms are stable, multi-word phrases; short tokens would need stricter token-boundary matching outside an article-sized example.

Once a rule is this small, the CI shape should be boring: run the fixture, produce the NoteSummary, assert the named failure mode.

Swift test gates for regressions

The eval gate does not need a bespoke harness. It needs a Swift Testing case that treats each fixture as an argument, calls the same summary path the screen calls, and asserts the rule verdict against the produced NoteSummary. Parameterized tests matter here because each fixture gets its own result. If I want fixture × rule visibility, I either model that pair as the argument or keep one parameterized test per named rule.

#require and #expect do different jobs. #require belongs on prerequisites the rule cannot evaluate without: decision terms in the fixture, and in this CI lane, a ready on-device model. That availability branch is not a skip. If the runner cannot provide the model, the gate should fail as infrastructure, loudly. #expect is where the usefulness rule itself fails.

```swift
// NoteSummaryEvalTests.swift
import Testing
import FoundationModels
@testable import NoteSummaryFeature

@Suite("NoteSummary usefulness regressions", .serialized) // Project-local: this CI lane runs device-bound evals serially.
struct NoteSummaryEvalTests {
    @Test(
        "actionItem reflects the clear decision",
        arguments: NoteSummaryFixtures.clearDecisionSet
    )
    func decisionRuleCheck(fixture: NoteSummaryFixture) async throws {
        // CI-lane precondition: fail clearly if this runner has no ready on-device model.
        guard case .available = SystemLanguageModel.default.availability else {
            try #require(false, "On-device model is unavailable on this CI runner.")
            return
        }

        // Fixture-side prerequisite: the rule cannot run without decision terms to match against.
        try #require(!fixture.requiredDecisionTerms.isEmpty)

        // Live summary path - same session factory the app screen uses.
        let session = NoteSummarizer.makeEvalSession()
        let response = try await session.respond(
            to: fixture.note,
            generating: NoteSummary.self
        )

        // Boolean rule verdict for one named failure mode;
        // surface the generated actionItem in failure output for CI triage.
        #expect(
            actionItemReflectsDecision(fixture: fixture, summary: response.content),
            "actionItem: \(response.content.actionItem ?? "")"
        )
    }
}
```

The .serialized trait is a project-local choice for this device-bound eval lane, not a rule for every Foundation Models test. Parallel parameterized execution is usually helpful. If parallel runs make on-device evidence harder to read or less stable, I would rather serialize this suite than debug noise from the runner.

Execution stays ordinary:

```
swift test
xcodebuild test
```

Use the command your project already uses for its package or app target. There is no separate Foundation Models CI tool hiding behind this. Running the gate is the easy part. What a passing run proves depends entirely on how it is labeled.

Evidence labels for CI output

The gate can pass and still be badly described. Apple's prompt-update guidance treats model-version movement as something to monitor: record what the prompt produced before the update, then compare it against the new behavior. That makes a CI line incomplete if it only says "green." Green under which model, with which fixture set, using which scoring rules?

Swift Testing gives me useful reporting surfaces: display names, tags, comments in structured results, and readable parameter descriptions. It does not give this eval a built-in modelVersion, fixtureSetVersion, or scoringRuleVersion field.
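Since none of those three fields is built in, one option is to carry them in a small project-local value and derive a suite name or CI artifact name from it. What follows is a minimal sketch, not an established convention: EvalEvidenceLabel, its property names, the version strings, and the artifact-name format are all hypothetical, and how the model version is obtained is left open (here the CI job is assumed to pass it in as a string).

```swift
// Hypothetical project-local type; Swift Testing has no built-in equivalent.
struct EvalEvidenceLabel: Equatable {
    let modelVersion: String        // assumed to be supplied by the CI job
    let fixtureSetVersion: String   // bumped whenever a fixture is added or changed
    let scoringRuleVersion: String  // bumped whenever a rule's logic changes

    // One-line summary suitable for a suite display name or a CI log line.
    var summaryLine: String {
        "model=\(modelVersion) fixtures=\(fixtureSetVersion) rules=\(scoringRuleVersion)"
    }

    // Conservative variant for naming a CI artifact: every character that is
    // not an ASCII letter, an ASCII digit, "-", or "." becomes "_".
    var artifactName: String {
        let raw = "eval-\(modelVersion)-\(fixtureSetVersion)-\(scoringRuleVersion)"
        let safe = raw.map { ch -> Character in
            let keep = ch.isASCII && (ch.isLetter || ch.isNumber || ch == "-" || ch == ".")
            return keep ? ch : "_"
        }
        return String(safe)
    }
}

// Placeholder values only; real values would come from the CI environment.
let label = EvalEvidenceLabel(
    modelVersion: "os-build-placeholder",
    fixtureSetVersion: "fixtures-3",
    scoringRuleVersion: "rules-2"
)
```

Keeping the three values in one Equatable value also makes it cheap to notice when two green runs are not comparable: if the labels differ, the runs measured different things.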
Those labels are project convention, carried in suite names, tags, or CI artifact naming. The CustomTestStringConvertible adapter solves only the per-argument part.

```swift
// NoteSummaryFixture+TestDescription.swift
import Testing

extension NoteSummaryFixture: CustomTestStringConvertible {
    // Stable, human-readable label for parameterized test output.
    // Keep CI output focused on the fixture identity and triage flags;
    // the raw note belongs in the fixture, not in every test result line.
    var testDescription: String {
        let decision = hasClearDecision ? "decision" : "no-decision"
        let action = impliesActionItem ? "action-implied" : "no-action-implied"
        return "\(id) [\(decision), \(action)]"
    }
}
```

That label makes each failing fixture readable without dumping the note body into CI output. It does not replace the larger evidence labels. Without model version, fixture set version, and scoring rule version, a green run is hard to reproduce and too easy to overread. With them, the gate's scope stays visible.

With a real gate, narrow rules, and attributable evidence in place, the operating discipline reduces to three rules.

Three rules for honest summary evals

A CI eval for NoteSummary is useful only when it names the evidence it has and refuses to borrow confidence about summary quality from the parts it did not test.

Rule one: use real fixtures aimed at screen-specific failure modes. Not random prompts. Not universal coverage. The fixture set is the boundary of the claim.

Rule two: score only crisp criteria as booleans. If the test code cannot answer yes or no without hiding judgment inside the helper, the criterion does not belong in this gate.

Rule three: label the evidence by model version, fixture set version, and scoring rule version. A green run without those labels is hard to reproduce and too easy to overread.

Human review, telemetry, and prompt revision still matter, but they sit outside this CI gate and should not be smuggled into what green means.

Real fixtures targeting screen-specific failure modes. Narrow boolean rules; nothing fuzzy dressed up as a Bool. Versioned labels: model, fixture set, rules.

On this screen, a green CI run does not certify summary quality; it is narrow, versioned evidence, and staying small is exactly what makes it useful.
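
One loose end from the rule section is the caveat that short decision terms would need token-boundary matching instead of a substring check. A minimal sketch of that stricter variant follows, assuming two hypothetical helpers, evalTokens and containsTokenSequence, that are not part of the rule set shown earlier. It tokenizes on anything that is not a letter or digit, which is slightly stricter than the punctuation-and-whitespace normalization in the hero rule.

```swift
import Foundation

// Hypothetical helper: lowercase the text, then split on every character
// that is not alphanumeric, dropping empty pieces.
func evalTokens(_ text: String) -> [String] {
    text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

// Hypothetical helper: true when the tokens of `term` appear in `text` as a
// contiguous run of whole tokens, so "card" does not match "cardboard".
func containsTokenSequence(_ text: String, term: String) -> Bool {
    let haystack = evalTokens(text)
    let needle = evalTokens(term)
    // Fail closed on empty terms and on terms longer than the text.
    guard !needle.isEmpty, needle.count <= haystack.count else { return false }
    for start in 0...(haystack.count - needle.count) {
        let window = haystack[start..<(start + needle.count)]
        if window.elementsEqual(needle) {
            return true
        }
    }
    return false
}
```

Swapping this in for the substring check would leave the rule's signature unchanged; only the matching inside the allSatisfy closure would differ. Whether the extra strictness is worth it depends on how short the fixture's decision terms are.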