Stokoe Caption Transport Specification (CTS)

Version: 0.9 (draft)
Date: 2025-12-27
Owner: Robert McConnell
Status: Draft / Implemented-in-parts

An event-driven, source-agnostic transport layer for real-time captions and acoustic context

1. Purpose

The Caption Transport Specification (CTS) defines an event-driven, source-agnostic transport layer for real-time captions and acoustic context. CTS is designed to carry incremental speech text, sound events, and timing metadata from one or more recognition sources to multiple consumers (UI panes, history buffers, storage, export, analytics) with low latency, traceable provenance, and stable failure modes.

CTS treats captions as first-class time-series events, not as an incidental output of an AI response system.


2. Goals

  1. Low-latency incremental display (typewriter-like behavior).
  2. Source-agnostic ingestion (local ASR, cloud ASR, hybrid).
  3. Deterministic commit and history semantics (no silent retroactive rewrites).
  4. Event-driven integration with sound classification (clink, laugh, overlap, etc.).
  5. Multiple consumers (NOW pane, History, Cinema, logs) without tight coupling.
  6. Auditable provenance (which engine produced which tokens at what time).
  7. Graceful degradation when ASR stalls or audio is unavailable.
  8. Explicit boundary handling (pause events, VAD gates, turn boundaries).

3. Non-goals

  • Defining or prescribing a specific ASR model or vendor.
  • Speaker diarization requirements (supported as an optional extension).
  • Natural language post-processing (punctuation/grammar) beyond optional enrichment.
  • UI layout and styling (CTS specifies contracts, not presentation).

4. Terminology

  • Frame: A chunk of audio samples processed as a unit.
  • Envelope: A short-term energy summary over frames (often used for visual chips).
  • VAD: Voice Activity Detection.
  • NOW: The immediate, streaming display region (ephemeral).
  • History: The durable transcript buffer (scrollable / persisted).
  • Candidate: An unfinalized text segment that may be revised.
  • Commit: A durable acceptance of content into History.
  • SoundEvent: A non-speech acoustic classification event (clink, laugh, etc.).
  • Transport Event: A CTS event emitted on the caption bus.
  • Producer: An engine emitting CTS events (local ASR, cloud, SID).
  • Consumer: UI/persistence logic subscribing to CTS events.

5. Architecture

5.1 Overview

CTS is a publish/subscribe event bus. Producers emit events; consumers subscribe and derive views and persistence.

Producers

  • Local ASR (Apple Speech)
  • Cloud ASR (Realtime transcription, batch, etc.)
  • SID/Sound classifier (SoundEvent + envelope chips)
  • Optional: diarizer, language model enrichment

Core

  • CaptionBus (in-process)
  • CaptionReducer (optional): deterministic state updates from events
  • CommitGate: rules for when candidates become commits

Consumers

  • NOW pane renderer
  • History buffer
  • Inspector/metrics
  • Cinema/conference view
  • Exporters (JSON, SRT/VTT, plaintext, PDF later)

5.2 Determinism rule

Consumers MUST be able to reconstruct derived state from an ordered event log:

Same input event sequence → same derived state.
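The determinism rule can be illustrated with a pure reducer folded over the event log. This is a minimal sketch, not the CaptionReducer itself: the state shape (an "open" map of segment text plus a "history" list) and the function names reduce_event and derive are hypothetical, and deltas are treated as full replacements for simplicity.

```python
def reduce_event(state, event):
    """Return a new state from (state, event); never mutates its input."""
    new_state = {"open": dict(state["open"]), "history": list(state["history"])}
    payload = event.get("payload", {})
    if event["type"] == "caption.delta":
        # Assumption: each delta carries the full current text of its segment.
        new_state["open"][payload["segment_id"]] = payload["text"]
    elif event["type"] == "caption.commit":
        # A commit supersedes uncommitted delta content for the same segment.
        new_state["open"].pop(payload["segment_id"], None)
        new_state["history"].append(payload["text"])
    return new_state


def derive(events):
    """Fold the ordered event log into derived state."""
    state = {"open": {}, "history": []}
    for event in events:
        state = reduce_event(state, event)
    return state
```

Because reduce_event depends only on its arguments, calling derive twice on the same log yields equal states, which is the property the rule asks for.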


6. Event Model

All events share a common header, plus a typed payload.

6.1 Common Header (required)

{
  "event_id": "uuid",
  "type": "caption.delta | caption.commit | vad.state | sound.event | transport.status | error",
  "ts_event_ms": 0,
  "ts_audio_ms": 0,
  "source": {
    "id": "local.appleSpeech | cloud.realtime | sid",
    "kind": "asr | sid | diarizer | enrichment",
    "version": "string",
    "session_id": "string"
  },
  "seq": 0
}

Fields

  • event_id: unique identifier (UUID).
  • type: event type enum.
  • ts_event_ms: monotonic time when emitted.
  • ts_audio_ms: audio timeline reference (best-effort).
  • source: provenance.
  • seq: monotonically increasing integer per source (strict ordering).
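The per-source seq requirement can be checked at the bus boundary. The sketch below is one possible check under stated assumptions: the class name SeqChecker is hypothetical, and keying the counter by (source.id, source.session_id) is an assumption, since the field list only says "per source".

```python
class SeqChecker:
    """Tracks the last seq seen per source and rejects non-increasing values."""

    def __init__(self):
        self._last_seq = {}

    def accept(self, event):
        # Assumption: a "source" is identified by id plus session_id.
        key = (event["source"]["id"], event["source"]["session_id"])
        last = self._last_seq.get(key)
        if last is not None and event["seq"] <= last:
            return False  # duplicate or out-of-order event from this source
        self._last_seq[key] = event["seq"]
        return True
```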

6.2 Event Types

6.2.1 caption.delta

Incremental text updates for NOW display.

{
  "type": "caption.delta",
  "payload": {
    "segment_id": "uuid",
    "range": { "start": 0, "end": 12 },
    "text": "hello worl",
    "tokens": [
      {"t":"hello","conf":0.92,"ts_audio_ms":1234},
      {"t":"worl","conf":0.70,"ts_audio_ms":1450}
    ],
    "is_partial": true,
    "stability": 0.0,
    "channel": "primary",
    "language": "en-US"
  }
}

Rules

  • Deltas MAY revise previously emitted text for the same segment_id.
  • Consumers MUST treat deltas as ephemeral unless committed.
  • range (optional) describes where this delta applies within the segment; when it is omitted, the delta replaces the full segment text.
  • stability is an optional heuristic (0.0–1.0); higher values mean the text is less likely to be revised.
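A consumer might apply a delta to its copy of the open segment as sketched below. The units of range are not defined above, so this example assumes half-open character offsets [start, end) into the current segment text; the function name apply_delta is hypothetical.

```python
def apply_delta(current_text, payload):
    """Return the segment text after applying one caption.delta payload."""
    rng = payload.get("range")
    if rng is None:
        # No range: the delta is a full replacement of the segment text.
        return payload["text"]
    # Assumption: range is a half-open character span [start, end).
    start, end = rng["start"], rng["end"]
    return current_text[:start] + payload["text"] + current_text[end:]
```

For example, applying a delta with text "world" and range {start: 6, end: 10} to "hello word" produces "hello world".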

6.2.2 caption.commit

Durable acceptance into History.

{
  "type": "caption.commit",
  "payload": {
    "commit_id": "uuid",
    "segment_id": "uuid",
    "text": "Hello world.",
    "tokens": [
      {"t":"Hello","conf":0.94,"ts_audio_ms":1234},
      {"t":"world.","conf":0.93,"ts_audio_ms":1450}
    ],
    "final": true,
    "commit_reason": "pause | vad_end | explicit | time_limit",
    "span": { "ts_audio_start_ms": 1200, "ts_audio_end_ms": 1800 }
  }
}

Rules

  • A commit MUST NOT silently delete prior commits.
  • A commit MAY supersede uncommitted delta content for the same segment_id.
  • Commit is the authoritative payload for History.

6.2.3 vad.state

VAD state transitions for gating and UX.

{
  "type": "vad.state",
  "payload": {
    "state": "inactive | active",
    "confidence": 0.0,
    "ts_audio_ms": 0
  }
}

6.2.4 sound.event

Non-speech sound classification.

{
  "type": "sound.event",
  "payload": {
    "sound_id": "uuid",
    "label": "clink | laugh | applause | cough | knock | music | overlap | unknown",
    "strength": 0.0,
    "duration_ms": 0,
    "features": {
      "centroid_hz": 0.0,
      "bandwidth_hz": 0.0,
      "rms": 0.0
    },
    "ui_hint": {
      "glyph": "🥂",
      "pattern": "glass | metal | soft",
      "cooldown_ms": 600
    }
  }
}

Rules

  • Sound events MUST be emitted even when ASR is offline (if the audio pipeline is running).
  • Consumers MAY show a top-left “pill” alert for these events.

6.2.5 transport.status

Connectivity / health events.

{
  "type": "transport.status",
  "payload": {
    "state": "starting | running | degraded | stopped",
    "details": "string",
    "active_sources": ["local.appleSpeech","sid"],
    "latency_ms": 0
  }
}

6.2.6 error

Errors with explicit codes.

{
  "type": "error",
  "payload": {
    "code": "audio_buffer_too_small | auth_failed | source_disconnected | decode_failed",
    "message": "string",
    "source_id": "cloud.realtime",
    "recoverable": true
  }
}

7. Ordering, Time, and Replay

7.1 Ordering

  • Each source MUST provide strict seq ordering.
  • The bus MUST merge events by (ts_event_ms, seq) for a global order, where seq is the per-source counter from the common header.
  • Consumers MUST tolerate small clock skew across sources.
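One way to build the global order is a k-way merge of the per-source streams. This sketch assumes each stream is already ordered by its own seq and non-decreasing ts_event_ms; using source.id as a final tie-breaker is an assumption added here so that ties across sources resolve the same way on every run. The function name merge_sources is hypothetical.

```python
import heapq


def merge_sources(*streams):
    """Merge per-source event lists into one globally ordered list."""
    def order_key(event):
        # Assumption: source id breaks ties between sources deterministically.
        return (event["ts_event_ms"], event["seq"], event["source"]["id"])

    # heapq.merge expects each input to be sorted by the same key.
    return list(heapq.merge(*streams, key=order_key))
```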

7.2 Time bases

  • ts_event_ms is monotonic time in the app process (milliseconds from a monotonic clock, not wall-clock time).
  • ts_audio_ms references audio stream time (sample-count derived).
  • If ts_audio_ms is unknown, it MAY be set to -1, and consumers SHOULD degrade gracefully.

7.3 Replay

The system SHOULD support recording the event log and replaying it for debugging:

  • Deterministic reproduction of UI behavior
  • Latency attribution
  • Regression tests
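Recording and replay can be as simple as writing one JSON object per line. The log format is not prescribed here, so JSON Lines is an assumption of this sketch, as are the function names record and replay.

```python
import json


def record(events, fp):
    """Write events to a text file object, one JSON object per line."""
    for event in events:
        fp.write(json.dumps(event, sort_keys=True) + "\n")


def replay(fp):
    """Yield events back in the order they were recorded."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Feeding the replayed events through the same consumers that handled the live session should reproduce the same derived state, which is what makes the log usable for regression tests.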

8. State Model

8.1 Session

A CTS session begins when audio capture starts and ends when capture stops. Sources MAY attach or detach within a session.

Session phases:

  1. starting: audio pipeline initializing
  2. running: at least one ASR source emitting deltas or commits
  3. degraded: audio running but ASR stalled/disconnected
  4. stopped

8.2 Segments

Segments represent contiguous speech content for a single channel (default: primary).

Segment lifecycle:

  • open (receiving deltas)
  • committed (commit event emitted)
  • closed (no longer receiving deltas; may create next segment)

Segment boundary triggers:

  • pause (detected silence above threshold)
  • vad_end (active→inactive)
  • time_limit (segment too long)
  • explicit (user action)

9. CommitGate Rules (Normative)

CommitGate decides when to emit caption.commit.

9.1 Primary triggers

A commit SHOULD occur when any of the following holds:

  1. Pause trigger: silence duration ≥ PAUSE_MS and there is uncommitted text.
  2. VAD end: VAD transitions active→inactive and there is uncommitted text.
  3. Time limit: OPEN_SEGMENT_MAX_MS exceeded.
  4. Explicit flush: user/system requests flush (e.g., switching sources).

9.2 Guard conditions

A commit MUST NOT be emitted if:

  • Uncommitted text is empty or whitespace.
  • The most recent audio buffer duration is less than MIN_COMMIT_AUDIO_MS (prevents “commit_empty” errors).
  • A cooldown window is active to prevent commit storms.

9.3 Recommended parameters

  • MIN_COMMIT_AUDIO_MS: 120ms (>= vendor min; adjust per engine)
  • PAUSE_MS: 350–600ms (tunable)
  • OPEN_SEGMENT_MAX_MS: 6–10s
  • COMMIT_COOLDOWN_MS: 250–500ms
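The triggers and guards of this section can be combined into a single decision function. This is a sketch under stated assumptions, not the CommitGate implementation: the function name commit_reason and its parameter names are hypothetical, the constants are single values picked from inside the ranges listed above, and the priority order among triggers (explicit, then vad_end, then pause, then time_limit) is an assumption because no order is specified.

```python
from typing import Optional

MIN_COMMIT_AUDIO_MS = 120
PAUSE_MS = 450              # picked from the 350-600 ms range
OPEN_SEGMENT_MAX_MS = 8000  # picked from the 6-10 s range
COMMIT_COOLDOWN_MS = 300    # picked from the 250-500 ms range


def commit_reason(*, text: str, buffer_audio_ms: int,
                  ms_since_last_commit: Optional[int], silence_ms: int,
                  vad_ended: bool, open_ms: int,
                  explicit_flush: bool = False) -> Optional[str]:
    """Return a commit_reason string, or None if no commit may be emitted."""
    # Guard conditions: any one of these blocks the commit.
    if not text.strip():
        return None
    if buffer_audio_ms < MIN_COMMIT_AUDIO_MS:
        return None
    if ms_since_last_commit is not None and ms_since_last_commit < COMMIT_COOLDOWN_MS:
        return None
    # Primary triggers (priority order is an assumption of this sketch).
    if explicit_flush:
        return "explicit"
    if vad_ended:
        return "vad_end"
    if silence_ms >= PAUSE_MS:
        return "pause"
    if open_ms > OPEN_SEGMENT_MAX_MS:
        return "time_limit"
    return None
```

Passing ms_since_last_commit=None means no commit has happened yet in the session, so the cooldown guard does not apply.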

9.4 Switching sources

When switching ASR sources (e.g., cloud→local):

  • Emit transport.status update.
  • CommitGate SHOULD flush any open segment with commit_reason="explicit".

10. NOW Pane Contract (Normative)

NOW pane is a rendered view of the open segment, optimized for immediacy.

10.1 Display rules

  • NOW MUST display the latest caption.delta content for the current open segment.
  • NOW SHOULD update at token-level granularity when possible.
  • NOW MAY show interim instability (corrections), but MUST differentiate uncommitted status (e.g., subtle styling, caret).

10.2 Aging-out rule (critical)

When content scrolls/ages out of NOW:

  • The aged-out lines MUST be appended to History only if they have been committed, OR if a controlled policy allows “provisional history” (disabled by default).

Default: History only contains commits.


11. History Contract (Normative)

History is the durable transcript and MUST be internally consistent.

11.1 Write model

  • History is append-only by default.
  • Each append corresponds to a caption.commit.
  • Each entry MUST include:
    • commit_id, source, span (ts_audio_start_ms and ts_audio_end_ms), text, and optionally tokens.
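An append-only History can be sketched as a small class that validates required fields and never exposes a mutable view. The class name History and its methods are hypothetical, and ignoring a repeated commit_id (rather than raising) is an assumption of this sketch.

```python
class History:
    """Append-only transcript buffer; each entry mirrors one caption.commit."""

    REQUIRED_FIELDS = ("commit_id", "source", "span", "text")

    def __init__(self):
        self._entries = []
        self._commit_ids = set()

    def append(self, entry):
        """Append one entry; return False if its commit_id was already stored."""
        missing = [k for k in self.REQUIRED_FIELDS if k not in entry]
        if missing:
            raise ValueError(f"history entry missing fields: {missing}")
        if entry["commit_id"] in self._commit_ids:
            return False  # assumption: duplicates are ignored, not errors
        self._commit_ids.add(entry["commit_id"])
        self._entries.append(dict(entry))  # store a copy of the caller's dict
        return True

    def entries(self):
        """Return an immutable snapshot of the stored entries."""
        return tuple(self._entries)
```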

11.2 Corrections policy

Corrections MAY occur only by explicit operations:

  • history.amend (future extension), which MUST reference prior commit_id and preserve audit trail.

No silent rewrite.


12. SoundEvent Handling

When a sound.event occurs:

  • UI SHOULD display a transient “pill” alert (top-left of NOW pane).
  • Pill animation MAY encode:
    • strength → amplitude
    • duration → animation length
    • label/pattern → iconography
  • A cooldown SHOULD prevent spam (cooldown_ms).

SoundEvents MAY also be logged in History as bracketed events:

  • Example: [clink] or [laughter] aligned near ts_audio_ms.
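The pill cooldown could be enforced with a small filter in front of the UI. In this sketch the class name PillCooldown is hypothetical, and tracking the cooldown separately per label is an assumption (a single global cooldown would also satisfy the text above).

```python
class PillCooldown:
    """Suppresses repeated pills for the same label within cooldown_ms."""

    def __init__(self):
        self._last_shown_ms = {}

    def should_show(self, label, ts_event_ms, cooldown_ms):
        last = self._last_shown_ms.get(label)
        if last is not None and ts_event_ms - last < cooldown_ms:
            return False  # still inside the cooldown window for this label
        self._last_shown_ms[label] = ts_event_ms
        return True
```

With cooldown_ms=600, a clink at 0 ms is shown, a clink at 300 ms is suppressed, and a clink at 600 ms is shown again.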

13. Buffering and Audio Ingestion

13.1 Circular buffer

Maintain a circular buffer of recent audio frames to support:

  • waveform/FFT visualization
  • alignment of captions with audio time
  • replay windows for debugging
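A bounded deque is one simple way to hold the most recent frames. The class name FrameRing, the (ts_audio_ms, samples) tuple layout, and the window query are all hypothetical; a production audio path would more likely use a preallocated sample buffer.

```python
from collections import deque


class FrameRing:
    """Keeps only the most recent capacity_frames audio frames."""

    def __init__(self, capacity_frames):
        # deque(maxlen=...) drops the oldest frame when a new one is pushed.
        self._frames = deque(maxlen=capacity_frames)

    def push(self, ts_audio_ms, samples):
        self._frames.append((ts_audio_ms, samples))

    def window(self, start_ms, end_ms):
        """Return frames whose start time falls in [start_ms, end_ms)."""
        return [f for f in self._frames if start_ms <= f[0] < end_ms]

    def __len__(self):
        return len(self._frames)
```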

13.2 Minimum buffer constraints

Engines with commit-based APIs may require minimum audio durations. CTS MUST enforce MIN_COMMIT_AUDIO_MS to avoid empty commits.


14. Source Integration Requirements

14.1 ASR sources

An ASR source MUST:

  • Emit caption.delta and/or caption.commit.
  • Provide monotonically increasing seq.
  • Provide source metadata (id, version, session_id).
  • Provide best-effort ts_audio_ms.

14.2 SID source

SID MUST:

  • Emit sound.event with strength and duration.

SID MAY:

  • Emit “envelope chips” as a separate event type (future: sound.envelope).

14.3 Enrichment sources (optional)

Enrichment sources MUST NOT overwrite commits. They MAY:

  • emit parallel channel (channel="enriched")
  • suggest punctuation
  • suggest entity normalization

15. Observability & Metrics

CTS implementations SHOULD expose:

  • event throughput (events/sec)
  • ASR latency (audio→delta, audio→commit)
  • stall detection (time since last delta/commit)
  • VAD duty cycle
  • SoundEvent counts by label
  • history growth rates

Minimum required logs:

  • status transitions
  • errors with codes
  • source connect/disconnect
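Stall detection, as listed among the metrics above, can be derived from the event stream alone. This sketch assumes a hypothetical threshold parameter stall_after_ms and a hypothetical class name StallDetector; it does not account for VAD state, so a quiet room would also read as a stall unless the caller gates on vad.state.

```python
class StallDetector:
    """Reports a stall when no caption event has arrived for stall_after_ms."""

    def __init__(self, stall_after_ms):
        self.stall_after_ms = stall_after_ms
        self._last_caption_ms = None

    def observe(self, event):
        # Only caption events count as ASR activity.
        if event["type"] in ("caption.delta", "caption.commit"):
            self._last_caption_ms = event["ts_event_ms"]

    def is_stalled(self, now_ms):
        if self._last_caption_ms is None:
            return False  # nothing received yet; treated as "starting"
        return now_ms - self._last_caption_ms >= self.stall_after_ms
```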

16. Error Handling (Normative)

16.1 Classification

Errors MUST be categorized with:

  • code
  • recoverable
  • source_id

16.2 User-visible behavior

Recoverable errors SHOULD:

  • transition status to degraded
  • keep UI running
  • show explicit markers: e.g., “ASR stalled” (do not fake captions)

Non-recoverable errors SHOULD:

  • stop affected source
  • keep other sources running if possible
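The mapping from an error event to a transport.status payload might look like the sketch below. The function name handle_error is hypothetical, and the state chosen after a non-recoverable error ("degraded" while other sources remain, "stopped" when none do) is an assumption, since the rules above do not name it.

```python
def handle_error(active_sources, error_payload):
    """Return a transport.status payload reflecting one error payload."""
    if error_payload["recoverable"]:
        # Recoverable: keep every source attached and mark the session degraded.
        remaining = list(active_sources)
        state = "degraded"
    else:
        # Non-recoverable: stop only the affected source.
        remaining = [s for s in active_sources if s != error_payload["source_id"]]
        # Assumption: "stopped" only when no source is left.
        state = "degraded" if remaining else "stopped"
    return {
        "state": state,
        "details": error_payload["code"],
        "active_sources": remaining,
    }
```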

17. Privacy & Data Handling

  • Default mode SHOULD be local-first.
  • Cloud sources MUST be opt-in and explicitly identified in provenance.
  • Event logs SHOULD be stored locally unless user exports.
  • If exporting, provenance and timestamps MUST be preserved.

18. Conformance Levels

Level A (Minimum viable CTS)

  • caption.delta
  • caption.commit
  • transport.status
  • basic ordering with seq
  • NOW + History via commits

Level B (Enhanced)

  • vad.state gating
  • sound.event
  • stall detection
  • replay logging

Level C (Advanced)

  • multi-channel (primary/enriched)
  • diarization extension
  • explicit history amendments with audit trail

19. Test Plan (Acceptance)

19.1 Functional

  1. Typewriter test: speech produces deltas continuously without waiting for pause.
  2. Pause commit test: pause triggers commit; history appends once.
  3. No empty commits: short buffers never cause commit_empty errors.
  4. Source switch: switching engines flushes open segments properly.
  5. SoundEvent pill: pill appears on sound events with cooldown.
  6. Stall handling: ASR stall marks degraded state; no hallucinated output.

19.2 Stress

  • Background noise + speech
  • Media playback (audiobook) + ambient sounds
  • Long sessions (30–120 minutes)
  • Rapid start/stop cycles

19.3 Determinism

  • Record event log, replay, verify identical History output and segment boundaries.

20. Extensions (Reserved)

Potential future event types (non-breaking):

  • caption.segment.open / caption.segment.close
  • sound.envelope (chips)
  • speaker.tag (diarization)
  • history.amend (explicit edits with audit trail)
  • export.request / export.done

Appendix A: Minimal Swift Protocol Sketch (Non-normative)

Informative only; not required for conformance.

  • CaptionBus publishes TransportEvent
  • Producers implement TransportProducer
  • Consumers implement TransportConsumer or subscribe via Combine/AsyncStream

Appendix B: Example Human-readable History Rendering

  • 10:41:12.340 Hello world.
  • 10:41:13.120 [clink]
  • 10:41:14.002 Can you hear me okay?
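Lines like the ones above could be produced by offsetting a wall-clock session start by each entry's audio time. The events carry no wall-clock field, so the session_start argument is an assumption of this sketch, as is the function name render_line.

```python
from datetime import datetime, timedelta


def render_line(session_start, ts_audio_ms, text):
    """Format one History entry as 'HH:MM:SS.mmm text'."""
    # Assumption: session_start is the wall-clock time at ts_audio_ms == 0.
    t = session_start + timedelta(milliseconds=ts_audio_ms)
    return f"{t:%H:%M:%S}.{t.microsecond // 1000:03d} {text}"
```

Sound events can reuse the same function by passing the bracketed label, for example "[clink]", as the text.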