Stokoe Caption Transport Specification (CTS)

1. Purpose

The Caption Transport Specification (CTS) defines an event-driven, source-agnostic transport layer for real-time captions and acoustic context. CTS is designed to carry incremental speech text, sound events, and timing metadata from one or more recognition sources to multiple consumers (UI panes, history buffers, storage, export, analytics) with low latency, traceable provenance, and stable failure modes.

CTS treats captions as first-class time-series events, not as an incidental output of an AI response system.

2. Goals

Low-latency incremental display (typewriter-like behavior).
Source-agnostic ingestion (local ASR, cloud ASR, hybrid).
Deterministic commit and history semantics (no silent retroactive rewrites).
Event-driven integration with sound classification (clink, laugh, overlap, etc.).
Multiple consumers (NOW pane, History, Cinema, logs) without tight coupling.
Auditable provenance (which engine produced which tokens at what time).
Graceful degradation when ASR stalls or audio is unavailable.
Explicit boundary handling (pause events, VAD gates, turn boundaries).

3. Non-goals

Defining or prescribing a specific ASR model or vendor.
Speaker diarization requirements (supported as an optional extension).
Natural language post-processing (punctuation/grammar) beyond optional enrichment.
UI layout and styling (CTS specifies contracts, not presentation).

4. Terminology

Frame: A chunk of audio samples processed as a unit.
Envelope: A short-term energy summary over frames (often used for visual chips).
VAD: Voice Activity Detection.
NOW: The immediate, streaming display region (ephemeral).
History: The durable transcript buffer (scrollable / persisted).
Candidate: An unfinalized text segment that may be revised.
Commit: A durable acceptance of content into History.
SoundEvent: A non-speech acoustic classification event (clink, laugh, etc.).
Transport Event: A CTS event emitted on the caption bus.
Producer: An engine emitting CTS events (local ASR, cloud, SID).
Consumer: UI/persistence logic subscribing to CTS events.

5. Architecture

5.1 Overview

CTS is a publish/subscribe event bus. Producers emit events; consumers subscribe and derive views and persistence.

Producers

Local ASR (Apple Speech)
Cloud ASR (Realtime transcription, batch, etc.)
SID/Sound classifier (SoundEvent + envelope chips)
Optional: diarizer, language model enrichment

Core

CaptionBus (in-process)
CaptionReducer (optional): deterministic state updates from events
CommitGate: rules for when candidates become commits

Consumers

NOW pane renderer
History buffer
Inspector/metrics
Cinema/conference view
Exporters (JSON, SRT/VTT, plaintext, PDF later)

5.2 Determinism rule

Consumers MUST be able to reconstruct derived state from an ordered event log:

Same input event sequence → same derived state.
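
A minimal sketch of the determinism rule, assuming a hypothetical reduce_events function and DerivedState shape (neither is defined by CTS). The reducer reads no clocks and no randomness, so folding the same ordered log twice yields equal state. Deltas are treated as full replacements here for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedState:
    # Latest uncommitted text per segment_id (what NOW would render).
    open_segments: dict = field(default_factory=dict)
    # Committed texts in arrival order (what History would hold).
    history: list = field(default_factory=list)


def reduce_events(events):
    """Fold an ordered event sequence into derived state (hypothetical helper)."""
    state = DerivedState()
    for ev in events:
        payload = ev["payload"]
        if ev["type"] == "caption.delta":
            state.open_segments[payload["segment_id"]] = payload["text"]
        elif ev["type"] == "caption.commit":
            # A commit supersedes uncommitted delta content for its segment.
            state.open_segments.pop(payload["segment_id"], None)
            state.history.append(payload["text"])
    return state


log = [
    {"type": "caption.delta", "payload": {"segment_id": "s1", "text": "hello worl"}},
    {"type": "caption.commit", "payload": {"segment_id": "s1", "text": "Hello world."}},
    {"type": "caption.delta", "payload": {"segment_id": "s2", "text": "can you"}},
]
first = reduce_events(log)
second = reduce_events(log)
```

Comparing first and second for equality is the basis of the determinism check in the test plan.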

6. Event Model

All events share a common header, plus typed payload.

6.1 Common Header (required)

{
  "event_id": "uuid",
  "type": "caption.delta | caption.commit | vad.state | sound.event | transport.status | error",
  "ts_event_ms": 0,
  "ts_audio_ms": 0,
  "source": {
    "id": "local.appleSpeech | cloud.realtime | sid",
    "kind": "asr | sid | diarizer | enrichment",
    "version": "string",
    "session_id": "string"
  },
  "seq": 0
}

Fields

event_id: unique identifier (UUID).
type: event type enum.
ts_event_ms: monotonic time when emitted.
ts_audio_ms: audio timeline reference (best-effort).
source: provenance.
seq: monotonically increasing integer per source (strict ordering).
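
The header fields above can be modelled as follows. This is a sketch only: the class names Source, Header, and HeaderFactory are illustrative assumptions, and the factory shows one way a producer might guarantee a strictly increasing seq per source.

```python
import uuid
from dataclasses import dataclass, asdict

EVENT_TYPES = {
    "caption.delta", "caption.commit", "vad.state",
    "sound.event", "transport.status", "error",
}


@dataclass(frozen=True)
class Source:
    id: str
    kind: str
    version: str
    session_id: str


@dataclass(frozen=True)
class Header:
    event_id: str
    type: str
    ts_event_ms: int
    ts_audio_ms: int
    source: Source
    seq: int


class HeaderFactory:
    """Stamps headers for one source with a strictly increasing seq (hypothetical)."""

    def __init__(self, source):
        self.source = source
        self._next_seq = 0

    def make(self, event_type, ts_event_ms, ts_audio_ms=-1):
        # ts_audio_ms defaults to -1, the "unknown" value allowed by section 7.2.
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        header = Header(str(uuid.uuid4()), event_type, ts_event_ms,
                        ts_audio_ms, self.source, self._next_seq)
        self._next_seq += 1
        return header


factory = HeaderFactory(Source("local.appleSpeech", "asr", "1.0", "session-1"))
h0 = factory.make("caption.delta", ts_event_ms=1000, ts_audio_ms=980)
h1 = factory.make("caption.commit", ts_event_ms=1400)
header_json = asdict(h0)  # nested dict, ready for JSON serialization
```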

6.2 Event Types

6.2.1 `caption.delta`

Incremental text updates for NOW display.

{
  "type": "caption.delta",
  "payload": {
    "segment_id": "uuid",
    "range": { "start": 0, "end": 12 },
    "text": "hello worl",
    "tokens": [
      {"t":"hello","conf":0.92,"ts_audio_ms":1234},
      {"t":"worl","conf":0.70,"ts_audio_ms":1450}
    ],
    "is_partial": true,
    "stability": 0.0,
    "channel": "primary",
    "language": "en-US"
  }
}

Rules

Deltas MAY revise previously emitted text for the same segment_id.
Consumers MUST treat deltas as ephemeral unless committed.
range is optional and describes where this delta applies within the segment; when it is omitted, text replaces the segment's full content.
stability is an optional heuristic (0.0–1.0); higher values mean the text is less likely to be revised.
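
A sketch of how a consumer might apply a delta to the open segment's text. CTS does not define the units of range, so this example assumes half-open character offsets into the current segment text; the function name apply_delta is hypothetical.

```python
def apply_delta(current_text, payload):
    """Return the new uncommitted text for a segment after one caption.delta.

    Assumption: range is {"start", "end"} in character offsets, end exclusive.
    Without a range, the delta replaces the whole segment text.
    """
    rng = payload.get("range")
    if rng is None:
        return payload["text"]
    return current_text[:rng["start"]] + payload["text"] + current_text[rng["end"]:]


text = ""
text = apply_delta(text, {"text": "hello"})                                    # full replace
text = apply_delta(text, {"text": " word", "range": {"start": 5, "end": 5}})   # append
text = apply_delta(text, {"text": "world", "range": {"start": 6, "end": 10}})  # revise last word
```

After the three deltas the segment text reads "hello world"; it remains ephemeral until a caption.commit arrives.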

6.2.2 `caption.commit`

Durable acceptance into History.

{
  "type": "caption.commit",
  "payload": {
    "commit_id": "uuid",
    "segment_id": "uuid",
    "text": "Hello world.",
    "tokens": [
      {"t":"Hello","conf":0.94,"ts_audio_ms":1234},
      {"t":"world.","conf":0.93,"ts_audio_ms":1450}
    ],
    "final": true,
    "commit_reason": "pause | vad_end | explicit | time_limit",
    "span": { "ts_audio_start_ms": 1200, "ts_audio_end_ms": 1800 }
  }
}

Rules

A commit MUST NOT silently delete prior commits.
A commit MAY supersede uncommitted delta content for the same segment_id.
Commit is the authoritative payload for History.

6.2.3 `vad.state`

VAD state transitions for gating and UX.

{
  "type": "vad.state",
  "payload": {
    "state": "inactive | active",
    "confidence": 0.0,
    "ts_audio_ms": 0
  }
}

6.2.4 `sound.event`

Non-speech sound classification.

{
  "type": "sound.event",
  "payload": {
    "sound_id": "uuid",
    "label": "clink | laugh | applause | cough | knock | music | overlap | unknown",
    "strength": 0.0,
    "duration_ms": 0,
    "features": {
      "centroid_hz": 0.0,
      "bandwidth_hz": 0.0,
      "rms": 0.0
    },
    "ui_hint": {
      "glyph": "🥂",
      "pattern": "glass | metal | soft",
      "cooldown_ms": 600
    }
  }
}

Rules

Sound events MUST be emitted even when ASR is offline, provided the audio pipeline is running.
Consumers MAY show a top-left “pill” alert for these events.

6.2.5 `transport.status`

Connectivity / health events.

{
  "type": "transport.status",
  "payload": {
    "state": "starting | running | degraded | stopped",
    "details": "string",
    "active_sources": ["local.appleSpeech","sid"],
    "latency_ms": 0
  }
}

6.2.6 `error`

Errors with explicit codes.

{
  "type": "error",
  "payload": {
    "code": "audio_buffer_too_small | auth_failed | source_disconnected | decode_failed",
    "message": "string",
    "source_id": "cloud.realtime",
    "recoverable": true
  }
}

7. Ordering, Time, and Replay

7.1 Ordering

Each source MUST provide strict seq ordering.
The bus MUST merge events by (ts_event_ms, seq) for a global order, where seq is the per-source sequence number from the common header.
Consumers MUST tolerate small clock skew across sources.
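
A sketch of the merge step, assuming each source stream is already in its own seq order. The final tie-breaker on source id is an assumption added here so that two events with identical timestamp and seq still sort the same way on every run; CTS itself only names ts_event_ms and seq.

```python
import heapq


def merge_sources(*streams):
    """Merge per-source event streams into one global order (hypothetical helper)."""
    def key(ev):
        # Spec order: ts_event_ms, then seq. Source id is an assumed tie-breaker.
        return (ev["ts_event_ms"], ev["seq"], ev["source"]["id"])
    return list(heapq.merge(*streams, key=key))


def ev(source_id, seq, ts_event_ms):
    # Builds a header-only event for the example.
    return {"source": {"id": source_id}, "seq": seq, "ts_event_ms": ts_event_ms}


asr = [ev("local.appleSpeech", 0, 100), ev("local.appleSpeech", 1, 250)]
sid = [ev("sid", 0, 100), ev("sid", 1, 180)]
merged = merge_sources(asr, sid)
order = [(e["source"]["id"], e["seq"]) for e in merged]
```

heapq.merge is lazy and stable, which suits an in-process bus that drains several producer queues.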

7.2 Time bases

ts_event_ms is monotonic time in the app process, taken when the event is emitted.
ts_audio_ms references audio stream time (derived from the sample count).
If ts_audio_ms is unknown, it MAY be set to -1, and consumers SHOULD degrade gracefully.
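
As an illustration of the audio time base, the sketch below derives ts_audio_ms from a sample index and shows one way a consumer could degrade gracefully on -1. The function names and the placeholder string are assumptions.

```python
UNKNOWN_AUDIO_TS = -1


def audio_ms_from_samples(sample_index, sample_rate_hz):
    """Convert a sample count into milliseconds on the audio timeline."""
    if sample_index is None or sample_index < 0 or sample_rate_hz <= 0:
        return UNKNOWN_AUDIO_TS
    return (sample_index * 1000) // sample_rate_hz


def format_audio_ts(ts_audio_ms):
    """Render mm:ss.mmm, or a placeholder when the audio time is unknown."""
    if ts_audio_ms < 0:
        return "--:--.---"
    minutes, rem = divmod(ts_audio_ms, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


known = audio_ms_from_samples(59_280, 48_000)   # 59,280 samples at 48 kHz
unknown = audio_ms_from_samples(None, 48_000)
```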

7.3 Replay

The system SHOULD support recording the event log and replaying it for debugging:

Deterministic reproduction of UI behavior
Latency attribution
Regression tests
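
One possible recording format is JSON Lines, one event per line. The sketch below is an assumption about storage, not a CTS requirement; it round-trips a log through an in-memory buffer and derives History from the replayed events.

```python
import io
import json


def record(events, fp):
    """Write events to a text stream, one JSON object per line."""
    for ev in events:
        fp.write(json.dumps(ev, sort_keys=True) + "\n")


def replay(fp):
    """Yield events back in recorded order."""
    for line in fp:
        if line.strip():
            yield json.loads(line)


def history_from(events):
    # History is derived from commits only.
    return [ev["payload"]["text"] for ev in events if ev["type"] == "caption.commit"]


live = [
    {"type": "caption.delta", "seq": 0, "payload": {"segment_id": "s1", "text": "hello worl"}},
    {"type": "caption.commit", "seq": 1, "payload": {"segment_id": "s1", "text": "Hello world."}},
]
buffer = io.StringIO()
record(live, buffer)
buffer.seek(0)
replayed = list(replay(buffer))
```

In a regression test, history_from(replayed) would be compared against the History captured during the live session.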

8. State Model

8.1 Session

A CTS session begins when audio capture is started and ends when capture stops. Sources may attach/detach within a session.

Session phases:

starting: audio pipeline initializing
running: at least one ASR source emitting deltas or commits
degraded: audio running but ASR stalled/disconnected
stopped: audio capture has ended

8.2 Segments

Segments represent contiguous speech content for a single channel (default: primary).

Segment lifecycle:

open (receiving deltas)
committed (commit event emitted)
closed (no longer receiving deltas; may create next segment)

Segment boundary triggers:

pause (detected silence above threshold)
vad_end (active→inactive)
time_limit (segment too long)
explicit (user action)

9. CommitGate Rules (Normative)

CommitGate decides when to emit caption.commit.

9.1 Primary triggers

A commit SHOULD occur when any of the following holds:

Pause trigger: silence duration ≥ PAUSE_MS and there is uncommitted text.
VAD end: VAD transitions active→inactive and there is uncommitted text.
Time limit: OPEN_SEGMENT_MAX_MS exceeded.
Explicit flush: user/system requests flush (e.g., switching sources).

9.2 Guard conditions

A commit MUST NOT be emitted if:

Uncommitted text is empty or whitespace.
The most recent audio buffer is shorter than MIN_COMMIT_AUDIO_MS (this prevents “commit_empty” errors).
A cooldown window is active to prevent commit storms.

9.3 Recommended defaults

MIN_COMMIT_AUDIO_MS: 120ms (>= vendor min; adjust per engine)
PAUSE_MS: 350–600ms (tunable)
OPEN_SEGMENT_MAX_MS: 6–10s
COMMIT_COOLDOWN_MS: 250–500ms
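
The triggers, guards, and defaults above can be combined into a small decision function. This is a sketch under assumptions: the class and parameter names are hypothetical, the default values are picked from within the recommended ranges, and the priority among simultaneous triggers (explicit, then vad_end, then pause, then time_limit) is a choice made for the example rather than something CTS specifies.

```python
from dataclasses import dataclass


@dataclass
class CommitGateConfig:
    min_commit_audio_ms: int = 120
    pause_ms: int = 500
    open_segment_max_ms: int = 8000
    commit_cooldown_ms: int = 250


class CommitGate:
    def __init__(self, config=None):
        self.config = config or CommitGateConfig()
        self._last_commit_ms = None

    def decide(self, *, now_ms, text, audio_ms, silence_ms, segment_open_ms,
               vad_ended=False, explicit=False):
        """Return a commit_reason, or None when no commit should be emitted."""
        cfg = self.config
        # Guard conditions (9.2): any one of them blocks the commit.
        if not text.strip():
            return None
        if audio_ms < cfg.min_commit_audio_ms:
            return None
        if (self._last_commit_ms is not None
                and now_ms - self._last_commit_ms < cfg.commit_cooldown_ms):
            return None
        # Primary triggers (9.1), in an assumed priority order.
        if explicit:
            reason = "explicit"
        elif vad_ended:
            reason = "vad_end"
        elif silence_ms >= cfg.pause_ms:
            reason = "pause"
        elif segment_open_ms > cfg.open_segment_max_ms:
            reason = "time_limit"
        else:
            return None
        self._last_commit_ms = now_ms
        return reason


gate = CommitGate()
r_none = gate.decide(now_ms=1000, text="hello", audio_ms=900, silence_ms=100, segment_open_ms=900)
r_pause = gate.decide(now_ms=1500, text="hello world", audio_ms=1400, silence_ms=520, segment_open_ms=1400)
r_cooldown = gate.decide(now_ms=1600, text="ok", audio_ms=300, silence_ms=600, segment_open_ms=100)
r_blank = gate.decide(now_ms=2000, text="   ", audio_ms=400, silence_ms=0, segment_open_ms=400, explicit=True)
r_short = gate.decide(now_ms=2100, text="hi", audio_ms=80, silence_ms=0, segment_open_ms=80, explicit=True)
r_vad = gate.decide(now_ms=2200, text="hi there", audio_ms=400, silence_ms=0, segment_open_ms=400, vad_ended=True)
```

Guards are evaluated before triggers, so even an explicit flush is suppressed when the text is blank, the buffer is too short, or the cooldown is active.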

9.4 Switching sources

When switching ASR sources (e.g., cloud→local):

Emit transport.status update.
CommitGate SHOULD flush any open segment with commit_reason="explicit".

10. NOW Pane Contract (Normative)

NOW pane is a rendered view of the open segment, optimized for immediacy.

10.1 Display rules

NOW MUST display the latest caption.delta content for the current open segment.
NOW SHOULD update at token-level granularity when possible.
NOW MAY show interim instability (corrections), but MUST differentiate uncommitted status (e.g., subtle styling, caret).

10.2 Aging-out rule (critical)

When content scrolls/ages out of NOW:

Aged-out lines MUST NOT be appended to History unless they have been committed or a controlled “provisional history” policy (disabled by default) allows it.

Default: History only contains commits.

11. History Contract (Normative)

History is the durable transcript and MUST be internally consistent.

11.1 Write model

History is append-only by default.
Each append corresponds to a caption.commit.
Each entry MUST include commit_id, source, the ts_audio span, and text; tokens are optional.

11.2 Corrections policy

Corrections MAY occur only by explicit operations:

history.amend (future extension), which MUST reference prior commit_id and preserve audit trail.

No silent rewrite.
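
A sketch of an append-only History store that accepts only caption.commit events and exposes entries as immutable records. The class names and the choice to store token strings as a tuple are assumptions; history.amend is not modelled because its payload is not yet defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    commit_id: str
    source_id: str
    ts_audio_start_ms: int
    ts_audio_end_ms: int
    text: str
    tokens: tuple = ()


class History:
    def __init__(self):
        self._entries = []

    def append_commit(self, event):
        """Append one entry per caption.commit; reject every other event type."""
        if event["type"] != "caption.commit":
            raise ValueError("History only accepts caption.commit events")
        p = event["payload"]
        entry = HistoryEntry(
            commit_id=p["commit_id"],
            source_id=event["source"]["id"],
            ts_audio_start_ms=p["span"]["ts_audio_start_ms"],
            ts_audio_end_ms=p["span"]["ts_audio_end_ms"],
            text=p["text"],
            tokens=tuple(t["t"] for t in p.get("tokens", [])),
        )
        self._entries.append(entry)
        return entry

    @property
    def entries(self):
        # A tuple copy: callers cannot reorder or delete prior commits.
        return tuple(self._entries)


history = History()
history.append_commit({
    "type": "caption.commit",
    "source": {"id": "local.appleSpeech"},
    "payload": {
        "commit_id": "c1", "segment_id": "s1", "text": "Hello world.",
        "tokens": [{"t": "Hello"}, {"t": "world."}],
        "span": {"ts_audio_start_ms": 1200, "ts_audio_end_ms": 1800},
    },
})
```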

12. SoundEvent UX Contract (Recommended)

When a sound.event occurs:

UI SHOULD display a transient “pill” alert (top-left of NOW pane).
Pill animation MAY encode:
- strength → amplitude
- duration → animation length
- label/pattern → iconography
A cooldown SHOULD prevent spam (cooldown_ms).

Implementations SHOULD also offer an option to log SoundEvents in History as bracketed events:

Example: [clink] or [laughter] aligned near ts_audio_ms.
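
A sketch of the cooldown and the bracketed History marker. Applying the cooldown per label is an assumption (CTS only says a cooldown should prevent spam), as are the names PillThrottle and history_marker. The marker uses the raw label; a renderer could map labels to friendlier words.

```python
class PillThrottle:
    """Decides whether a sound.event should raise a pill, honoring cooldown_ms."""

    def __init__(self, default_cooldown_ms=600):
        self.default_cooldown_ms = default_cooldown_ms
        self._last_shown_ms = {}  # label -> ts_event_ms of the last pill shown

    def should_show(self, event):
        payload = event["payload"]
        label = payload["label"]
        cooldown = payload.get("ui_hint", {}).get("cooldown_ms", self.default_cooldown_ms)
        now = event["ts_event_ms"]
        last = self._last_shown_ms.get(label)
        if last is not None and now - last < cooldown:
            return False
        self._last_shown_ms[label] = now
        return True


def history_marker(event):
    """Bracketed text for optional logging in History."""
    return f"[{event['payload']['label']}]"


def sound(label, ts_event_ms):
    return {"type": "sound.event", "ts_event_ms": ts_event_ms,
            "payload": {"label": label, "ui_hint": {"cooldown_ms": 600}}}


throttle = PillThrottle()
shown = [throttle.should_show(e) for e in (
    sound("clink", 1000), sound("clink", 1300), sound("laugh", 1350), sound("clink", 1700),
)]
marker = history_marker(sound("clink", 1000))
```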

13. Buffering and Audio Ingestion

13.1 Ring buffer (recommended)

Maintain a circular buffer of recent audio frames to support:

waveform/FFT visualization
alignment of captions with audio time
replay windows for debugging
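
A minimal ring buffer sketch using a bounded deque, which drops the oldest frame automatically once capacity is reached. The class name, the (ts_audio_ms, samples) frame shape, and the 20 ms frame spacing in the example are assumptions.

```python
from collections import deque


class FrameRingBuffer:
    def __init__(self, capacity_frames):
        # deque with maxlen discards the oldest item on overflow.
        self._frames = deque(maxlen=capacity_frames)

    def push(self, ts_audio_ms, samples):
        self._frames.append((ts_audio_ms, samples))

    def window(self, start_ms, end_ms):
        """Frames whose start time falls in [start_ms, end_ms)."""
        return [f for f in self._frames if start_ms <= f[0] < end_ms]

    def __len__(self):
        return len(self._frames)


ring = FrameRingBuffer(capacity_frames=3)
for ts in (0, 20, 40, 60):
    ring.push(ts, [0.0] * 4)   # four dummy samples per frame
recent = [ts for ts, _ in ring.window(20, 60)]
```

The frame at 0 ms has been evicted by the time the fourth frame is pushed, so only frames at 20, 40, and 60 ms remain available for visualization or replay.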

13.2 Minimum buffer constraints

Engines with commit-based APIs may require minimum audio durations. CTS MUST enforce MIN_COMMIT_AUDIO_MS to avoid empty commits.

14. Source Integration Requirements

14.1 ASR sources

An ASR source MUST:

Emit caption.delta and/or caption.commit.
Provide monotonically increasing seq.
Provide source metadata (id, version, session_id).
Provide best-effort ts_audio_ms.

14.2 SID source

SID MUST:

Emit sound.event with strength and duration.

SID MAY:

Emit “envelope chips” as a separate event type (future: sound.envelope).

14.3 Enrichment sources (optional)

Enrichment sources MUST NOT overwrite commits. They MAY:

emit a parallel channel (channel="enriched")
suggest punctuation
suggest entity normalization

15. Observability & Metrics

CTS implementations SHOULD expose:

event throughput (events/sec)
ASR latency (audio→delta, audio→commit)
stall detection (time since last delta/commit)
VAD duty cycle
SoundEvent counts by label
history growth rates

Minimum required logs:

status transitions
errors with codes
source connect/disconnect
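
Stall detection can be sketched as a timer reset by caption events. The class name and the 3000 ms threshold are assumptions; CTS does not recommend a stall threshold.

```python
class StallDetector:
    """Tracks time since the last caption.delta or caption.commit."""

    CAPTION_TYPES = ("caption.delta", "caption.commit")

    def __init__(self, started_ms, stall_after_ms=3000):
        self.stall_after_ms = stall_after_ms
        self._last_caption_ms = started_ms

    def observe(self, event):
        # Only caption events reset the timer; sound events do not.
        if event["type"] in self.CAPTION_TYPES:
            self._last_caption_ms = event["ts_event_ms"]

    def ms_since_last_caption(self, now_ms):
        return now_ms - self._last_caption_ms

    def is_stalled(self, now_ms):
        return self.ms_since_last_caption(now_ms) >= self.stall_after_ms


detector = StallDetector(started_ms=0)
detector.observe({"type": "caption.delta", "ts_event_ms": 1000})
detector.observe({"type": "sound.event", "ts_event_ms": 3500})
before = detector.is_stalled(now_ms=3900)
after = detector.is_stalled(now_ms=4200)
```

A True result would typically be reported as a transport.status event with state "degraded".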

16. Error Handling (Normative)

16.1 Classification

Errors MUST be categorized with:

code
recoverable
source_id

16.2 User-visible behavior

Recoverable errors SHOULD:

transition status to degraded
keep UI running
show explicit markers: e.g., “ASR stalled” (do not fake captions)

Non-recoverable errors SHOULD:

stop affected source
keep other sources running if possible

17. Privacy & Data Handling

Default mode SHOULD be local-first.
Cloud sources MUST be opt-in and explicitly identified in provenance.
Event logs SHOULD be stored locally unless user exports.
If exporting, provenance and timestamps MUST be preserved.

18. Conformance Levels

Level A (Minimum viable CTS)

caption.delta
caption.commit
transport.status
basic ordering with seq
NOW + History via commits

Level B (Enhanced)

vad.state gating
sound.event
stall detection
replay logging

Level C (Advanced)

multi-channel (primary/enriched)
diarization extension
explicit history amendments with audit trail

19. Test Plan (Acceptance)

19.1 Functional

Typewriter test: speech produces deltas continuously without waiting for pause.
Pause commit test: pause triggers commit; history appends once.
No empty commits: short buffers never cause commit_empty errors.
Source switch: switching engines flushes open segments properly.
SoundEvent pill: pill appears on sound events with cooldown.
Stall handling: ASR stall marks degraded state; no hallucinated output.

19.2 Stress

Background noise + speech
Media playback (audiobook) + ambient sounds
Long sessions (30–120 minutes)
Rapid start/stop cycles

19.3 Determinism

Record event log, replay, verify identical History output and segment boundaries.

20. Extensions (Reserved)

Potential future event types (non-breaking):

caption.segment.open / caption.segment.close
sound.envelope (chips)
speaker.tag (diarization)
history.amend (explicit edits with audit trail)
export.request / export.done

Appendix A: Minimal Swift Protocol Sketch (Non-normative)

Informative only; not required for conformance.

CaptionBus publishes TransportEvent
Producers implement TransportProducer
Consumers implement TransportConsumer or subscribe via Combine/AsyncStream

Appendix B: Example Human-readable History Rendering

10:41:12.340 Hello world.
10:41:13.120 [clink]
10:41:14.002 Can you hear me okay?
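
A rendering like the one above could be produced as follows. The sketch assumes a hypothetical session_start wall-clock value to which each entry's audio time is added; CTS does not define how wall-clock time is attached to History entries, and the date used is arbitrary.

```python
from datetime import datetime, timedelta


def render_history(entries, session_start):
    """entries: iterable of (ts_audio_ms, text); returns one line per entry."""
    lines = []
    for ts_audio_ms, text in sorted(entries, key=lambda e: e[0]):
        stamp = session_start + timedelta(milliseconds=ts_audio_ms)
        lines.append(f"{stamp:%H:%M:%S}.{stamp.microsecond // 1000:03d} {text}")
    return "\n".join(lines)


session_start = datetime(2024, 1, 1, 10, 41, 11, 140000)  # assumed capture start
rendered = render_history(
    [(1200, "Hello world."), (1980, "[clink]"), (2862, "Can you hear me okay?")],
    session_start,
)
```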
