Stokoe Caption Transport Specification (CTS)
An event-driven, source-agnostic transport layer for real-time captions and acoustic context
1. Purpose
The Caption Transport Specification (CTS) defines an event-driven, source-agnostic transport layer for real-time captions and acoustic context. CTS is designed to carry incremental speech text, sound events, and timing metadata from one or more recognition sources to multiple consumers (UI panes, history buffers, storage, export, analytics) with low latency, traceable provenance, and stable failure modes.
CTS treats captions as first-class time-series events, not as an incidental output of an AI response system.
2. Goals
- Low-latency incremental display (typewriter-like behavior).
- Source-agnostic ingestion (local ASR, cloud ASR, hybrid).
- Deterministic commit and history semantics (no silent retroactive rewrites).
- Event-driven integration with sound classification (clink, laugh, overlap, etc.).
- Multiple consumers (NOW pane, History, Cinema, logs) without tight coupling.
- Auditable provenance (which engine produced which tokens at what time).
- Graceful degradation when ASR stalls or audio is unavailable.
- Explicit boundary handling (pause events, VAD gates, turn boundaries).
3. Non-goals
- Defining or prescribing a specific ASR model or vendor.
- Speaker diarization requirements (supported as optional extension).
- Natural language post-processing (punctuation/grammar) beyond optional enrichment.
- UI layout and styling (CTS specifies contracts, not presentation).
4. Terminology
- Frame: A chunk of audio samples processed as a unit.
- Envelope: A short-term energy summary over frames (often used for visual chips).
- VAD: Voice Activity Detection.
- NOW: The immediate, streaming display region (ephemeral).
- History: The durable transcript buffer (scrollable / persisted).
- Candidate: An unfinalized text segment that may be revised.
- Commit: A durable acceptance of content into History.
- SoundEvent: A non-speech acoustic classification event (clink, laugh, etc.).
- Transport Event: A CTS event emitted on the caption bus.
- Producer: An engine emitting CTS events (local ASR, cloud, SID).
- Consumer: UI/persistence logic subscribing to CTS events.
5. Architecture
5.1 Overview
CTS is a publish/subscribe event bus. Producers emit events; consumers subscribe and derive views and persistence.
Producers
- Local ASR (Apple Speech)
- Cloud ASR (Realtime transcription, batch, etc.)
- SID/Sound classifier (SoundEvent + envelope chips)
- Optional: diarizer, language model enrichment
Core
- CaptionBus (in-process)
- CaptionReducer (optional): deterministic state updates from events
- CommitGate: rules for when candidates become commits
Consumers
- NOW pane renderer
- History buffer
- Inspector/metrics
- Cinema/conference view
- Exporters (JSON, SRT/VTT, plaintext, PDF later)
5.2 Determinism rule
Consumers MUST be able to reconstruct derived state from an ordered event log:
Same input event sequence → same derived state.
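A minimal sketch of the determinism rule, assuming a hypothetical `reduce_events` function and a simplified `DerivedState` shape (neither name is defined by CTS). Derived state is computed as a pure fold over the ordered log, so replaying the same sequence yields an equal result.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedState:
    """Hypothetical derived view: open-segment text plus committed history."""
    now_text: dict = field(default_factory=dict)  # segment_id -> latest delta text
    history: list = field(default_factory=list)   # committed texts, in order


def reduce_events(events):
    """Fold an ordered event log into derived state without side effects."""
    state = DerivedState()
    for ev in events:
        payload = ev["payload"]
        if ev["type"] == "caption.delta":
            state.now_text[payload["segment_id"]] = payload["text"]
        elif ev["type"] == "caption.commit":
            # A commit supersedes the uncommitted delta for its segment.
            state.now_text.pop(payload["segment_id"], None)
            state.history.append(payload["text"])
    return state


log = [
    {"type": "caption.delta", "payload": {"segment_id": "s1", "text": "hello worl"}},
    {"type": "caption.commit", "payload": {"segment_id": "s1", "text": "Hello world."}},
]
first = reduce_events(log)
second = reduce_events(log)  # same input sequence, so an equal state
```

Keeping the reducer free of clocks, randomness, and I/O is what makes the replay comparison meaningful.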
6. Event Model
All events share a common header, plus typed payload.
6.1 Common Header (required)
{
"event_id": "uuid",
"type": "caption.delta | caption.commit | vad.state | sound.event | transport.status | error",
"ts_event_ms": 0,
"ts_audio_ms": 0,
"source": {
"id": "local.appleSpeech | cloud.realtime | sid",
"kind": "asr | sid | diarizer | enrichment",
"version": "string",
"session_id": "string"
},
"seq": 0
}
Fields
- event_id: unique identifier (UUID).
- type: event type enum.
- ts_event_ms: monotonic time when emitted.
- ts_audio_ms: audio timeline reference (best-effort).
- source: provenance.
- seq: monotonically increasing integer per source (strict ordering).
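One way to model the common header in code, as a sketch only. The class names `Header` and `Source` and the validation choices (rejecting unknown types and negative `seq`) are assumptions for illustration, not requirements of CTS.

```python
import uuid
from dataclasses import dataclass

EVENT_TYPES = {
    "caption.delta", "caption.commit", "vad.state",
    "sound.event", "transport.status", "error",
}


@dataclass(frozen=True)
class Source:
    id: str
    kind: str
    version: str
    session_id: str


@dataclass(frozen=True)
class Header:
    event_id: str
    type: str
    ts_event_ms: int
    ts_audio_ms: int
    source: Source
    seq: int

    def __post_init__(self):
        uuid.UUID(self.event_id)  # raises ValueError if not a valid UUID string
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")
        if self.seq < 0:  # assumption: seq starts at 0 and never goes negative
            raise ValueError("seq must be non-negative")


header = Header(
    event_id=str(uuid.uuid4()),
    type="caption.delta",
    ts_event_ms=1000,
    ts_audio_ms=-1,  # -1 marks an unknown audio timestamp (section 7.2)
    source=Source("local.appleSpeech", "asr", "1.0", "session-1"),
    seq=0,
)
```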
6.2 Event Types
6.2.1 caption.delta
Incremental text updates for NOW display.
{
"type": "caption.delta",
"payload": {
"segment_id": "uuid",
"range": { "start": 0, "end": 12 },
"text": "hello worl",
"tokens": [
{"t":"hello","conf":0.92,"ts_audio_ms":1234},
{"t":"worl","conf":0.70,"ts_audio_ms":1450}
],
"is_partial": true,
"stability": 0.0,
"channel": "primary",
"language": "en-US"
}
}
Rules
- Deltas MAY revise previously emitted text for the same segment_id.
- Consumers MUST treat deltas as ephemeral unless committed.
- range describes where this delta applies within the segment (optional; can be full replace).
- stability is an optional heuristic (0.0–1.0). Higher means less likely to revise.
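The delta rules can be illustrated with a small helper. This sketch assumes that `range.start` and `range.end` are character offsets forming a half-open interval into the current segment text; CTS does not pin down the unit, so treat that as an assumption. The function name `apply_delta` is hypothetical.

```python
def apply_delta(current, payload):
    """Return the segment text after applying one caption.delta payload.

    Assumption: range.start/range.end are character offsets [start, end)
    into the current text. A payload without a range is a full replace.
    """
    rng = payload.get("range")
    if rng is None:
        return payload["text"]
    return current[: rng["start"]] + payload["text"] + current[rng["end"]:]


text = ""
text = apply_delta(text, {"text": "hello"})
text = apply_delta(text, {"range": {"start": 5, "end": 5}, "text": " word"})
# A later delta revises the tail of the same segment ("word" becomes "world").
text = apply_delta(text, {"range": {"start": 6, "end": 10}, "text": "world"})
```

Because deltas are ephemeral, nothing here touches History; only a `caption.commit` does.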
6.2.2 caption.commit
Durable acceptance into History.
{
"type": "caption.commit",
"payload": {
"commit_id": "uuid",
"segment_id": "uuid",
"text": "Hello world.",
"tokens": [
{"t":"Hello","conf":0.94,"ts_audio_ms":1234},
{"t":"world.","conf":0.93,"ts_audio_ms":1450}
],
"final": true,
"commit_reason": "pause | vad_end | explicit | time_limit",
"span": { "ts_audio_start_ms": 1200, "ts_audio_end_ms": 1800 }
}
}
Rules
- A commit MUST NOT silently delete prior commits.
- A commit MAY supersede uncommitted delta content for the same segment_id.
- Commit is the authoritative payload for History.
6.2.3 vad.state
VAD state transitions for gating and UX.
{
"type": "vad.state",
"payload": {
"state": "inactive | active",
"confidence": 0.0,
"ts_audio_ms": 0
}
}
6.2.4 sound.event
Non-speech sound classification.
{
"type": "sound.event",
"payload": {
"sound_id": "uuid",
"label": "clink | laugh | applause | cough | knock | music | overlap | unknown",
"strength": 0.0,
"duration_ms": 0,
"features": {
"centroid_hz": 0.0,
"bandwidth_hz": 0.0,
"rms": 0.0
},
"ui_hint": {
"glyph": "🥂",
"pattern": "glass | metal | soft",
"cooldown_ms": 600
}
}
}
Rules
- Sound events MUST be emitted even when ASR is offline (if audio pipeline is running).
- Consumers MAY show a top-left “pill” alert for these events.
6.2.5 transport.status
Connectivity / health events.
{
"type": "transport.status",
"payload": {
"state": "starting | running | degraded | stopped",
"details": "string",
"active_sources": ["local.appleSpeech","sid"],
"latency_ms": 0
}
}
6.2.6 error
Errors with explicit codes.
{
"type": "error",
"payload": {
"code": "audio_buffer_too_small | auth_failed | source_disconnected | decode_failed",
"message": "string",
"source_id": "cloud.realtime",
"recoverable": true
}
}
7. Ordering, Time, and Replay
7.1 Ordering
- Each source MUST provide strict seq ordering.
- The bus MUST merge events by (ts_event_ms, source.seq) for a global order.
- Consumers MUST tolerate small clock skew across sources.
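A sketch of the merge step using the standard library's `heapq.merge`. It assumes each per-source stream is already ordered by `seq` with non-decreasing `ts_event_ms`. The source id as a final tiebreaker is an addition made here for stable ordering and is not part of CTS; `merge_sources` is a hypothetical name.

```python
import heapq


def merge_key(ev):
    # Source id is an added tiebreaker (an assumption, not part of CTS) so that
    # two sources with equal (ts_event_ms, seq) still order the same way each run.
    return (ev["ts_event_ms"], ev["seq"], ev["source"]["id"])


def merge_sources(*streams):
    """Lazily merge per-source streams, each already sorted by merge_key."""
    return heapq.merge(*streams, key=merge_key)


asr = [
    {"ts_event_ms": 100, "seq": 0, "source": {"id": "local.appleSpeech"}},
    {"ts_event_ms": 250, "seq": 1, "source": {"id": "local.appleSpeech"}},
]
sid = [
    {"ts_event_ms": 180, "seq": 0, "source": {"id": "sid"}},
]
merged = list(merge_sources(asr, sid))
order = [(e["source"]["id"], e["seq"]) for e in merged]
```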
7.2 Time bases
- ts_event_ms is a monotonic timestamp taken in the app process.
- ts_audio_ms references audio stream time (sample-count derived).
- If ts_audio_ms is unknown, it MAY be set to -1 and consumers should degrade gracefully.
7.3 Replay
The system SHOULD support recording the event log and replaying it for debugging:
- Deterministic reproduction of UI behavior
- Latency attribution
- Regression tests
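Recording and replay could look like the sketch below. The JSON Lines storage format and the function names `record` and `replay` are assumptions; CTS does not prescribe a log format.

```python
import io
import json


def record(events, fp):
    """Append each event as one JSON object per line."""
    for ev in events:
        fp.write(json.dumps(ev, sort_keys=True) + "\n")


def replay(fp):
    """Yield events back in the order they were recorded."""
    for line in fp:
        if line.strip():
            yield json.loads(line)


events = [
    {"type": "vad.state", "seq": 0, "payload": {"state": "active"}},
    {"type": "caption.delta", "seq": 1, "payload": {"segment_id": "s1", "text": "hi"}},
    {"type": "caption.commit", "seq": 2, "payload": {"segment_id": "s1", "text": "Hi."}},
]
buf = io.StringIO()  # stands in for a log file
record(events, buf)
buf.seek(0)
replayed = list(replay(buf))
```

Feeding `replayed` through the same consumers as the live run is what enables the regression tests listed above.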
8. State Model
8.1 Session
A CTS session begins when audio capture is started and ends when capture stops. Sources may attach/detach within a session.
Session phases:
- starting: audio pipeline initializing
- running: at least one ASR source emitting deltas or commits
- degraded: audio running but ASR stalled/disconnected
- stopped
8.2 Segments
Segments represent contiguous speech content for a single channel (default: primary).
Segment lifecycle:
- open (receiving deltas)
- committed (commit event emitted)
- closed (no longer receiving deltas; may create next segment)
Segment boundary triggers:
- pause (detected silence above threshold)
- vad_end (active→inactive)
- time_limit (segment too long)
- explicit (user action)
9. CommitGate Rules (Normative)
CommitGate decides when to emit caption.commit.
9.1 Primary triggers
A commit SHOULD occur when any of the following holds:
- Pause trigger: silence duration ≥ PAUSE_MS and there is uncommitted text.
- VAD end: VAD transitions active→inactive and there is uncommitted text.
- Time limit: OPEN_SEGMENT_MAX_MS exceeded.
- Explicit flush: user/system requests flush (e.g., switching sources).
9.2 Guard conditions
A commit MUST NOT be emitted if:
- Uncommitted text is empty or whitespace.
- The most recent audio buffer duration < MIN_COMMIT_AUDIO_MS (prevents “commit_empty”).
- A cooldown window is active to prevent commit storms.
9.3 Recommended defaults
- MIN_COMMIT_AUDIO_MS: 120ms (>= vendor min; adjust per engine)
- PAUSE_MS: 350–600ms (tunable)
- OPEN_SEGMENT_MAX_MS: 6–10s
- COMMIT_COOLDOWN_MS: 250–500ms
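The triggers and guards in 9.1 and 9.2 can be combined into one decision function. The sketch below is illustrative: `GateInput` and `commit_reason` are hypothetical names, the constants are single values picked from within the recommended ranges, and the precedence among triggers is a choice made for this example because CTS does not rank them.

```python
from dataclasses import dataclass

# Illustrative values chosen from within the recommended ranges.
PAUSE_MS = 400
OPEN_SEGMENT_MAX_MS = 8000
MIN_COMMIT_AUDIO_MS = 120
COMMIT_COOLDOWN_MS = 300


@dataclass
class GateInput:
    text: str                  # uncommitted text of the open segment
    silence_ms: int            # length of the current run of silence
    vad_ended: bool            # VAD just went active -> inactive
    segment_open_ms: int       # how long the segment has been open
    buffer_audio_ms: int       # duration of the most recent audio buffer
    since_last_commit_ms: int  # time elapsed since the previous commit
    flush_requested: bool = False


def commit_reason(g):
    """Return a commit_reason string, or None if no commit should be emitted."""
    # Guards (9.2) are checked first and veto every trigger.
    if not g.text.strip():
        return None
    if g.buffer_audio_ms < MIN_COMMIT_AUDIO_MS:
        return None
    if g.since_last_commit_ms < COMMIT_COOLDOWN_MS:
        return None
    # Triggers (9.1); this precedence is an assumption of the sketch.
    if g.flush_requested:
        return "explicit"
    if g.vad_ended:
        return "vad_end"
    if g.silence_ms >= PAUSE_MS:
        return "pause"
    if g.segment_open_ms > OPEN_SEGMENT_MAX_MS:
        return "time_limit"
    return None
```

Returning the reason rather than a boolean lets the caller copy it straight into the `commit_reason` field of the emitted `caption.commit`.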
9.4 Switching sources
When switching ASR sources (e.g., cloud→local):
- Emit a transport.status update.
- CommitGate SHOULD flush any open segment with commit_reason="explicit".
10. NOW Pane Contract (Normative)
NOW pane is a rendered view of the open segment, optimized for immediacy.
10.1 Display rules
- NOW MUST display the latest caption.delta content for the current open segment.
- NOW SHOULD update at token-level granularity when possible.
- NOW MAY show interim instability (corrections), but MUST differentiate uncommitted status (e.g., subtle styling, caret).
10.2 Aging-out rule (critical)
When content scrolls/ages out of NOW:
- The aged-out lines MUST be appended to History only if they have been committed, OR if a controlled policy allows “provisional history” (disabled by default).
Default: History only contains commits.
11. History Contract (Normative)
History is the durable transcript and MUST be internally consistent.
11.1 Write model
- History is append-only by default.
- Each append corresponds to a caption.commit.
- Each entry MUST include: commit_id, source, ts_audio span, text, optional tokens.
11.2 Corrections policy
Corrections MAY occur only by explicit operations:
- history.amend (future extension), which MUST reference the prior commit_id and preserve the audit trail.
No silent rewrite.
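An append-only History could be sketched as follows. The class and method names are hypothetical, and ignoring a repeated `commit_id` is an assumption added so that a replayed commit appends only once; CTS does not state how duplicates are handled.

```python
class History:
    """Append-only transcript; entries are never mutated or removed."""

    def __init__(self):
        self._entries = []
        self._seen = set()

    def append_commit(self, ev):
        """Append one caption.commit event; return False for a repeated commit_id."""
        payload = ev["payload"]
        if payload["commit_id"] in self._seen:
            return False  # assumption: duplicates are ignored, not rewritten
        self._seen.add(payload["commit_id"])
        self._entries.append({
            "commit_id": payload["commit_id"],
            "source": ev["source"],
            "span": payload["span"],
            "text": payload["text"],
            "tokens": payload.get("tokens"),  # optional
        })
        return True

    def entries(self):
        """Return a read-only snapshot of the transcript."""
        return tuple(self._entries)


history = History()
commit = {
    "type": "caption.commit",
    "source": {"id": "local.appleSpeech", "kind": "asr"},
    "payload": {
        "commit_id": "c1",
        "segment_id": "s1",
        "text": "Hello world.",
        "span": {"ts_audio_start_ms": 1200, "ts_audio_end_ms": 1800},
    },
}
appended_first = history.append_commit(commit)
appended_again = history.append_commit(commit)
```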
12. SoundEvent UX Contract (Recommended)
When a sound.event occurs:
- UI SHOULD display a transient “pill” alert (top-left of NOW pane).
- Pill animation MAY encode:
  - strength → amplitude
  - duration → animation length
  - label/pattern → iconography
- A cooldown SHOULD prevent spam (cooldown_ms).
Implementations SHOULD also support optionally logging SoundEvents in History as bracketed events:
- Example: [clink] or [laughter] aligned near ts_audio_ms.
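The cooldown recommendation might be implemented as a per-label filter. Tracking the cooldown per label (rather than globally) and the name `PillCooldown` are assumptions of this sketch; the `cooldown_ms` value is read from the event's `ui_hint`.

```python
class PillCooldown:
    """Suppress repeat pills for the same label within its cooldown window."""

    def __init__(self):
        self._last_shown_ms = {}  # label -> ts_event_ms of the last shown pill

    def should_show(self, ev):
        payload = ev["payload"]
        label = payload["label"]
        cooldown = payload.get("ui_hint", {}).get("cooldown_ms", 0)
        now = ev["ts_event_ms"]
        last = self._last_shown_ms.get(label)
        if last is not None and now - last < cooldown:
            return False  # still cooling down: drop the pill, keep the event
        self._last_shown_ms[label] = now
        return True


def clink_at(ts_ms):
    """Build a minimal sound.event for the example."""
    return {
        "type": "sound.event",
        "ts_event_ms": ts_ms,
        "payload": {"label": "clink", "ui_hint": {"cooldown_ms": 600}},
    }


pills = PillCooldown()
shown = [pills.should_show(clink_at(t)) for t in (1000, 1300, 1700)]
```

The filter only affects display; the underlying `sound.event` is still on the bus for logging and metrics.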
13. Buffering and Audio Ingestion
13.1 Ring buffer (recommended)
Maintain a circular buffer of recent audio frames to support:
- waveform/FFT visualization
- alignment of captions with audio time
- replay windows for debugging
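A ring buffer of recent frames can be sketched with `collections.deque`, which discards the oldest item once `maxlen` is reached. The class name `FrameRing` and the choice to key frames by `ts_audio_ms` are assumptions for illustration.

```python
from collections import deque


class FrameRing:
    """Fixed-capacity buffer of (ts_audio_ms, samples) frames."""

    def __init__(self, capacity_frames):
        self._frames = deque(maxlen=capacity_frames)

    def push(self, ts_audio_ms, samples):
        # When full, the deque drops the oldest frame automatically.
        self._frames.append((ts_audio_ms, samples))

    def window(self, start_ms, end_ms):
        """Return frames whose timestamp falls in [start_ms, end_ms)."""
        return [f for f in self._frames if start_ms <= f[0] < end_ms]


ring = FrameRing(capacity_frames=3)
for ts in (0, 10, 20, 30):
    ring.push(ts, [0.0] * 4)  # four dummy samples per frame
recent = [ts for ts, _ in ring.window(0, 100)]
```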
13.2 Minimum buffer constraints
Engines with commit-based APIs may require minimum audio durations. CTS MUST enforce MIN_COMMIT_AUDIO_MS to avoid empty commits.
14. Source Integration Requirements
14.1 ASR sources
An ASR source MUST:
- Emit caption.delta and/or caption.commit.
- Provide monotonically increasing seq.
- Provide source metadata (id, version, session_id).
- Provide best-effort ts_audio_ms.
14.2 SID source
SID MUST:
- Emit sound.event with strength and duration.
- Optionally emit “envelope chips” as a separate event type (future: sound.envelope).
14.3 Enrichment sources (optional)
Enrichment sources MUST NOT overwrite commits. They may:
- emit a parallel channel (channel="enriched")
- suggest punctuation
- suggest entity normalization
15. Observability & Metrics
CTS implementations SHOULD expose:
- event throughput (events/sec)
- ASR latency (audio→delta, audio→commit)
- stall detection (time since last delta/commit)
- VAD duty cycle
- SoundEvent counts by label
- history growth rates
Minimum required logs:
- status transitions
- errors with codes
- source connect/disconnect
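Stall detection (time since last delta/commit) could be tracked as below. The threshold value, the class name `StallDetector`, and the decision to report a stall only while VAD is active (so that ordinary silence does not count) are all assumptions of this sketch.

```python
class StallDetector:
    """Flag an ASR stall when no caption event arrives for too long during speech."""

    def __init__(self, stall_after_ms, start_ms):
        self.stall_after_ms = stall_after_ms
        self.last_caption_ms = start_ms  # measure from session start until first caption

    def observe(self, ev):
        if ev["type"] in ("caption.delta", "caption.commit"):
            self.last_caption_ms = ev["ts_event_ms"]

    def is_stalled(self, now_ms, vad_active):
        # Assumption: silence is not a stall, so only report while speech is detected.
        if not vad_active:
            return False
        return now_ms - self.last_caption_ms >= self.stall_after_ms


detector = StallDetector(stall_after_ms=2000, start_ms=0)
detector.observe({"type": "caption.delta", "ts_event_ms": 1000})
detector.observe({"type": "sound.event", "ts_event_ms": 2900})  # not a caption event
```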
16. Error Handling (Normative)
16.1 Classification
Errors MUST be categorized with:
- code
- recoverable
- source_id
16.2 User-visible behavior
Recoverable errors SHOULD:
- transition status to degraded
- keep UI running
- show explicit markers, e.g., “ASR stalled” (do not fake captions)
Non-recoverable errors SHOULD:
- stop affected source
- keep other sources running if possible
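The two behaviors above can be sketched as a pure function over transport state. The name `handle_error` is hypothetical, and the status chosen after a non-recoverable error (unchanged if other sources remain, `stopped` if none do) is an assumption, since CTS leaves that transition unspecified.

```python
def handle_error(status, active_sources, payload):
    """Return (new_status, new_active_sources) after an error event payload."""
    if payload["recoverable"]:
        # Recoverable: mark degraded and keep every source attached.
        return "degraded", list(active_sources)
    # Non-recoverable: stop only the affected source.
    remaining = [s for s in active_sources if s != payload["source_id"]]
    # Assumption: keep the current status while any source remains.
    return (status if remaining else "stopped"), remaining


sources = ["cloud.realtime", "sid"]
recoverable = {"code": "source_disconnected", "source_id": "cloud.realtime", "recoverable": True}
fatal = {"code": "auth_failed", "source_id": "cloud.realtime", "recoverable": False}

after_recoverable = handle_error("running", sources, recoverable)
after_fatal = handle_error("running", sources, fatal)
```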
17. Privacy & Data Handling
- Default mode SHOULD be local-first.
- Cloud sources MUST be opt-in and explicitly identified in provenance.
- Event logs SHOULD be stored locally unless user exports.
- If exporting, provenance and timestamps MUST be preserved.
18. Conformance Levels
Level A (Minimum viable CTS)
- caption.delta
- caption.commit
- transport.status
- basic ordering with seq
- NOW + History via commits
Level B (Enhanced)
- vad.state gating
- sound.event
- stall detection
- replay logging
Level C (Advanced)
- multi-channel (primary/enriched)
- diarization extension
- explicit history amendments with audit trail
19. Test Plan (Acceptance)
19.1 Functional
- Typewriter test: speech produces deltas continuously without waiting for pause.
- Pause commit test: pause triggers commit; history appends once.
- No empty commits: short buffers never cause commit_empty errors.
- Source switch: switching engines flushes open segments properly.
- SoundEvent pill: pill appears on sound events with cooldown.
- Stall handling: ASR stall marks degraded state; no hallucinated output.
19.2 Stress
- Background noise + speech
- Media playback (audiobook) + ambient sounds
- Long sessions (30–120 minutes)
- Rapid start/stop cycles
19.3 Determinism
- Record event log, replay, verify identical History output and segment boundaries.
20. Extensions (Reserved)
Potential future event types (non-breaking):
- caption.segment.open / caption.segment.close
- sound.envelope (chips)
- speaker.tag (diarization)
- history.amend (explicit edits with audit trail)
- export.request / export.done
Appendix A: Minimal Swift Protocol Sketch (Non-normative)
Informative only; not required for conformance.
- CaptionBus publishes TransportEvent
- Producers implement TransportProducer
- Consumers implement TransportConsumer or subscribe via Combine/AsyncStream
Appendix B: Example Human-readable History Rendering
10:41:12.340 Hello world.
10:41:13.120 [clink]
10:41:14.002 Can you hear me okay?
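A rendering in this timestamp-then-text shape could be produced as sketched below. The entry fields `kind` and `label`, the `session_start_ms` parameter (milliseconds since local midnight at which audio time 0 began), and both function names are hypothetical; CTS does not define how audio time maps to clock time.

```python
def format_clock(ms_since_midnight):
    """Format milliseconds since midnight as HH:MM:SS.mmm."""
    seconds, ms = divmod(ms_since_midnight, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours % 24:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def render(entries, session_start_ms):
    """Render commits as text and sound events as bracketed labels."""
    lines = []
    for e in entries:
        body = e["text"] if e["kind"] == "commit" else f"[{e['label']}]"
        stamp = format_clock(session_start_ms + e["ts_audio_ms"])
        lines.append(f"{stamp} {body}")
    return "\n".join(lines)


entries = [
    {"kind": "commit", "ts_audio_ms": 1200, "text": "Hello world."},
    {"kind": "sound", "ts_audio_ms": 1980, "label": "clink"},
]
# Hypothetical session start at 10:41:11.140, expressed in ms since midnight.
rendered = render(entries, session_start_ms=38_471_140)
```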