Clinical AI · 11 min read · 2026-05-17
How Multi-Speaker Diarization Solves Group Therapy Charting Blockages
Group therapy, DBT skills groups, and couples work all have the same documentation problem: who said what, when, and with what affect. We built SignalEHR's ambient scribe to solve this in real time for up to 8 speakers — here's how the pipeline actually works, and where the privacy guardrails live.
The problem
Standard AI scribes are built around the one-therapist / one-client assumption. They diarize into two channels and call it done. In a real group setting that breaks down fast:
- A DBT skills group typically has 6–10 members plus a co-leader. A two-channel scribe collapses them into "Speaker 2" and notes become useless for tracking per-member skill acquisition or homework completion.
- Couples therapy needs separate emotion scores per partner — the whole point of EFT or Gottman work is the dyad dynamics. A single-channel emotion score hides the asymmetry.
- Family sessions add the "identified patient" tracking problem: who's the focus, who's the supporting cast, and which family member said the thing that shifted the conversation?
None of this is solved by transcription alone. It needs diarization (who spoke when) plus identification (which person each "who" maps to in your client roster) — and both have to be reliable enough that the clinical note auto-attributes accurately.
The pipeline, step by step
Here's what happens when you start a group session in SignalEHR. The whole loop runs in under 200 ms per turn, so the therapist sees real-time speaker labels in the live transcript.
Step 1
Pre-session voice enrollment (one-time per client)
Each client records a 15-second voice sample at intake (consented, stored as a 192-dimensional speaker embedding — not the raw audio). The embedding is computed with the same Pyannote speaker-embedding model we use at inference time, so the comparison metric stays consistent. Re-enrollment is offered every 6 months because voice characteristics drift with age, hormones, illness, or major life events.
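For the curious, enrollment reduces to a few lines. Here's a minimal sketch using pyannote.audio's PretrainedSpeakerEmbedding wrapper around that SpeechBrain model; the enroll_client helper and file handling are illustrative, not our production code:

```python
import numpy as np
import torch
from pyannote.audio import Audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

# Same embedding model at enrollment and at inference, so the
# comparison metric stays consistent across the pipeline.
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cpu"),
)
audio_loader = Audio(sample_rate=16000, mono="downmix")  # matches the capture format

def enroll_client(wav_path: str) -> np.ndarray:
    """Turn a ~15 s consented voice sample into a unit-norm 192-dim vector.
    Only the vector is persisted; the raw audio is discarded."""
    waveform, _ = audio_loader(wav_path)    # (channel, samples) tensor at 16 kHz
    emb = embedding_model(waveform[None])   # -> (1, 192) numpy array
    emb = emb / np.linalg.norm(emb)         # unit-normalize so cosine == dot product
    return emb.squeeze(0)
```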
Step 2
Live audio capture (16 kHz mono PCM)
The session audio is captured at 16 kHz mono PCM — high enough for speaker discrimination, low enough to stream cheaply. For in-person groups we recommend a single omnidirectional condenser mic at the center of the table. For telehealth groups we use the per-participant Daily.co audio tracks (no mixing on the client side), which gives near-perfect channel separation before we even reach the diarization step.
Step 3
Streaming transcription (Deepgram Nova)
Audio chunks are streamed to Deepgram with diarize=true and multichannel=true (for telehealth). Deepgram returns word-level timestamps and an initial speaker label (Speaker 0, Speaker 1, …). Sub-second latency, so the therapist sees words in the live transcript as they're spoken.
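Each word object in Deepgram's response carries start, end, and a speaker index, so collapsing words into speaker turns is straightforward. A sketch (field names follow Deepgram's documented word-level JSON; the Turn type is ours, for illustration):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: int   # Deepgram's session-local label (Speaker 0, 1, ...)
    start: float   # seconds from session start
    end: float
    text: str

def words_to_turns(words: list[dict]) -> list[Turn]:
    """Collapse Deepgram word objects ({"word", "start", "end", "speaker"})
    into contiguous same-speaker turns."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            turns[-1].end = w["end"]          # extend the running turn
            turns[-1].text += " " + w["word"]
        else:
            turns.append(Turn(w["speaker"], w["start"], w["end"], w["word"]))
    return turns
```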
Step 4
Speaker re-identification (Pyannote)
Every 3-second window of audio gets a 192-dim speaker embedding extracted via Pyannote's speechbrain/spkrec-ecapa-voxceleb model. We compare each window's embedding against the enrolled embeddings for the session's expected attendees using cosine similarity. The match with the highest score above the 0.65 threshold wins; below threshold, the segment stays as an anonymous speaker label.
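The matching step itself is tiny once the embeddings exist. A sketch, assuming roster embeddings were unit-normalized at enrollment (the identify_window helper is illustrative):

```python
import numpy as np

MATCH_THRESHOLD = 0.65  # below this, the segment stays anonymous

def identify_window(window_emb: np.ndarray,
                    roster: dict[str, np.ndarray]) -> str | None:
    """Match one 3 s window embedding against the session's enrolled
    attendees. Returns the best client_id above threshold, else None."""
    window_emb = window_emb / np.linalg.norm(window_emb)
    best_id, best_score = None, MATCH_THRESHOLD
    for client_id, enrolled in roster.items():
        score = float(np.dot(window_emb, enrolled))  # cosine (both unit-norm)
        if score > best_score:
            best_id, best_score = client_id, score
    return best_id
```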
Step 5
Identity propagation across turns
Once a speaker window is matched to a known client, that identity is propagated to overlapping Deepgram speaker labels for the same turn. This bridges Deepgram's acoustic clustering (which is consistent within a session but not across sessions) with our roster-based identity (which is stable but sparse). The result: the therapist sees real names in the live transcript, not Speaker 3.
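One way to picture the bridge: each Deepgram label collects votes from the identified 3-second windows it overlaps, weighted by overlap duration. A simplified sketch with illustrative data shapes:

```python
from collections import defaultdict

def propagate_identities(turns, windows):
    """Map session-local Deepgram labels to roster client_ids by
    majority time-overlap with identified windows.

    turns:   [(dg_speaker, start, end)]  from the streaming STT step
    windows: [(client_id, start, end)]   windows matched above threshold
    """
    votes = defaultdict(lambda: defaultdict(float))  # dg_speaker -> client -> seconds
    for dg_speaker, t_start, t_end in turns:
        for client_id, w_start, w_end in windows:
            overlap = min(t_end, w_end) - max(t_start, w_start)
            if overlap > 0:
                votes[dg_speaker][client_id] += overlap
    return {dg: max(cv, key=cv.get) for dg, cv in votes.items()}
```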
Step 6
Per-speaker emotion + clinical signal extraction
For each speaker turn we extract: pitch contour, speaking rate, energy, MFCC-13, and (post-session) a sentiment score from the transcript. These get mapped to per-speaker clinical indices: emotional variability, agitation, withdrawal, and (when relevant) suicide ideation flags. In couples therapy these indices roll up into Gottman's four horsemen detection (criticism, contempt, defensiveness, stonewalling) — per partner, per turn.
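A sketch of the acoustic half for one attributed turn, with librosa standing in as an illustrative extraction library:

```python
import librosa
import numpy as np

def turn_features(y: np.ndarray, sr: int, n_words: int) -> dict:
    """Acoustic features for one speaker turn, already sliced out of the
    session audio by its attributed timestamps."""
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch contour (NaN = unvoiced)
    rms = librosa.feature.rms(y=y)[0]                          # per-frame energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # MFCC-13
    duration = len(y) / sr
    return {
        "pitch_mean_hz": float(np.nanmean(f0)),
        "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
        "energy_mean": float(rms.mean()),
        "mfcc_mean": mfcc.mean(axis=1),            # (13,) summary vector
        "speaking_rate_wps": n_words / duration,   # words per second, from the transcript
    }
```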
Step 7
Modality-aware note drafting
At session end the structured transcript (speaker-attributed turns + per-speaker emotion timeline) feeds into the modality-specific note engine: a DBT skills-group note tracks module coverage and per-member homework status; an EFT couples note tracks the negative interaction cycle; a family session note tracks the identified-patient focus and sub-system boundaries. The therapist reviews and signs the draft — SignalEHR never publishes a note without human review.
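Roughly, the structured hand-off into the note engine looks like this (illustrative types, not our actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class MemberSection:
    client_id: str
    floor_time_pct: float
    skills_engaged: list[str]    # matched against the day's module curriculum
    homework_status: str         # "volunteered" | "completed" | "skipped"
    clinical_flags: list[str] = field(default_factory=list)

@dataclass
class DBTGroupNoteDraft:
    module: str                  # e.g. "Distress Tolerance, week 3"
    members: list[MemberSection]
    signed: bool = False         # stays a draft until the clinician reviews and signs
```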
What we actually track per speaker
For every identified speaker in a group or couples session (a short computation sketch follows this list):
- Floor time — total seconds spoken, % of session. Useful for noticing the quiet member or the dominant one.
- Turn count + average turn length — surfaces monologuing vs. clipped, withdrawn responses.
- Emotional variability score — standard deviation of valence across the session. Low variability with low affect often correlates with depression-flavored withdrawal; high variability with high arousal correlates with crisis states.
- Interruption count — bidirectional. Helpful in couples work where the pursue/withdraw pattern shows up as asymmetric interruption.
- Topic alignment — semantic similarity of this speaker's turns to the group's primary topic cluster. Low alignment can indicate a side-conversation or an unrelated personal disclosure.
- Per-member clinical flags — suicide ideation, substance reference, abuse disclosure, medication non-adherence. Each is timestamped to the exact second of the session so the therapist can pull the audio segment for consultation.
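The sketch promised above: once turns are speaker-attributed, the participation math is simple arithmetic (data shapes are illustrative):

```python
import numpy as np

def participation_metrics(turns, session_seconds: float) -> dict:
    """Per-speaker participation stats.

    turns: [(start, end, valence)] for one speaker, where valence is the
    per-turn sentiment score from the signal-extraction step.
    """
    durations = [end - start for start, end, _ in turns]
    valences = [v for _, _, v in turns]
    floor_time = sum(durations)
    return {
        "floor_time_s": floor_time,
        "floor_time_pct": 100.0 * floor_time / session_seconds,
        "turn_count": len(turns),
        "avg_turn_len_s": floor_time / len(turns) if turns else 0.0,
        # emotional variability = std dev of valence across the session
        "emotional_variability": float(np.std(valences)) if valences else 0.0,
    }
```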
Why this beats a stitched-together stack
A common workaround is to record the session in Zoom, run it through a generic transcription tool like Otter.ai, paste the transcript into a general-purpose AI scribe, then manually attribute speakers in the EHR. That stack has three failure modes:
- Speaker labels reset every session in any transcription tool that isn't identity-aware. The therapist re-labels "Speaker 1" → "Sarah" every time. Across a 12-week DBT cohort that's 12 × 5 minutes, an hour of busywork per group leader.
- Emotion signals get averaged across all speakers when the tool only outputs a single channel. The couple that's in crisis looks fine on paper because partner A compensates for partner B's low affect.
- Audio crosses tool boundaries, multiplying BAA surface area. Every additional tool needs its own BAA and creates another place where PHI can leak.
Single-platform diarization solves all three: speaker identity persists across sessions, emotion signals are per-speaker by default, and the audio never leaves SignalEHR's HIPAA-covered infrastructure.
Where the privacy boundary lives
Speaker embeddings sound creepy until you understand what they are. An embedding is a 192-dimensional vector — a fingerprint for a voice, much like a face embedding. From an embedding you cannot reconstruct the original audio. You can't even reliably identify a speaker outside the closed roster you enrolled them against.
- Embeddings are per-clinic. A client enrolled at Clinic A is not identifiable at Clinic B. Each clinic's embedding store is logically isolated in our infrastructure, and the similarity search never compares embeddings across clinics.
- Audio is not stored. Raw audio lives in memory long enough to transcribe (single-digit seconds) and is dropped immediately after. Only the transcript + speaker-attribution + extracted clinical signals are persisted to the chart.
- Clients can opt out. Voice enrollment is consented at intake. If a client declines, group notes still get transcribed but stay anonymously labeled (Speaker 3, etc.) — the diarization works, the identification doesn't.
- BAAs cover the whole pipeline. Deepgram (STT) and our Pyannote inference host both have signed BAAs. Nothing in the audio path runs on a vendor without one.
- PIPEDA + HIPAA aligned. Embeddings are treated as identifiable health information under both regimes — same access controls, audit log, and retention rules as the chart itself.
The honest limitations
This is a hard problem and we don't pretend it's solved:
- Crosstalk drops accuracy. When three people talk at once, our identification accuracy on the overlap drops from ~96% to ~78%. The transcript still captures the words; the attribution gets fuzzier.
- Sick or congested voices. A bad cold can shift a speaker's embedding far enough that we fall below threshold and label the segment as unknown. Re-enrollment fixes it but isn't automatic.
- Identical twins. Genuinely a hard case for acoustic embeddings. We hand-flag the session and prompt the clinician to verify.
- More than 8 speakers. We cap at 8 because that's where group dynamics + acoustic resolution stop being reliable. Larger groups (psychoeducational classes, etc.) can still be transcribed — they just don't get per-speaker attribution.
What this looks like in your chart
A finished DBT skills-group note in SignalEHR includes, for each member who attended:
- Their participation summary (floor time, engagement, mood)
- The specific skills they verbally engaged with (cross-referenced against the day's module curriculum)
- Their homework-review status (volunteered? completed? skipped?)
- Any per-member clinical flags raised during the session
- Auto-populated continuity to the next session's prep card
The therapist edits, signs, and the per-member sections flow into each client's individual chart automatically — no copy-paste, no "Speaker 4 said X" ambiguity.
Want to see it on a real session?
Try SignalEHR free for 14 days — record one group session and see the per-member attribution land in the draft note.