Conversation summarization

Structured summaries of multi-speaker transcripts

Updated June 29, 2026

Conversation differs structurally from edited prose. A meeting transcript may contain interruptions, incomplete utterances, disfluencies, digressions, and decisions dispersed across multiple turns. A summarizer must infer discourse structure rather than recover an explicit structure supplied by the source.

Speaker-attributed input

A summary is only as good as the transcript under it, and conversations need more than the words. They need speaker attribution. "I'll send the contract by Friday" means something different depending on who said it, so the transcript should carry speaker labels and timestamps before summarization.

[00:12:04] Agent: So I'll waive the fee this once.
[00:12:09] Customer: And you'll email confirmation today?
[00:12:11] Agent: Yes, within the hour.

Transcript errors propagate directly into the summary. A transcript that misattributes turns or mishears numbers produces a summary that is confidently wrong about who agreed to what.
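As a minimal sketch of what "speaker labels and timestamps before summarization" can look like in code, the snippet below parses lines in the format of the sample transcript into structured turns. The `Turn` class, the `parse_transcript` function, and the assumption that every line follows `[HH:MM:SS] Speaker: text` are illustrative, not a standard format.

```python
import re
from dataclasses import dataclass

# Assumed line format, taken from the sample transcript: "[HH:MM:SS] Speaker: text"
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s*(.*)$")


@dataclass
class Turn:
    start_s: int   # seconds from the start of the recording
    speaker: str
    text: str


def parse_transcript(raw: str) -> list[Turn]:
    """Turn labeled transcript lines into Turn records, rejecting unlabeled lines."""
    turns = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            # Refuse to guess who spoke: an unattributed line is an input error.
            raise ValueError(f"unattributed line: {line!r}")
        hours, minutes, seconds, speaker, text = match.groups()
        start_s = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        turns.append(Turn(start_s, speaker.strip(), text))
    return turns


SAMPLE = """\
[00:12:04] Agent: So I'll waive the fee this once.
[00:12:09] Customer: And you'll email confirmation today?
[00:12:11] Agent: Yes, within the hour.
"""
turns = parse_transcript(SAMPLE)
```

Failing loudly on an unlabeled line is a design choice for this sketch: it surfaces attribution gaps before they reach the summarizer.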

Decide what kind of summary you need

"Summarize this" is underspecified. Conversations support several outputs, and you should pick deliberately.

An abstractive summary rewrites the gist in new sentences ("The customer disputed a late fee; the agent waived it and promised email confirmation"). An extractive summary pulls the most important verbatim lines. Structured extraction pulls specific fields into a schema: action items, decisions, owners, due dates, questions. For most business uses, the structured output is the valuable one: after a meeting people want the specific commitments and decisions they have to act on, not a paragraph describing the meeting.

Generate structured summaries

Give a language model the speaker-labeled transcript and ask for exactly the structure you defined: a summary, action items with an owner, task, and due date, and a list of decisions. A schema rather than free prose forces the model toward the fields you use and makes the output checkable.
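One way to make that schema concrete is sketched below. The class names, the prompt wording, and the JSON keys are assumptions chosen for illustration, and no model is called: `EXAMPLE_OUTPUT` is a hand-written stand-in for a model reply, used only to show the validation step.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    owner: str
    task: str
    due: Optional[str]   # free text such as "Friday"; None if no date was stated


@dataclass
class StructuredSummary:
    summary: str
    action_items: list[ActionItem]
    decisions: list[str]


# Hypothetical prompt wording; adapt it to whichever model you use.
INSTRUCTIONS = (
    "Summarize the speaker-labeled transcript below. Reply with JSON only, "
    'using exactly these keys: "summary" (string), "action_items" (list of '
    'objects with "owner", "task", "due"), "decisions" (list of strings). '
    'Use null for "due" when no date was stated.\n\nTranscript:\n'
)


def build_prompt(transcript: str) -> str:
    return INSTRUCTIONS + transcript


def parse_summary(model_output: str) -> StructuredSummary:
    """Parse and check a model reply; raises on malformed JSON or missing fields."""
    data = json.loads(model_output)
    items = [ActionItem(i["owner"], i["task"], i.get("due")) for i in data["action_items"]]
    for item in items:
        if not item.owner.strip():
            raise ValueError(f"action item without an owner: {item.task!r}")
    return StructuredSummary(data["summary"], items, list(data["decisions"]))


# Hand-written stand-in for a model reply, not real model output.
EXAMPLE_OUTPUT = """
{"summary": "The customer disputed a late fee; the agent waived it.",
 "action_items": [{"owner": "Agent", "task": "Email confirmation of the waiver", "due": "today"}],
 "decisions": ["Late fee waived this once"]}
"""
parsed = parse_summary(EXAMPLE_OUTPUT)
```

Because the reply must parse into named fields, a missing owner or a malformed reply fails at `parse_summary` instead of passing silently into meeting notes.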

Conversations longer than the context window

A two-hour call can exceed what a model reads at once, so you cannot always pass the whole transcript. Split it into chunks, summarize each, then summarize those summaries. This is the map-reduce shape: a map step that summarizes every chunk independently, then a reduce step that summarizes the joined results into one.
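The map and reduce steps can be sketched in a few lines. The `summarize` argument is a placeholder for whatever model call you use, and `toy_summarize` is a deliberately crude stand-in (it keeps the first sentence of each line) so the example runs without a model.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently (map), then summarize the joined results (reduce)."""
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]   # map step
    if len(partials) == 1:
        return partials[0]                              # nothing to combine
    return summarize("\n".join(partials))               # reduce step


def toy_summarize(text: str) -> str:
    # Stand-in for a model call: keep only the first sentence of each line.
    return " ".join(
        line.split(".")[0].strip() + "." for line in text.splitlines() if line.strip()
    )


chunks = ["Fee disputed. Long back and forth.", "Fee waived. Small talk follows."]
result = map_reduce_summarize(chunks, toy_summarize)
```

If the joined partial summaries are themselves too long for one call, the same function can be applied again to groups of partials; this sketch leaves that out.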

Cut on topic or speaker-turn boundaries, never mid-sentence, because a fact's context is lost the moment it straddles a chunk edge. Carry a little overlap to soften the seams. For very long archives, segment by topic first and summarize per topic; blind chunking loses more.
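A minimal sketch of turn-boundary chunking with overlap follows. It assumes turns arrive as whole strings and uses character count as a rough proxy for the model's token budget; both the `chunk_turns` name and the `max_chars` and `overlap_turns` parameters are illustrative.

```python
def chunk_turns(turns: list[str], max_chars: int, overlap_turns: int = 1) -> list[list[str]]:
    """Group whole turns into chunks of roughly max_chars, repeating the last
    overlap_turns turns of each chunk at the start of the next one."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for turn in turns:
        if current and size + len(turn) > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_turns:] if overlap_turns > 0 else []
            size = sum(len(t) for t in current)
        # A turn is never split, so a single long turn may exceed max_chars.
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks


turns = ["A: one", "B: two", "A: three", "B: four"]
chunks = chunk_turns(turns, max_chars=14, overlap_turns=1)
```

Here each chunk starts with the last turn of the previous one, so a question and its answer are less likely to be separated by a chunk edge. Topic segmentation, which the text recommends for long archives, would replace the size test with a topic-change test.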

A summary nobody can verify will not be trusted. Tie claims back to the audio: attach the timestamp range each summary point came from, so a user can jump to "the agent waived the fee" and hear it. Grounding turns each point into a claim with its source attached, so anyone who doubts it can check in seconds. This is the main defense against the fabricated and distorted claims described under Common summarization errors.
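Grounding can be represented as data: each summary point carries the time range it came from, and a lookup returns the turns in that range. The sketch below assumes turns are `(start_seconds, speaker, text)` tuples; `GroundedPoint`, `evidence`, and `render` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GroundedPoint:
    text: str
    start_s: int   # first second of the supporting audio
    end_s: int     # last second of the supporting audio


def fmt(seconds: int) -> str:
    """Format seconds as HH:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def evidence(point: GroundedPoint, turns: list[tuple[int, str, str]]) -> list[tuple[int, str, str]]:
    """Return the turns whose start time falls inside the point's range."""
    return [t for t in turns if point.start_s <= t[0] <= point.end_s]


def render(point: GroundedPoint) -> str:
    return f"{point.text} [{fmt(point.start_s)}-{fmt(point.end_s)}]"


turns = [
    (724, "Agent", "So I'll waive the fee this once."),
    (729, "Customer", "And you'll email confirmation today?"),
    (731, "Agent", "Yes, within the hour."),
]
point = GroundedPoint("The agent waived the fee.", 724, 724)
support = evidence(point, turns)
line = render(point)
```

A point whose range matches no turns has nothing to play back, which is a useful signal that the claim should be reviewed or dropped.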

flowchart LR
    A[Audio] --> B[Transcript<br/>+ speakers, times]
    B --> C[Summarize<br/>map-reduce if long]
    C --> D[Grounded notes<br/>+ action items]
The summarization pipeline. Speaker-attributed transcript in, grounded structured output out.

Common summarization errors

Every failure here has the same root: the summarizer reasons over a transcript, and both the transcript and the model can be wrong.

Asked to summarize a thin or garbled stretch, a language model writes fluent output anyway and states things the conversation did not contain. This is the hallucination problem moved up one level. Recognition errors feed it: a misheard dosage, name, or amount slides straight into a confident-sounding summary. This is the beyond-WER (word error rate) argument again: the few words that carry the meaning are the ones a transcript is most likely to miss. Left to itself, a summarizer also drifts toward the bland and safe ("they discussed the project"), a generic line that fits any meeting, and drops the one specific commitment or complaint that mattered most.

Three practices push back: ground each point to a timestamp, prefer extracted verbatim text for high-stakes facts, and ask for owners and numbers in named fields rather than prose. Each keeps the summary tied to what the transcript actually contains.
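A cheap automated check along these lines is to confirm that every number in the summary also appears in the transcript, and that any quoted text is truly verbatim. The sketch below is illustrative only: the function names are hypothetical, the $35 fee is an invented example value, and the check catches digit strings but not spelled-out numbers.

```python
import re

# Matches digit runs with optional separators, e.g. "35", "1,200", "2.5"
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(summary: str, transcript: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the transcript."""
    seen = set(NUMBER_RE.findall(transcript))
    return [n for n in NUMBER_RE.findall(summary) if n not in seen]


def quote_is_verbatim(quote: str, transcript: str) -> bool:
    """True if the quote occurs in the transcript, ignoring case and spacing."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(quote) in normalize(transcript)


# Invented example values for illustration.
transcript = "Agent: The late fee is $35. I'll waive it this once."
summary = "The agent waived the $53 late fee."
flagged = unsupported_numbers(summary, transcript)
```

A flagged number does not prove the summary is wrong, but it marks the point for a human to check against the audio.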

Real-time summarization

The sections above assume a finished recording. Summarizing a live call adds the constraints of streaming: you maintain a running summary that updates as the conversation unfolds, you work from partial transcripts that may still change, and you cannot see the end before you summarize the middle. Live summaries help with agent assist ("here is what the customer wants so far"), but they are provisional and get revised as the call continues, just like the recognition beneath them.
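The provisional nature of a live summary can be sketched as a small state holder: finalized segments accumulate, the latest partial segment replaces the previous one, and the summary is recomputed on demand. The `RunningSummary` class and its `on_partial` and `on_final` methods are hypothetical names, not any recognizer's API, and `last_line` is a trivial stand-in for a model call.

```python
from typing import Callable


class RunningSummary:
    """Keeps finalized transcript segments plus the latest partial one."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize
        self._final: list[str] = []
        self._partial = ""

    def on_partial(self, text: str) -> None:
        # A partial result may still change, so it replaces the previous partial.
        self._partial = text

    def on_final(self, text: str) -> None:
        # A finalized segment is kept and the pending partial is cleared.
        self._final.append(text)
        self._partial = ""

    def current(self) -> str:
        """Provisional summary of everything heard so far."""
        lines = list(self._final)
        if self._partial:
            lines.append(self._partial)
        return self._summarize("\n".join(lines))


def last_line(text: str) -> str:
    # Stand-in for a model call: echo the most recent line.
    return text.splitlines()[-1] if text else ""


live = RunningSummary(last_line)
live.on_partial("Customer: I want to cancel my")
first = live.current()
live.on_partial("Customer: I want to change my plan")
second = live.current()
live.on_final("Customer: I want to change my plan.")
third = live.current()
```

In this toy run the first summary reflects a misrecognized partial and is revised once the recognizer corrects itself. Re-summarizing the whole call on every update is the simplest approach; a real system would likely summarize incrementally.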

Common questions

Why is summarizing a conversation harder than summarizing an article?

Because an article was written to be read and a meeting was not. The one real decision sits buried under ten minutes of scheduling, three people interrupt each other, and there are no headings. The summarizer has to impose a structure that was never there and track who said what, since "I'll send the contract by Friday" means nothing without the speaker attached.

How are long calls summarized if they exceed the model's context?

Map-reduce: chunk the transcript, summarize each chunk, summarize those summaries. Cut on topic or speaker-turn boundaries, never mid-sentence, and overlap a little, because a fact's context vanishes the moment it straddles a chunk edge. For long archives, segment by topic first and summarize per topic, which loses less than blind chunking.

Why does my call summary sometimes state things that were not said?

Because the model writes fluently even where the transcript was thin, and any misheard name or amount slides into a confident-sounding line. Ground each point to a timestamp so a user can jump to the audio and check it, prefer extracted verbatim text for high-stakes facts, and start from an accurate transcript. The few words that carry the meaning are the ones recognition is likeliest to miss.

What is the most useful output from conversation summarization?

For most business uses, structured extraction, not a prose paragraph: action items with owners and due dates, decisions, open questions. People want the specific things they have to act on, not a description of the meeting. A schema forces the model toward those fields and makes the output checkable, where a generic summary buries them.
