Captions versus Transcripts:
It’s all about portions and waypoints

Captions reside at the bottom of a video and show what’s being said in short bits. Transcripts also show what’s said, but let you see more text at once.

At first glance, it might not seem like they’re much different. The words are all the same, after all. Look a little closer, though, and it becomes obvious that there are a couple of key differences that technology hasn’t yet neatly sorted out.

The differences, which boil down to portions and waypoints, became obvious as we built the InSite open source publishing system for Duke University’s Rutherford Living History Archive.

The premise is similar. Each snippet of text on a captioned video has a timestamp that tells it when to show up. Each sentence of text in an interactive transcript has a timestamp so users can navigate, search and share. Click the transcript and the video scrubs to the start of that sentence. Search for a word or phrase and hits are highlighted in the transcript and indicated by dots on the video seek bar. Click on a highlight or dot and the video scrubs to the start of that sentence. Select a transcript excerpt and the social media share or copy includes a link that scrubs to the first sentence of the excerpt in the video.
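A rough sketch of that wiring looks something like the snippet below. The selectors, the data-start attribute and the ?t= URL parameter are placeholders of my own for illustration, not InSite's actual markup; the point is simply that each sentence carries its start time, and clicking it sets the video's current time.

    // Click-to-seek: each rendered sentence carries its start time in seconds
    // in a (hypothetical) data-start attribute.
    const video = document.querySelector('video') as HTMLVideoElement;
    const transcript = document.querySelector('.transcript') as HTMLElement;

    transcript.addEventListener('click', (event) => {
      // Find the sentence element that was clicked and scrub to its start time.
      const sentence = (event.target as HTMLElement).closest('[data-start]');
      if (!sentence) return;
      video.currentTime = Number(sentence.getAttribute('data-start'));
    });

    // A shared excerpt links back to the first sentence of the selection,
    // here via an assumed ?t= query parameter.
    function shareLinkFor(firstSentenceStart: number): string {
      const url = new URL(window.location.href);
      url.searchParams.set('t', String(firstSentenceStart));
      return url.toString();
    }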

When we were looking for automatic formatting tools for transcriptions, captioning software was an obvious first place to look. But captions are generally broken up by time: they often start or end in the middle of a sentence. We needed to highlight transcripts one discrete sentence at a time, and we wanted quote shares to start at the beginning of a sentence.

So the first difference between captions and transcripts is that a transcript portion is a sentence, while a caption portion is a few seconds.
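The difference is easy to see in the cue text itself. In the contrived excerpt below (the wording and timestamps are invented), the caption-style cues break wherever the clock dictates, while the transcript-style cue covers one whole sentence:

    WEBVTT

    NOTE Caption-style cues, cut by time:

    00:00:12.000 --> 00:00:15.000
    When I first came to Durham I had no

    00:00:15.000 --> 00:00:18.500
    idea the archive even existed.

    NOTE Transcript-style cue, one full sentence:

    00:00:12.000 --> 00:00:18.500
    When I first came to Durham I had no idea the archive even existed.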

Captions also take on the temporal aspect of video. You see video and caption in lockstep. So captions don’t need to separately indicate who’s talking. And since you only see a bit of text at once, there’s no need to organize the text by paragraph.

In contrast, transcripts aren’t locked into the same timeframe as the video. You can read a transcript along with a video, but you can also see more of the text at once, and any part of it at any point. You can read ahead, go back to something you already heard, and compare. This means the text needs organization. The WebVTT standard is a way of formatting text that connects it to audio or video so that a player can interpret it and humans can read it easily. WebVTT includes speaker and chapter markers for transcripts. But this wasn’t enough. Look at a big chunk of text at once and you need more waypoints: paragraphs and subheadings.
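To be concrete about what the standard does cover: a voice tag inside a cue tells a player who is speaking (the speakers and times below are invented for illustration).

    00:01:05.000 --> 00:01:09.250
    <v Interviewer>What brought you to Durham in the first place?

    00:01:09.500 --> 00:01:14.000
    <v Narrator>Honestly, it was the library.

But the cue syntax has nothing comparable for marking where a paragraph ends or where a subhead belongs.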

We bent the WebVTT standard a bit so we could provide these for the Rutherford Living History site. Fortunately, the standard contains the NOTE indicator: players ignore any comment block that starts with NOTE. We used this to make a couple of special tags that our player intercepts to give us the functionality we wanted. “NOTE paragraph” indicates a paragraph, and “NOTE chapter” indicates a second layer of chapter, a subhead.
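Roughly, a transcript file built this way looks like the excerpt below. The cue text and times are invented, and the exact layout of the tags here is illustrative rather than prescriptive. A standard player treats the NOTE blocks as comments and skips them; our player intercepts them to insert paragraph breaks and subheads.

    WEBVTT

    NOTE chapter Coming to Durham

    NOTE paragraph

    00:00:12.000 --> 00:00:18.500
    When I first came to Durham I had no idea the archive even existed.

    00:00:18.700 --> 00:00:24.000
    It took a chance meeting in the library to change that.

    NOTE paragraph

    00:00:24.300 --> 00:00:30.000
    After that, I spent every spare afternoon in the reading room.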

This made our transcripts much more readable and useful (readers can navigate by chapter and subhead) while still conforming to the WebVTT standard. I’m hoping the folks who shepherd the standard can see their way to adding these concepts.