Online Conferences and Automatic Transcripts

I went to a very well-managed remote conference last week – the Collaborative Journalism Summit (CJS).

The CJS organizers used Zoom Webinar for the conference and Zoom Meetings for networking breaks throughout the day. They used the chat channel for instructions and comments and the Q&A channel for questions for speakers. A moderator organized the questions in the Q&A channel and put them to each speaker; after a speaker’s time was up, the speaker could continue to answer questions in the Q&A channel.

Behind the scenes, they made sure each speaker was ready 15 minutes ahead of time, and if a speaker had trouble during a presentation a technician would jump in with instructions or remote help.

They also used some other tools: a Google doc for live notetaking, Otter.ai for live automatic speech recognition (ASR) transcripts of all the talks, and a Padlet “Asks + Offers” board.

The follow-up email I got this week contained not only all the conference materials, as promised, but a link to a Medium article detailing how they pulled off such an excellent remote conference.

Automatic Transcripts

I’ve been thinking about the automatic transcripts part, because I think we’re going to see a lot more remote conferences with automatic speech recognition baked in. I appreciate that automatic transcriptions and conferencing software are becoming integrated and that the Collaborative Journalism Summit folks connected to Otter to do the transcription. But I think there’s more we can do to improve the automatic transcription experience.

Automatic transcriptions have errors, and will continue to have errors for some time. Even 99% accurate means one wrong word in every hundred. Try writing a 500-word piece and allowing someone to replace five words randomly. Dial down to 97% accurate and you’ve got 15 wrong words.
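
To make the arithmetic concrete, here’s a quick Python sketch of that thought experiment. The 500-word “piece” is just a placeholder list and the substituted word is a stand-in, but the error counts come out the same way:

```python
import random

def corrupt(words, accuracy):
    """Randomly replace words to simulate a given ASR accuracy."""
    out = list(words)
    n_errors = round(len(out) * (1 - accuracy))
    for i in random.sample(range(len(out)), n_errors):
        out[i] = "[wrong]"  # stand-in for a misrecognized word
    return out

piece = ["word"] * 500  # placeholder for a 500-word piece
for accuracy in (0.99, 0.97):
    wrong = sum(w == "[wrong]" for w in corrupt(piece, accuracy))
    print(f"{accuracy:.0%} accurate: {wrong} wrong words out of {len(piece)}")
# 99% accurate: 5 wrong words out of 500
# 97% accurate: 15 wrong words out of 500
```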

A transcript with only a few wrong words is good enough to be trusted, and that’s exactly what makes it trouble. It’s easy to absorb information you don’t know you’re absorbing. So even if you’re looking through a transcript to find something else, you might absorb some information that a speech recognition error has changed – for instance, a flipped can/can’t, which is a fairly common error because “can” and “can’t” sound so much alike. Try saying them out loud a few times each.

There’s no such thing as a spellcheck for speech recognition, because the software can’t tell when it’s made an error – you already have its best guess. Today’s best speech recognition software is still going to get “can” and “can’t” mixed up every so often, and it’s not going to know which instances it got wrong.

Spellcheck Rethink

So we need to rethink spellcheck for speech recognition.

I think there’s a simple way to make correcting automatic transcriptions more practical and efficient. Get the computer to mark words that are commonly misrecognized by automatic speech transcription. Can/can’t and does/doesn’t would be on that list.

Write a note at the top of the file saying “This document has been automatically transcribed by computer. The computer has also marked words that are easily misrecognized by a computer.”

If common-mistake flagging is paired with software that lets users click on a sentence in the transcript to hear that sentence, users could quickly listen and verify or correct each flagged word. There may still be errors in the document beyond what’s flagged, but this would be an efficient way to remove some of the most dangerous ones.

The bells and whistles version would have a default list of common words to flag, and also a way for users to plug in a custom list of words.
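
Here’s a rough sketch of what that flagging could look like, covering both the default list and a user-supplied custom list. The word list, the [[double-bracket]] markers, and the function names are all illustrative choices of mine, not anyone’s actual product:

```python
import re

# An illustrative default list of words that ASR commonly gets wrong --
# the pairs mentioned above, with room to grow.
DEFAULT_FLAG_WORDS = {"can", "can't", "does", "doesn't"}

def flag_risky_words(transcript, custom_words=None):
    """Wrap easily-misrecognized words in [[...]] so a reviewer can
    go back to the audio and verify or correct each one."""
    words = DEFAULT_FLAG_WORDS | set(custom_words or [])
    # Longest words first so "can't" is matched before "can".
    alternation = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", transcript)

print(flag_risky_words("We can share the slides, but we can't share the video."))
# We [[can]] share the slides, but we [[can't]] share the video.
```

In a real tool, each marker would presumably link back into the audio, per the click-to-listen idea above.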

There’s already some precedent for this. The Speechmatics ASR service lets users provide a list of words that are likely to appear in the audio, and that improves the recognition.
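
For comparison, a custom word list in that style looks roughly like the config below. I’m sketching from memory here, so treat the field names (additional_vocab, sounds_like) and the overall shape as assumptions rather than a verified spec:

```python
# Sketch of a custom-vocabulary job config in the Speechmatics style.
# Field names ("additional_vocab", "sounds_like") are from memory and
# should be treated as assumptions, not a verified spec.
job_config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "additional_vocab": [
            {"content": "Padlet"},
            {"content": "Otter.ai", "sounds_like": ["otter dot a i"]},
        ],
    },
}
```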

This all falls under one of my favorite process tenets – tapping computers to do what computers are good at and humans to do what humans are good at. Here’s hoping that as we increasingly tap automatic speech recognition, we can also keep down the trouble it might cause.