For many years now, every new rev of speech recognition software has heralded increased accuracy: 15 percent more accurate, or even 20 percent more accurate.
But when accuracy is already in the high 90s, 15 percent of the remaining few percentage points doesn’t amount to much. Even with accuracy rates in the 80s, shaving 15 percent off a 15 percent error rate is fairly subtle. There’s also a danger of expecting the percentages to add up: every year it’s 15 percent more accurate, and these announcements have been happening every year or two for 20 years, so it must be 100 percent by now…
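To put numbers on that, here’s a quick back-of-the-envelope sketch, assuming the claim means a 15 percent reduction in the error rate, which is the usual reading:

```python
# Back-of-the-envelope: what a "15 percent more accurate" claim means in
# absolute terms, assuming it describes a 15 percent cut in the error rate.
def improved_accuracy(accuracy: float, relative_error_cut: float) -> float:
    """Return the new accuracy after reducing the error rate by a fraction."""
    error = 1.0 - accuracy
    return 1.0 - error * (1.0 - relative_error_cut)

print(improved_accuracy(0.96, 0.15))  # 0.966  -- about half a point gained
print(improved_accuracy(0.85, 0.15))  # 0.8725 -- just over two points gained
```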
Accuracy rates also vary widely among different people – for many different reasons.
Computer speed and microphone quality make a big difference. Background noise affects recognition speed and accuracy a bit if you have a good noise-canceling microphone, and more if you don’t. User habits, like protecting your speech profile by always turning off the microphone when it’s not in use, make a big difference. So does maintaining your vocabulary: training odd words that Dragon often gets wrong, and taking the occasional look at your custom vocabulary to weed out garbled words or capitalizations that shouldn’t be in there.
Computer maintenance also makes a difference. If you’re using a hard drive, keeping it defragmented is especially important for speech users: speech will slow down if the speech profile, which is updated often, gets fragmented, meaning split up and saved in different places on the drive. If you’re using a solid-state drive, keep it less than half full. An SSD is faster than a hard drive as long as it has that much free space, but it slows down rapidly once it fills past that mark.
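If you want to check where a drive stands against that rule of thumb, Python’s standard library makes it easy; the drive path below is just a placeholder for wherever your speech profile lives:

```python
import shutil

# Report how full a drive is, against the keep-it-under-half-full rule of
# thumb above. "C:\\" is a placeholder; point it at the drive that holds
# your speech profile.
usage = shutil.disk_usage("C:\\")
fraction_used = usage.used / usage.total
print(f"Drive is {fraction_used:.0%} full")
if fraction_used > 0.5:
    print("More than half full -- worth freeing up space if this is an SSD.")
```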
The issues are a bit simpler with automatic transcription from an audio file. It’s all about how clean the recording is. If there’s a lot of background noise, or a buzz that makes it hard for you to hear, that’s going to affect how the automatic speech recognition engine interprets it as well.
There are several types of incremental improvements in the speech recognition technology that drives speech input and automatic transcription.
- The engine can be improved to be faster and more accurate.
- The interface can be improved to make the experience simpler or more comfortable for users.
- The quality of the audio signal that the engine starts with can be improved.
- The quality of the information that the speech engine has at its disposal can be improved.
I was happy to see a simple improvement to the Speechmatics automatic transcription engine that falls into the fourth category.
The user can now submit a list of custom words along with an audio file. The words get plugged into the automatic transcription engine as likely vocabulary for that file. You’d use this for things like names, places and jargon, and it’s especially useful if you have a whole set of interviews that share the same specialized terms.
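As a rough illustration, here’s what that might look like in code. This is a minimal sketch based on my reading of Speechmatics’ public batch API; the endpoint, the `additional_vocab` field, the file name and the API key are assumptions and placeholders, not details from the announcement:

```python
import json
import requests

# A minimal sketch of submitting an audio file plus a custom word list to
# Speechmatics for transcription. Endpoint and field names follow my reading
# of the v2 batch API and should be treated as assumptions; the API key,
# file name and example words are placeholders.
API_KEY = "YOUR_API_KEY"

config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        # The custom words: names, places and jargon the engine should
        # treat as likely vocabulary for this file.
        "additional_vocab": [
            {"content": "Speechmatics"},
            {"content": "Trint"},
            # A hypothetical name, with a hint at how it's pronounced:
            {"content": "Tanecka", "sounds_like": ["tah neck ah"]},
        ],
    },
}

with open("interview.wav", "rb") as audio:
    response = requests.post(
        "https://asr.api.speechmatics.com/v2/jobs/",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"data_file": audio},
        data={"config": json.dumps(config)},
    )

print(response.json())  # returns a job id you can poll for the transcript
```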
This is one of those incremental improvements that make a small, positive difference in accuracy and also make proofing more comfortable. There will be fewer corrections to make on an automatic transcription proof.
The Speechmatics engine is used in several different automatic transcription products. It powers Trint, and it powered Pop-up Archive before that service was shut down following its acquisition by Apple.