
Turning Up The Volume
Covid-19 has made electronic-voice communication the norm, creating a spike in the amount of data that regulated firms have to monitor for compliance. In this article we chart the rise and rise of voice transcription as a way to identify risk in audio communication.
This article was featured in Issue 5 of GRIP Magazine, Global Relay’s exclusive publication focusing on Technology, Risk, and Compliance.
The pandemic and the consequent widespread adoption of remote work have forced financial institutions to reconsider where their operational risks lie. Electronic-voice communication across a range of sources and devices, as opposed to in-person meetings, has become the norm. This has led to a substantial increase in the volume of voice data that firms have to monitor.
The traditional approach of randomly picking voice files and listening to them has become ineffective. Transcription models have therefore become an important component in identifying risk arising from audio communications (aComms), because it is far more efficient to automatically monitor and surveil terabytes of text than of voice.
A brief history of voice transcription
Voice transcription is not a new technique. In the 1950s and 1960s, telephone companies transcribed small snippets of non-continuous speech, such as digits spoken over an analog telephone line, as part of research into automatic speech recognition. During this period we also saw the first probabilistic “phoneme” (smallest unit of speech) models appear, in which probabilities were calculated for a phoneme given the previous one.
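As a rough illustration of that idea, the short Python sketch below estimates the probability of a phoneme given the previous one from bigram counts. The phoneme sequences are invented toy data, not drawn from any real corpus:

```python
from collections import Counter, defaultdict

# Toy phoneme sequences for a few spoken digits (invented for illustration).
training_sequences = [
    ["s", "ih", "k", "s"],        # "six"
    ["s", "eh", "v", "ax", "n"],  # "seven"
    ["n", "ay", "n"],             # "nine"
]

# Count how often each phoneme follows another.
bigram_counts = defaultdict(Counter)
for seq in training_sequences:
    for prev, curr in zip(seq, seq[1:]):
        bigram_counts[prev][curr] += 1

def p_next(prev: str, curr: str) -> float:
    """P(curr | prev): relative frequency of curr following prev."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(p_next("s", "ih"))  # 0.5: "s" is followed by "ih" once out of twice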
These techniques were limited in their application because they did not address temporal issues such as the variable length of spoken words and phonemes that blend into each other.
The 1970s and 1980s saw the development of the first Markov language models (which words follow which, and with what probability), which led to larger vocabularies and advances in search algorithms like beam search, which keeps only the most promising candidate word sequences at each step rather than exploring every possibility. These concepts still exist, in more advanced forms, in today's transcription models.
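To make beam search concrete, here is a minimal Python sketch of the idea over a bigram word model. The words and probabilities are invented for illustration, and a real speech decoder would also weigh acoustic evidence alongside the language model:

```python
import math

# Hypothetical bigram language model: P(next word | previous word).
bigram = {
    "<s>": {"buy": 0.6, "by": 0.4},
    "buy": {"shares": 0.7, "now": 0.3},
    "by":  {"shares": 0.2, "now": 0.8},
}

def beam_search(start: str, steps: int, beam_width: int):
    # Each hypothesis is (word sequence, log probability).
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for word, p in bigram.get(seq[-1], {}).items():
                candidates.append((seq + [word], logp + math.log(p)))
        # Keep only the beam_width most likely hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, logp in beam_search("<s>", steps=2, beam_width=2):
    print(" ".join(seq[1:]), round(math.exp(logp), 3))
# buy shares 0.42
# by now 0.32
```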
The 1990s and 2000s saw the revival of neural network techniques thanks to gains in computational performance. These techniques learn temporal speech patterns directly and outperformed the acoustic-to-word-or-phoneme matching systems developed in previous decades.
Towards the end of the 2000s, transcription word error rates reached a point where voice transcription became commercially viable on an enterprise scale in real-world scenarios. Modern transcription techniques (sometimes referred to as automatic speech recognition) have built on these ideas and solved some of the more challenging problems, such as incorporating context and managing the temporal alignment of speech and words.
An important modern advancement in this domain has been the wider sharing of labelled training data and the rise of advanced, open-source models. Both of these factors have led to a surge in research and development that has driven word error rates, in both quiet and noisy scenarios, down to levels that rival human transcription.
However, a modern voice compliance and supervision system has many moving and evolving parts that all affect the quality of business insight that such a system should deliver.
From an engineering perspective, a readily scalable and performant capture and transcription architecture is key. The volume of voice data has increased rapidly during the pandemic and, along with it, the risk that this content contains; a long lag between voice origination and monitoring is of little use.
There are also moral and ethical considerations: voice files and transcriptions relate to people, so they need to be stored and managed securely. And because potentially non-compliant words are transcribed and attributed to individuals, it is essential to ensure good-quality transcription through an auditable process of continuous learning that corrects misrecognized words and speakers.
Transcription quality is affected by factors ranging from the choice of microphone to the language being spoken and the accents of the speakers. Some languages, like English, have an abundance of supervised training data while other, less common languages do not, which affects transcription quality. Some people and words may require more accurate transcription than others, and accuracy can be checked by sampling canned phrases with known wording, as sketched below.
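One way to quantify that check (a sketch, assuming known "canned" test phrases are periodically run through the transcription engine) is to compute the word error rate of the engine's output against the expected text:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: (substitutions + insertions + deletions) / reference length,
    computed with standard edit-distance dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Canned test phrase vs. what the transcription engine produced.
print(word_error_rate("move the trade to the book",
                      "move a trade to the book"))  # ~0.167
```

Tracking this rate per language, accent, or audio source shows where transcription quality most needs attention.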
Putting compliance into context
Often data scientists focus on transcription quality because they can manage some of these factors. But in the broader setting of compliance and supervision, transcription only tells half the story.
Context is as vital in audio compliance as in e-communication compliance. Knowing who said what, when, to whom, using which device, and from where adds a great deal of behavioral context. An easy metric to calculate from this data is who the major participants in a Zoom meeting are and whether there is an individual who constantly interrupts others; see the sketch below. The raw voice and transcription data can also be analyzed to assess a speaker's emotional state, which offers insights for business functions outside of compliance and supervision.
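For example, given diarized segments of the kind most transcription pipelines emit (speaker label, start time, end time; the data below is hypothetical), a few lines of Python can surface talk-time shares and interruptions:

```python
from collections import Counter

# Hypothetical diarized segments from a call: (speaker, start_sec, end_sec).
segments = [
    ("alice", 0.0, 12.0),
    ("bob", 10.5, 15.0),    # starts before alice finishes: an interruption
    ("alice", 15.0, 30.0),
    ("carol", 29.0, 34.0),  # interrupts alice
]

talk_time = Counter()
interruptions = Counter()
for i, (speaker, start, end) in enumerate(segments):
    talk_time[speaker] += end - start
    # Count an interruption when a segment starts before the previous one ends.
    if i > 0 and start < segments[i - 1][2] and speaker != segments[i - 1][0]:
        interruptions[speaker] += 1

total = sum(talk_time.values())
for speaker, secs in talk_time.most_common():
    share = 100 * secs / total
    print(f"{speaker}: {secs:.1f}s ({share:.0f}% of talk time), "
          f"{interruptions[speaker]} interruptions")
```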
Having a secure, scalable, and performant engineering platform to generate high-quality, contextually rich transcriptions gives organizations the ability to identify where the business risk is in their audio content. Without all of these pieces in place, they remain in the realm of random sampling and the uncertainty and doubt that it brings.
GRIP offers a unique blend of perspectives for corporates and regulated entities on the latest developments that impact technology, risk and compliance.