Written by Dr Pradeep Kishore
Infosys BPM experts discuss the role of audio annotation in building intelligent voice-operated devices like chatbots and virtual assistants.
No one bats an eyelid when Alexa recounts the value of pi accurately up to a hundred digits or when she directs us to the nearest pizza joint. But we are floored when she comes up with some whacky responses to bizarre questions, such as:
‘Alexa, who’s your best friend?’
‘I have a really strong connection to your Wi-Fi!’
How is a virtual assistant aware that ‘pi’ is not a food item or a consumer good? How does it read ‘joint’ in the right context? And how does it go beyond semantics to detect humour, satire, sarcasm, and wit, and deliver case-appropriate responses to open-ended questions?
Alexa, Siri, Cortana, and all other voice assistants owe their near-human-like qualities to audio annotation — a next-generation data annotation technique that makes ‘listening machines’ so intelligent that we even assign a gender to them.
What is audio annotation?
Audio annotation is a subset of data labelling. Audio annotators add metadata to audio recordings, making them machine-readable and fit for training natural language processing (NLP) systems. The recorded sounds are derived from diverse sources such as human speech, animals, vehicles, musical instruments, or the environment. Data engineers minutely segregate and label audio files and describe them by adding critical semantic, phonetic, morphological, and discourse data. The annotated audio is then fed into the NLP application being trained, enabling it to make sense of the sounds.
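To make the process concrete, the result of this labelling work is often a structured record attached to each audio file: time-stamped segments, transcripts, speaker tags, and emotion or sound labels. The sketch below shows what such a record might look like; the file name, field names, and label values are all illustrative, not a real annotation standard.

```python
import json

# A hypothetical annotation record for one audio file. Each segment carries
# start/end times in seconds plus the metadata layers described above.
annotation = {
    "audio_file": "call_0042.wav",
    "segments": [
        {
            "start": 0.0,
            "end": 2.4,
            "transcript": "Where is my order?",
            "speaker": "customer",
            "labels": {"language": "en", "emotion": "frustrated"},
        },
        {
            "start": 2.4,
            "end": 3.1,
            "transcript": None,  # non-speech segment
            "speaker": None,
            "labels": {"sound": "background_traffic"},
        },
    ],
}

# Annotated audio is typically exported in a serialised form such as JSON
# before being fed into the training pipeline.
print(json.dumps(annotation, indent=2))
```

Formats vary widely between annotation tools, but the idea is the same: the raw waveform stays untouched while the metadata makes it machine-readable.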
Why do we need audio data annotation?
Audio data annotation is vital for industries that deploy chatbots, virtual assistants, and voice recognition systems to automate workflows and enhance the overall customer experience. These bots need to make sense of human speech or voice commands across multiple situations and respond appropriately to them.
Let us consider a common scenario: an irate customer is enquiring about a delivery delay with a grocery chain’s voice assistant. The bot cannot simply enumerate the reasons for the delay. First, it must tender an apology, and it must end the conversation with an offer to compensate for the delay. In such instances, the machine learning (ML) model must not only identify the language but also factor in dialect, intonation, emotion, and speaker demographics. It must not only answer questions but also recognise the speaker’s intentions, address emotions, and suggest viable solutions.
Types of audio annotation
Audio annotation services vary with the AI model being trained. Broadly speaking, audio annotation falls into one of six categories:
1. Speech-to-text transcription: The recorded audio, including speech, sounds, and punctuation, is converted to text. This is crucial in auto-generating transcripts and captions for businesses, or in technologies enabling users to control their devices with voice commands.
2. Audio classification: Audio classification works on segregating voices from sounds. This type of annotation is important when the AI model needs to distinguish the human voice from ambient noise, such as the sounds of traffic, machines, or a downpour.
3. Speech labelling: It isolates specific sounds from others in an audio file and tags them with keywords. This annotation type is used in developing chatbots for specific, repetitive tasks.
4. Natural language utterance: As the term suggests, it is about annotating the minutest details of natural human speech, such as dialect, intonation, semantics, context, and emotion. This process is crucial in building interactive bots and virtual assistants.
5. Music classification: Annotators mark the music genres, instruments, and ensembles. This is useful in organising music libraries and auto-suggesting recommendations for music lovers.
6. Event tracking: Event tracking is a technique to segregate and annotate sounds generated from multisource conditions — for instance, the sounds of a busy street, where each sound component is rarely heard in isolation. This is crucial in use cases where the user has little or no control over sound sources.
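As a toy illustration of the second category above, audio classification, consider separating voiced segments from ambient noise. Production systems use trained models; in this sketch a simple short-time energy threshold stands in for the classifier, and the threshold value and sample data are purely illustrative.

```python
def short_time_energy(samples):
    """Mean squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)

def classify_segment(samples, threshold=0.01):
    """Label a segment 'speech' or 'noise' by its energy (toy heuristic)."""
    return "speech" if short_time_energy(samples) > threshold else "noise"

# Two fabricated segments: one loud (speech-like), one quiet (noise-like).
speech_like = [0.3, -0.4, 0.5, -0.2]
noise_like = [0.01, -0.02, 0.015, -0.01]

print(classify_segment(speech_like))  # speech
print(classify_segment(noise_like))   # noise
```

Annotators effectively supply the ground-truth labels that let a real model learn a far more robust version of this decision boundary.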
Leveraging growth through audio annotation
The NLP market, which is already significant, is slated to grow at a compound annual growth rate (CAGR) of 25% in the next few years, drawing in over $43 billion in revenue by 2025. The efficiency of NLP-based applications depends directly on their annotation quality — the better the annotation, the more intelligent the machine is.
Whether for customer service chatbots, GPS navigation systems, voice-activated speakers, or security systems with sound recognition, audio annotation services are crucial in building machines that not only ‘listen’ but also respond, empathise, entertain, guide, or counsel. Hence the need for high-quality audio annotation tools and services in the market.
For organisations on the digital transformation journey, agility is key in responding to a rapidly changing technology and business landscape. Now more than ever, it is crucial to deliver and exceed organisational expectations with a robust digital mindset backed by innovation.
We help client data science teams build high-quality ‘training data’ for AI at scale, using a platform plus human-in-the-loop service model. This frees the teams to focus on strategic priorities, such as refining and improving the AI model.
Articles under 'Fortune India Exchange' are either advertorials or advertisements. Fortune India's edit team or journalists are not involved in writing or producing these pieces.