History books may well view voice control as the most important advance made in the Human Machine Interface. No more typing, no more pointing; we just say what we want. Initial progress in this area limped along until the advent of smart speakers, when we started to realize what might be possible. Now the race is on, with improvements in recognition, features and applications in phones, headsets, hearables and the smart home. The most widely known solutions today depend on platforms and services controlled by a small number of providers, but that is changing. Voice activation can be embedded anywhere, with customization, improved noise immunity, lower power and longer range, while remaining just as effective at speech recognition as the big platforms.
The consumer audio market, where this capability plays an important part, has an interesting history. FutureSource shows that from 2008 to 2012, dollar volume declined as audio experiences consolidated primarily on smartphones. From 2012 to 2014, the market remained essentially flat. Then from 2015 through 2018, it grew again at a CAGR of 15%, driven primarily by voice activation. Looking forward, Yole Développement anticipates a minimum of 30% CAGR through 2023, driven predominantly by speech recognition. The bulk of this growth will continue to be in smartphones, followed by headsets and hearables, personal assistants and smart home features (TVs, appliances, etc.). The same report concludes that we are now entering a second phase in smart audio, where voice control will become much more pervasive, as consumers become more comfortable with this method of control.
Wherever voice control is deployed, the goal is to enhance differentiation. In a smartphone or any other battery-operated device, an obvious advantage is to support always-on listening, with no need to push a button before you give a command. This requires ultra-low-power trigger-word detection, which in turn means hardware with closely matched software to minimize standby power. Naturally, you want to personalize trigger words or phrases for your brand, and in multiple languages, to get strong penetration in your region and perhaps in the international market as well. You might still pass subsequent commands to one of the main voice recognition providers to unpack the request. Or perhaps not: if your appliance only needs a limited vocabulary and your speech recognition engine can be stretched to cover it, you might not need help from a third party.
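One common way to keep always-on listening cheap is to stage the work, so that an inexpensive check runs on every audio frame and the costlier trigger-word detector runs only when that check passes. The sketch below illustrates the idea with a simple energy gate. It is an assumption-laden illustration, not a description of any particular product: `frame_energy`, `always_on_listen`, the threshold value and the `detect_trigger` callback are all hypothetical names introduced here.

```python
from typing import Callable, List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)


def always_on_listen(
    frames: Sequence[Sequence[float]],
    energy_threshold: float,
    detect_trigger: Callable[[Sequence[float]], bool],
) -> List[int]:
    """Return the indices of frames on which the trigger detector fired.

    Stage 1 (cheap): an energy gate evaluated on every frame.
    Stage 2 (costly): the hypothetical trigger detector, called only
    for frames whose energy reaches the threshold.
    """
    hits = []
    for i, frame in enumerate(frames):
        if frame_energy(frame) < energy_threshold:
            continue  # quiet frame: skip the expensive stage
        if detect_trigger(frame):
            hits.append(i)
    return hits
```

In a real device the second stage would be a trained model rather than a callback, and the gate would more likely be a voice-activity detector than a raw energy threshold, but the power argument is the same: most frames never reach the expensive stage.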
Another critical need is recognition, and perhaps authentication, in a noisy environment. Voice recognition presents different challenges from those in object recognition. In a living room or a car, for example, there can be multiple sound sources: people talking, TV and independent music/radio sources, interior and exterior noise, and echoes of all these from surfaces in a room or a car’s interior. Isolating the source of a command, cancelling echoes and reducing background noise require sophisticated technology that depends on multiple microphones, beamforming and echo cancellation, along with noise suppression.
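To make the beamforming idea concrete, here is a minimal sketch of delay-and-sum, the simplest textbook beamformer: each microphone channel is shifted to undo its arrival delay for the chosen direction, and the aligned channels are averaged, so sound from that direction adds coherently while other sources partially cancel. The function name `delay_and_sum` and the use of whole-sample delays are assumptions made for illustration; production voice-pickup front ends use more sophisticated adaptive methods.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align and average microphone channels.

    channels: equal-length sample lists, one per microphone.
    delays:   arrival delay of the target source at each microphone,
              in whole samples, relative to the earliest microphone.
    Samples that would be read past the end of a channel count as zero.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d  # advance the channel to undo its delay
            if idx < n:
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


# A short pulse reaches the second microphone one sample later.
mic0 = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0, -1.0, 0.0]
steered = delay_and_sum([mic0, mic1], [0, 1])    # delays match the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # delays ignored
```

With the correct delays the pulse is recovered at full amplitude; with the delays ignored, the two copies no longer line up and the peak is halved.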
Those are the needs, and solutions such as CEVA’s are available to address them. The recently introduced CEVA WhisPro™ phrase recognition product uses neural-net-based software running on CEVA DSP platforms. WhisPro already supports “Alexa” and “OK Google” as voice triggers, and it can be customized in training to support any customer-requested triggers. It supports multiple languages and can handle multiple voice triggers. Training is performed with multiple noise backgrounds, so recognition has built-in noise immunity, delivering a recognition rate above 95% and fewer than one false acceptance per hour, without the need for cloud verification.
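The two figures quoted above are measured differently: recognition rate is a fraction of the trigger phrases actually spoken, while false acceptance is a count per hour of audio. A minimal sketch of how such figures might be computed from a test run follows. The function names and the example counts are hypothetical; only the 95% and one-per-hour thresholds come from the text.

```python
from typing import Tuple


def trigger_metrics(
    true_triggers: int, detected: int, false_accepts: int, hours: float
) -> Tuple[float, float]:
    """Return (recognition rate, false acceptances per hour).

    true_triggers: trigger phrases actually spoken in the test audio
    detected:      how many of those the detector caught
    false_accepts: detections fired when no trigger was spoken
    hours:         total duration of the test audio
    """
    return detected / true_triggers, false_accepts / hours


def meets_targets(
    recognition_rate: float,
    fa_per_hour: float,
    min_rate: float = 0.95,
    max_fa_per_hour: float = 1.0,
) -> bool:
    """Check results against the thresholds quoted in the text."""
    return recognition_rate > min_rate and fa_per_hour < max_fa_per_hour


# Hypothetical test run: 200 spoken triggers over 10 hours of audio.
rate, fa = trigger_metrics(true_triggers=200, detected=192, false_accepts=3, hours=10)
ok = meets_targets(rate, fa)
```

The two numbers trade off against each other: lowering the detector's decision threshold raises the recognition rate but also raises false acceptances, so both have to be reported together.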
By adding a specialized voice-pickup solution, CEVA ClearVox™, developers can achieve multi-microphone support and beamforming for improved far-field voice pickup, along with echo cancellation and further noise reduction. Pairing WhisPro with ClearVox delivers competitive trigger recognition at greater distance (up to 7 meters), especially in noisy environments.
Published on Embedded.com