In this blog, we will discuss the why and how of deploying voice control on low-power, resource-constrained Digital Signal Processors (DSPs) and microcontroller units (MCUs), and how it translates into real-world product innovation.
But first, let’s define a few core concepts: HCI (Human-Computer Interface), Voice User Interface (VUI), and Voice Control:
- HCI (Human-Computer Interface) is a well-defined concept describing the point of communication between a human user and a computer. Communication channels can be classified by the human senses they engage: vision, hearing, touch, and so on.
- Voice User Interface (VUI) makes it possible for humans to communicate with machines using voice. Machines may employ some form of speech recognition to translate human speech to commands and queries.
- Voice Control is an implementation of a VUI, allowing a human to use simple, concise commands to operate a device or appliance.
VUIs have been around for some time and have become very popular in recent years thanks to devices such as Amazon Echo, Google Home, and Apple HomePod, and their associated voice assistants, which are also deployed on smartphones, TVs, cars, and other devices.
Most of these devices rely on complex, cloud-based speech recognition engines. These engines handle complex human speech, allowing users to interact with machines using natural language.
However, these abilities come with a multi-faceted price tag:
- User privacy is compromised, as user queries are uploaded to the cloud for processing and stored there for various lengths of time (from hours to months, depending on the service provider)
- The device must have a cloud connection to operate
- Cloud processing is often energy-consuming and slower than on-device processing
- Device BOM costs soar, as relatively complex connectivity hardware must be integrated into the device, often requiring major design modifications.
The price tag of a fully-fledged cloud-based voice assistant can be alleviated, for many use cases, by deploying a small, task-optimized voice control engine on a battery-operated, resource-constrained, offline DSP- or MCU-enabled device. Voice control powered by a small, dedicated VUI engine can be realized on a simple DSP/MCU-based hardware module serving as a drop-in replacement for existing controls (knobs, buttons, touch screens, etc.).
While MCUs are usually capable enough to run a dedicated VUI engine, DSPs such as CEVA’s BX audio DSP family are much more efficient at doing so, resulting in lower power consumption and opening a window to adding further AI-based features to the device. In fact, a CEVA-BX DSP core can consume up to 5x fewer MCPS than a typical MCU.
Naturally, there are limitations to the capabilities of such a solution, but as we will shortly see, for many tasks and use cases these limitations are outweighed by the benefits.
The major limitation of voice control implementations for DSPs/MCUs is their limited vocabulary support: only a small set of words can be recognized, and the user must remember these words to operate the device properly. In other words, the user cannot use natural language and must phrase requests using the supported words and commands. For example, “play the next song” might not be recognized by a system configured to detect the command “next song” or even just “next”.
This limitation has a flip side – simplicity. Using short, concise commands greatly reduces the risk of the device “misunderstanding” a command due to ambient noise or other interference. This becomes very evident when considering the tasks voice control on a DSP/MCU is designed to handle.
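To make the constrained-vocabulary behavior concrete, here is a minimal sketch of how such an engine maps recognized phrases to device actions. The command names and function are hypothetical illustrations, not WhisPro’s actual API: the point is that recognition is an exact match against a small, fixed set, so natural phrasings fall through.

```python
# Hypothetical command table for a constrained-vocabulary engine.
# Only these exact phrases are recognized; everything else is rejected.
COMMANDS = {
    "next": "CMD_NEXT_TRACK",
    "next song": "CMD_NEXT_TRACK",
    "stop": "CMD_STOP",
}

def resolve(utterance: str):
    """Exact match only: natural phrasing like 'play the next song' is rejected."""
    return COMMANDS.get(utterance.strip().lower())

assert resolve("Next song") == "CMD_NEXT_TRACK"
assert resolve("play the next song") is None  # natural language is not supported
```

The flip side noted above shows up here as well: a tiny closed set of phrases leaves far less room for a noisy environment to trigger the wrong action.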
Let’s review some use-cases:
Major Home Appliances
Many major appliances with button/knob/touch interfaces are also operated with dirty or wet hands (ovens, hobs, washing machines, dishwashers). Voice control deployed on a DSP/MCU-powered hardware module can prove very useful in keeping the appliance clean and easily operable (have you ever tried to operate a touch interface with wet fingers?). From a manufacturing standpoint, voice control deployed on a mass-produced DSP/MCU-powered hardware module can serve as a drop-in replacement for existing buttons, knobs, and touch interfaces with minimal integration costs.
Robot Vacuum Cleaners
RVCs can operate independently or via remote controls (which always get lost…). A DSP/MCU voice control module supporting just a few commands (“clean kitchen”, “stop”, “go charge”) can significantly improve the user experience with a small impact on BOM and cost, while performing better than a cloud-based voice assistant, which often struggles with noisy environments and short commands.
Public Kiosks and Vending Machines
With Covid-19, hygiene became a major concern, especially in public spaces. A DSP/MCU voice control module can provide an effective, low-cost option for upgrading existing machines while catering to public health. Supported commands can be displayed/printed on the device to compensate for the lack of natural language support while lowering error rates.
Wearables, Hearables and other Tiny Devices (TWS and Hearing Aids)
This device class is characterized by a limited power supply (small batteries, rendering a continuous cloud connection impractical), limited compute resources (rendering large-vocabulary speech recognition engines impractical) and limited surface space (rendering buttons and tap interfaces inconvenient) – which makes DSP/MCU-powered voice control an ideal solution.
IR Remote Control with voice control (for TVs, Home Entertainment and HVAC systems)
A remote control is the preferred interface for operating TVs, home entertainment systems, A/C systems, ceiling fans and any other device that is out of reach. Adding an on-device VUI to remote controls allows better personalization (e.g., with speaker verification, smart TV apps such as Netflix can be made to start up with the user’s profile) and can also solve the “looking for the remote“ hassle. Aftermarket universal voice-controlled remotes can offer an easy upgrade for older systems.
What Makes Up a Good Voice Control Solution?
A DSP/MCU-powered voice control solution must address some key challenges to be considered an efficient, effective and reliable alternative to existing interfaces (knobs, buttons, touch):
- Quality of Service – the probability that the voice control engine will “understand” (correctly detect) the uttered command or word. Two types of errors exist: False Accepts and False Rejects. User sensitivity to each type of error may vary with the use case, and the voice control engine must be tuned accordingly. In general, users expect a True Acceptance Rate of 95% or higher and no more than 1 False Accept per 24 hours. In other words, VUI performance should be such that the user would not bother to reach for the remote or button.
- Noise robustness – the ability to provide high-quality detection in the noisy environments all the cases reviewed earlier operate in (some of which are themselves the noise source). A good VUI implementation is expected to show perceivable performance degradation only at SNR levels below 5 dB.
- Power and compute requirements – these are critical in determining whether a candidate implementation is suitable for the use case. For battery-operated implementations, power consumption should be in the milliwatt range. Such a VUI implementation should be able to run on a CEVA-BX core consuming less than 10 MCPS and 80 KB of memory; on a Cortex-M0+ or similar MCU it would consume less than 50 MCPS and 80 KB of memory.
- Security – an MCU voice control solution may be expected/required to respond selectively to commands issued by specific persons. This can be realized with speaker verification technology integrated into the system.
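The two quality-of-service metrics above – True Acceptance Rate and False Accepts per 24 hours – can be computed from labeled detection logs. The sketch below is an illustration of the arithmetic only; the event representation and function name are our own assumptions, not part of any particular engine’s API.

```python
# Illustrative computation of the two QoS metrics discussed above.
def qos_metrics(events, hours_monitored):
    """events: list of (command_spoken: bool, detected: bool) pairs."""
    true_accepts = sum(1 for spoken, det in events if spoken and det)
    false_rejects = sum(1 for spoken, det in events if spoken and not det)
    false_accepts = sum(1 for spoken, det in events if not spoken and det)
    commands_spoken = true_accepts + false_rejects
    tar = true_accepts / commands_spoken if commands_spoken else 0.0
    fa_per_24h = false_accepts * 24.0 / hours_monitored
    return tar, fa_per_24h

# 96 correct detections, 4 misses, 2 spurious triggers over 48 hours:
events = [(True, True)] * 96 + [(True, False)] * 4 + [(False, True)] * 2
tar, fa = qos_metrics(events, hours_monitored=48.0)
# tar = 0.96 (above the 95% target), fa = 1.0 False Accept per 24 hours
```

Note that the two metrics pull in opposite directions: lowering the detection threshold raises TAR but also raises False Accepts, which is why tuning per use case matters.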
VUI for MCUs Implementation Challenges
Building a competitive VUI engine is a game of balancing multiple (and often opposing) constraints:
- Quality of service (True Acceptance Rate vs False Accepts per Hour)
- Robustness to noise
- Robustness to reverberation
- Extremely limited compute and memory resources
- Robustness to accents
- Data acquisition costs
In deep learning research, a common way to boost model performance is to increase model complexity and the amount of training data. Such techniques are not applicable in the “real world”, where the goal is to build a model (a VUI engine in this case) targeting DSPs/MCUs with very limited resources (model complexity must be kept to a bare minimum) in an economical fashion (data acquisition resources are limited).
The pressure exerted by these constraints marked the beginning of a journey researching model size reduction techniques and advanced data engineering methods aimed at making the most of limited data acquisition resources.
Along the way, various model size reduction techniques were researched:
- Post-training quantization and quantization-aware training
- Structured and unstructured pruning
- Low-rank approximation and sparsity
- Knowledge distillation
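As a flavor of what one of these techniques looks like, here is a minimal NumPy sketch of unstructured magnitude pruning: the smallest-magnitude weights of a layer are zeroed, trading a controlled accuracy loss for sparsity. This is a generic illustration of the technique, not the actual method or thresholds used in any production engine.

```python
import numpy as np

def prune_unstructured(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy with (at least) the smallest `sparsity` fraction of weights zeroed."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([[0.5, -0.01, 0.2],
              [-0.03, 0.9, 0.04]])
p = prune_unstructured(w, sparsity=0.5)
# The three smallest magnitudes (0.01, 0.03, 0.04) are zeroed;
# 0.5, 0.2 and 0.9 survive.
```

Structured pruning works on the same principle but removes whole channels or neurons, which maps better to the regular compute patterns DSPs are optimized for.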
While solving for compute and memory footprints, model performance was pushed ever higher by:
- Researching multiple audio signal processing techniques
- Researching multiple feature extraction techniques
- Testing different model architectures from CNNs to RNNs and transformers
- Researching and experimenting with a wide array of audio data engineering methods from effective and efficient data collection procedures to data augmentations and noise mixing parameters
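One of the simpler knobs in that last bullet – noise mixing at a controlled SNR – can be sketched in a few lines. The function below is a generic textbook formulation (names and parameters are our own), useful for augmenting clean command recordings with recorded appliance or household noise at a target SNR such as the 5 dB level mentioned earlier.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise ratio (in dB)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve: speech_power / (scale^2 * noise_power) == 10^(snr_db / 10)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of surrogate "speech" at 16 kHz
noise = rng.standard_normal(16000)   # surrogate ambient noise
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

Sweeping `snr_db` during training exposes the model to the full range of conditions it will meet in the field, which is what makes the short-command engines robust in the noisy use cases reviewed above.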
Finally, once satisfactory model architecture, data acquisition, and training recipes were established, attention turned to implementation challenges:
- Code portability and maintainability
- High performance and high accuracy fixed point arithmetic
- Multi-platform optimizations
- API simplicity and usability
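To illustrate the fixed-point arithmetic bullet, here is a small sketch of Q15 multiplication, the 16-bit fractional format commonly used on audio DSPs. It is written in Python for readability; a real DSP kernel would use saturating hardware instructions, and the helper names here are our own.

```python
# Q15: signed 16-bit fixed point, 15 fractional bits; represents [-1.0, 1.0).
Q15_ONE = 1 << 15  # scale factor: 1.0 maps to 32768 (not representable itself)

def float_to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a signed 16-bit Q15 integer, with clamping."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))

def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values with rounding; the product is again Q15."""
    return (a * b + (1 << 14)) >> 15  # add half an LSB, then drop 15 bits

a = float_to_q15(0.5)    # 16384
b = float_to_q15(-0.25)  # -8192
prod = q15_mul(a, b)     # -4096, i.e. -0.125 in Q15
```

Getting rounding, saturation and scaling right at every node of a neural network, while matching the floating-point reference closely enough to preserve detection accuracy, is the "high performance and high accuracy fixed-point arithmetic" challenge in a nutshell.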
If you’d like to see the result of the process described above, as researched and implemented by CEVA, check out the WhisPro voice user interface – voice control technology.
You are welcome to contact CEVA to learn more.
Published on AudioXpress.