

Transcription Accuracy and System Limitations

Speech-to-Text recognizes what's spoken in an audio input and generates transcription output. This requires proper setup for the language expected in the audio input and for the speaking style; non-optimal settings may lead to lower accuracy. The industry standard for measuring Speech-to-Text accuracy is Word Error Rate (WER). WER counts the number of incorrect words identified during recognition and divides it by the total number of words in the correct transcript (often created by human labeling). To understand the detailed WER calculation, please see the documentation on Speech-to-Text evaluation and improvement.
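To make the calculation concrete, here is a minimal sketch of a WER computation using word-level edit distance. The two transcripts are hypothetical examples, and the evaluation tooling referenced above performs this calculation for you.

```python
# Minimal WER sketch: (substitutions + deletions + insertions) / words in reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("sea" vs. "see") out of five reference words -> WER 0.2
print(word_error_rate("I want to see you", "I want to sea you"))
```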

Speech-to-Text uses a unified speech recognition machine learning model to understand what is spoken across a wide range of contexts and topic domains, such as command-and-control, dictation, and conversations. You therefore don't need to use different models for different application or feature scenarios. However, you do need to specify a language (or locale) for each audio input, and it must match the language actually spoken in the input audio. Please check the list of supported locales for more details.
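As an illustration, the following is a minimal sketch of specifying the input locale with the Speech SDK for Python; the subscription key, region, and audio file name are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"   # must match the spoken language

audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```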

There are many factors that can lead to lower transcription accuracy.

Acoustic quality: Speech-to-Text enabled applications and devices may use a wide variety of microphone types and specifications. The unified speech models have been built from various voice audio device scenarios, such as telephones, mobile phones, and speaker devices. However, voice quality can be degraded by the way users speak into a microphone, even a high-quality one. For example, if a user is too far from the microphone, the input level may be too low, while speaking too close to it can distort the audio. Both cases may adversely impact Speech-to-Text accuracy.

Non-speech noise: If the input audio contains a certain level of noise, accuracy suffers. Noise may come from the audio device used to make the recording, or the audio itself may contain background or environmental noise.

Overlapped speech: Multiple speakers may be within range of the audio input device, and they may speak at the same time. Other speakers may also talk in the background while the main user is speaking.

Vocabularies: The Speech-to-Text model knows a wide variety of words in many domains. However, users may speak company-specific terms and jargon that are "out of vocabulary"; if a word spoken in the audio doesn't exist in the model, it will be mis-transcribed. (A phrase-list sketch for such terms follows this list.)

Accents: Even within one locale (such as "English - United States"), many people speak with different accents, and very particular accents may also lead to mis-transcription.

Mismatched locales: Users may not speak the language you expect. If you specified English - United States (en-US) for an audio input but a user spoke Swedish, for instance, accuracy would be worse.

Due to these acoustic and linguistic variations, customers should expect a certain level of inaccuracy in the output text when designing an application.
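For out-of-vocabulary terms, one option is the Speech SDK's phrase list feature, which hints specific words or phrases to the recognizer at runtime. The sketch below assumes the Python SDK; the phrases and credentials are placeholders, not part of the guidance above.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"

# Uses the default microphone when no audio_config is given.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Phrases added here bias recognition toward company-specific terms and jargon.
phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
phrase_list.addPhrase("Contoso")        # placeholder company name
phrase_list.addPhrase("Jessie Irwin")   # placeholder proper noun

result = recognizer.recognize_once()
print(result.text)
```

Phrase lists are a lightweight runtime hint and don't require switching to a different model.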

Best practices to improve system accuracy

As discussed above, acoustic conditions such as background noise, side speech, distance to the microphone, and speaking styles and characteristics can adversely affect the accuracy of what is recognized. Please consider the following application and service design principles for better speech experiences.

Design UIs to match input locales: Mismatched locales will degrade accuracy. The Speech SDK supports automatic language detection, but it only detects one of the four candidate locales specified at runtime, so you still need to understand which locale your users will speak. Your user interface should clearly indicate which languages users may speak, for example through a drop-down list of the allowed languages.
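The sketch below illustrates automatic language detection with the Speech SDK for Python under the constraint described above: the detector only chooses among the candidate locales you supply. The locales, credentials, and file name are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# Candidate locales the detector chooses from; the recognizer picks one of these.
auto_detect_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "sv-SE", "de-DE", "ja-JP"])

audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    auto_detect_source_language_config=auto_detect_config,
    audio_config=audio_config)

result = recognizer.recognize_once()
detected = speechsdk.AutoDetectSourceLanguageResult(result).language
print(f"Detected locale: {detected}")
print(result.text)
```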
