
A WER of 5-10% is considered to be good quality and is ready to use. A WER of 20% is acceptable, but you might want to consider additional training. A WER of 30% or more signals poor quality and requires customization and training.

How the errors are distributed is important. When many deletion errors are encountered, it's usually because of weak audio signal strength; to resolve this issue, collect audio data closer to the source. Insertion errors mean that the audio was recorded in a noisy environment and that crosstalk might be present, causing recognition issues. Substitution errors are often encountered when an insufficient sample of domain-specific terms has been provided as either human-labeled transcriptions or related text. By analyzing individual files, you can determine what type of errors exist and which errors are unique to a specific file. Understanding issues at the file level helps you target improvements.

If you want to test the quality of the Microsoft speech-to-text baseline model or a custom model that you've trained, you can compare two models side by side. The comparison includes WER and recognition results. A custom model is ordinarily compared with the Microsoft baseline model.

To evaluate models side by side, do the following: Select Speech-to-text > Custom Speech > Testing. Give the test a name and description, and then select your audio + human-labeled transcription dataset. Select up to two models that you want to test. After your test has been successfully created, you can compare the results side by side.

Side-by-side comparison

After the test is complete, as indicated by the status change to Succeeded, you'll find a WER number for both models included in your test. Select the test name to view the test details page. This page lists all the utterances in your dataset and the recognition results of the two models, alongside the transcription from the submitted dataset. To inspect the side-by-side comparison, you can toggle various error types, including insertion, deletion, and substitution. By listening to the audio and comparing recognition results in each column, which display the human-labeled transcription and the results for the two speech-to-text models, you can decide which model meets your needs and determine where additional training and improvements are required.
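The WER bands and error-distribution guidance above can be folded into a quick triage helper when you review per-file results. The sketch below is illustrative only: the function name, input format, exact band boundaries, and the "more than half the errors" rule are assumptions rather than anything the product provides. It takes the insertion (I), deletion (D), and substitution (S) totals and the reference word count (N) for a file.

```python
def triage(i: int, d: int, s: int, n: int) -> list[str]:
    """Summarize per-file error counts using the rough guidance described above."""
    wer = (i + d + s) / n * 100 if n else 0.0
    notes = [f"WER: {wer:.1f}%"]

    # Quality bands from the text; the exact cutoffs between bands are assumed.
    if wer <= 10:
        notes.append("Good quality; ready to use.")
    elif wer <= 20:
        notes.append("Acceptable, but consider additional training.")
    else:
        notes.append("Poor quality; customization and training required.")

    # Distribution heuristics: flag whichever error type dominates.
    total_errors = max(i + d + s, 1)
    if d / total_errors > 0.5:
        notes.append("Mostly deletions: weak audio signal is likely; "
                     "collect audio data closer to the source.")
    if i / total_errors > 0.5:
        notes.append("Mostly insertions: check for a noisy environment and crosstalk.")
    if s / total_errors > 0.5:
        notes.append("Mostly substitutions: provide more domain-specific terms as "
                     "human-labeled transcriptions or related text.")
    return notes


# Example: 120 reference words, errors dominated by deletions.
print(triage(i=2, d=14, s=4, n=120))
```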

You can use the WER calculation from the machine recognition results to evaluate the quality of the model you're using with your app, tool, or product. If you want to replicate WER measurements locally, you can use the sclite tool from the NIST Scoring Toolkit (SCTK).
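If you'd rather not install SCTK, the same word-level alignment can be approximated in a few lines of Python. This is a minimal sketch, not the sclite algorithm itself: it assumes whitespace tokenization and lowercasing, which may not match the text normalization the service applies.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Align reference and hypothesis words with dynamic programming and
    count substitutions (S), deletions (D), and insertions (I)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = (total errors, S, D, I) for ref[:i] versus hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)   # hypothesis exhausted: deletions only
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)   # reference exhausted: insertions only

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # words match, no new error
                continue
            sub_cost = dp[i - 1][j - 1][0]
            del_cost = dp[i - 1][j][0]
            ins_cost = dp[i][j - 1][0]
            if sub_cost <= del_cost and sub_cost <= ins_cost:
                e, s, d, ins = dp[i - 1][j - 1]
                dp[i][j] = (e + 1, s + 1, d, ins)
            elif del_cost <= ins_cost:
                e, s, d, ins = dp[i - 1][j]
                dp[i][j] = (e + 1, s, d + 1, ins)
            else:
                e, s, d, ins = dp[i][j - 1]
                dp[i][j] = (e + 1, s, d, ins + 1)

    errors, s, d, ins = dp[len(ref)][len(hyp)]
    n = len(ref)
    return {"S": s, "D": d, "I": ins, "N": n,
            "WER": errors / n * 100 if n else 0.0}


# One inserted word out of a four-word reference: WER = 1 / 4 * 100 = 25.0
print(wer_counts("the quick brown fox", "the quick brown fox jumps"))
```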

Audio + human-labeled transcription data is required to test accuracy, and 30 minutes to 5 hours of representative audio should be provided. The industry standard for measuring model accuracy is word error rate (WER). WER counts the number of incorrect words identified during recognition, divides that sum by the total number of words provided in the human-labeled transcript (shown in the following formula as N), and then multiplies the quotient by 100 to calculate the error rate as a percentage. Incorrectly identified words fall into three categories:

Insertion (I): Words that are incorrectly added in the hypothesis transcript.
Deletion (D): Words that are undetected in the hypothesis transcript.
Substitution (S): Words that were substituted between the reference and the hypothesis.
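The formula referenced above isn't reproduced in this copy; written out, the calculation it describes is:

WER = (I + D + S) / N × 100

where I, D, and S are the counts of insertions, deletions, and substitutions, and N is the total number of words in the human-labeled transcript.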


In this article, you learn how to quantitatively measure and improve the accuracy of the Microsoft speech-to-text model or your own custom models.
