This is a research draft: while its aim is to benchmark APIs, it also mentions tools that have not been benchmarked. So if you feel some tools would be a better fit, please reach out!

  • api.ai: untested -> paid plan required
  • Amazon Voice Service: untested -> German model unavailable (English only)
  • Nuance ASR: tested
  • Google Voice Service: test planned -> waiting for the use request approval
  • Microsoft Cognitive Services (formerly Project Oxford) Speech To Text API: untested -> unless you use their SDK (iOS, Android, C#), you cannot stream to their service; you can only use a REST API with no partial result streaming.

From use cases to audio files

Should you want to replicate this little test, you need a few use cases to assess the variety of domains supported by each API. I focused on what my use cases were at the time: problem descriptions, with audio of varying length (15 to 256s, 6 use cases).

We used the following sequence for experiment purposes:

  • Delivered an audio file with recorded search phrases to external services
  • Received recognized text from automatic speech recognition service
  • Evaluated quality metrics of recognized text vs. actual search phrase

Converting audio files

Here I used the sox CLI tool, which stands for Sound eXchange. I actually just needed to convert a stereo 44.1kHz floating-point mp3 file to 16kHz, merge it into a single mono file, convert that mono file to signed PCM, run some filters on the raw files, and convert them back to mono WAVs (those are api.ai's requirements). To accomplish this, I used the following commands, which were neither well documented nor correctly described by most of the blogs I've read today, even five years later¹.

First, if you don’t know what your audio files are made of, use soxi:

sox command #1
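As a minimal sketch of that inspection step, assuming a local Example1.mp3 (the guard keeps the snippet runnable on machines where sox/soxi is not installed):

```shell
# soxi prints channels, sample rate, precision and encoding for an audio file.
if command -v soxi >/dev/null 2>&1; then
  info=$(soxi Example1.mp3 2>&1 || true)
else
  info="soxi not installed here"
fi
echo "$info"
```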

Then, the basics. The following converts to WAV, only keeps one channel (channel 1, thus converting to mono if it wasn’t already) and resamples to a rate of 16kHz (16000Hz), using sox effects.

sox Example1.mp3 Example1.wav channels 1 rate 16k  

But the signed PCM² encoding is still missing:

sox Example1.mp3 -e signed-integer Example1.wav channels 1 rate 16k  

Problem #1: we have to take into account more information-rich files (especially stereo, even if that is unlikely with phone or laptop microphones). Here we keep the stereo information by doing a mix-down of both channels, averaging them (the old avg effect has been removed from recent sox versions; -c 1 now performs the mix-down):

sox stereo.wav -c 1 mono.wav  

Problem #2: we have to reduce clipping. Clipping is distortion that occurs when an audio signal level (or ‘volume’) exceeds the range of the chosen representation. In most cases, clipping is undesirable and so should be corrected by adjusting the level prior to the point (in the processing chain) at which it occurs.

In SoX, clipping could occur, as you might expect, when using the vol or gain effects to increase the audio volume. Clipping could also occur with many other effects, when converting one format to another, and even when simply playing the audio.

For these reasons, it is usual to make sure that an audio file’s signal level has some ‘headroom’, i.e. it does not exceed a particular level below the maximum possible level for the given representation. Some standards bodies recommend as much as 9dB headroom, but in most cases, 3dB (≈ 70% linear) is enough. Note that this wisdom seems to have been lost in modern music production; in fact, many CDs, MP3s, etc. are now mastered at levels above 0dBFS i.e. the audio is clipped as delivered³.

All of that can be fine-tuned by hand, and I tried a few tweaks that may prove useful on other, more complicated examples. But as my dataset was quite simple, and sox has a neat -G option to do it automagically, I used the latter.
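For illustration, here is the manual alternative next to the -G (--guard) option: an explicit gain effect leaving about 3dB of headroom before the conversion. The commands are built as strings and echoed so the snippet runs even without sox installed; file names are placeholders:

```shell
# Automatic clipping guard: -G (--guard) manages headroom for you.
cmd_auto="sox -G Example1.mp3 Example1.wav channels 1 rate 16k"
# Manual alternative: drop the level by ~3dB before any effect that could clip.
cmd_manual="sox Example1.mp3 Example1.wav gain -3 channels 1 rate 16k"
echo "$cmd_auto"
echo "$cmd_manual"
```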

Problem #3: dither. Same story: sox does a lot, and dither awaits around the corner. Apply dither whenever reducing the bit depth, to mitigate the bad effects of quantization error, with the dither sox effect.

All of that to obtain a good input audio file:

sox command #2
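Roughly, the whole pipeline boils down to a single command, sketched here inside a helper function with a DRY_RUN switch (so it runs without sox installed); the exact effect order may need tweaking for your own files:

```shell
# Convert any input to a 16kHz, 16-bit signed PCM, mono WAV, with the
# clipping guard (-G) and dither applied on the bit-depth reduction.
convert_for_asr() {
  local in="$1" out="$2"
  local cmd="sox -G \"$in\" -e signed-integer -b 16 \"$out\" channels 1 rate 16k dither"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"   # just show what would run
  else
    eval "$cmd"
  fi
}

DRY_RUN=1 convert_for_asr Example1.mp3 Example1.wav
```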

For more wizardry with sox, I recommend reading the documentation of course (how could I avoid saying RTFM once more?), but for a TL;DR, that rather old article from thegeekstuff.com covers the basics.

Querying API.AI

Starting with api.ai, we needed to query the API with each of the use cases.

Excerpt from the api.ai documentation

In my case I wanted to benchmark the APIs for German; English would have been too easy. So just make sure to set the proper "lang" : "de" or whatever fits your needs.

For the other parameters, see the short table in the documentation.

Here is our request in the scope of this benchmark:

curl -k -F "request={'timezone':'Europe/Berlin', 'lang':'de'};type=application/json" -F "voiceData=@Example1.wav;type=audio/wav" -H "Authorization: Bearer YOUR_ACCESS_TOKEN" "https://api.api.ai/v1/query?v=20150910"  

One caveat: make sure your account is at least on the STARTUP or STANDARD plan (German requires custom models).

api.ai price grid

Querying Nuance

At last, a German-enabled ASR API! A request saying Hello world looks roughly like this:

https://dictation.nuancemobility.net/NMDPAsrCmdServlet/dictation?appId=NMAID_FOO&appKey=525348e77144a9cee9a7471a8b67c50ea85b9e3eb377a3c2a3a23dc88f9150eefe76e6a339fdbc62b817595f53d72549d9ebe36438f8c2619846b963e9f43a93&id=57349abd2390 HTTP/1.1  
Transfer-Encoding: chunked  
Content-Type: audio/x-pcm;bit=16;rate=8000  
Accept: text/plain  
Accept-Language: en-US  
... audio content ...

And triggers the following answer:

HTTP/1.1 200 OK  
Date: Tue, 31 Aug 2010 22:50:35 GMT  
Content-Type: text/plain;charset=utf-8  
Content-Language: en-US  
Content-Length: 11  
x-nuance-sessionid: 97bd6505-b7d6-420a-8eb7-7583036f7aa1  
Hello world  

So far so good. We want Accept-Language: deu-DEU, as we aim for German. And since this is the audio we have, let’s set Content-Type: audio/x-wav;codec=pcm;bit=16;rate=16000.

More specifics worth knowing are explained in the Nuance documentation³, especially the X-Dictation-AudioSource header.

Let’s create a test request:

curl -X POST \
    --header "Content-Type: audio/x-wav;codec=pcm;bit=16;rate=16000" \
    --header "Accept: text/plain" \
    --header "Accept-Topic: Dictation" \
    --header "Accept-Language: deu-DEU" \
    --header "X-Dictation-NBestListSize: 1" \
    --data-binary @Example1.wav \
    "https://dictation.nuancemobility.net:443/NMDPAsrCmdServlet/dictation?appId=<APP_ID>&appKey=<APP_KEY>"

The API returns:

War ich am Problem ich hab das Windows den Upgrade auf meinen Rechner und WLAN dem Arbeiten keinerlei Netzwerkverbindung an im Internet Google komme gerne weiter.

Querying Google

Example of API usage:

curl -X POST \
    --header 'Content-Type: audio/x-wav; rate=16000;' \
    --data-binary @Example1.wav \
    'https://www.google.com/speech-api/v2/recognize?lang=en-us&key=<KEY>'

Google Speech API is not production-ready. As they advertise it, it is still in alpha stage:

  • No pricing yet
  • Experimental status: the API can change at any time
  • No official API documentation or usage guidelines
  • Limit of approximately 500 requests per day, per account
  • You need to join the chromium-dev mailing list and generate an appropriate key in the Google Developer Console

Evaluation/Metrics

We used multiple quality metrics, such as:

  • Number of exactly recognized phrases
    • Simple, but a paramount quality metric
    • The more phrases recognized exactly, the better the speech recognition results
  • Word Error Rate (WER)
    • Minimum number of word edits (i.e., insertions, deletions or substitutions) required to change one phrase into the other
    • Normalized by phrase length (essentially the Levenshtein distance between two phrases, computed at the word level instead of the character level)
    • The fewer edits required, the more similar the phrases, and the better the speech recognition quality
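As an illustration, this word-level edit distance can be sketched in a few lines of awk. This is a hedged sketch, not the evaluation code actually used for the benchmark; the wer helper below takes the reference phrase and the hypothesis phrase as its two arguments:

```shell
# Word Error Rate: word-level Levenshtein distance divided by reference length.
wer() {
  awk -v ref="$1" -v hyp="$2" '
  BEGIN {
    n = split(ref, r, /[ \t]+/)
    m = split(hyp, h, /[ \t]+/)
    for (i = 0; i <= n; i++) d[i, 0] = i
    for (j = 0; j <= m; j++) d[0, j] = j
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        cost = (r[i] == h[j]) ? 0 : 1
        best = d[i-1, j] + 1                                 # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1       # insertion
        if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost  # substitution
        d[i, j] = best
      }
    printf "%.2f\n", d[n, m] / n
  }'
}

# One substitution (world -> word) and one deletion (are) over 5 words:
wer "hello world how are you" "hello word how you"   # → 0.40
```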

Exact phrase match and word error rate cover only two of the challenges in providing the world-class voice search your customers will soon expect. Additional challenges are raw speech recognition performance and recognizing support-specific terms.


For comparison with English, results will be compared to those of a three-month-old study of almost the same services, based on the same metrics but on the eCommerce domain and English audio:

Word Error Rate (less is better):
wer
Percentage of Exact Recognized Phrases (more is better):
exact_phrases

Google comes out by far as the out-of-the-box leader. Maybe tweaks are needed (or available) for the other services? Also, Google claims an 8% WER while 15% was attained here, so there is certainly tuning to do. But let’s see how all of that behaves with German.