Monday, June 21, 2021

How to do speech recognition? – Towards Data Science

Speech recognition is the task of detecting spoken words. There are many techniques for doing speech recognition. In this post, we will go through the background required for speech recognition and use a basic technique to build a speech recognition model. The code is available on GitHub. For the techniques mentioned in this post, check out this Jupyter Notebook.

"Active coal Google Home Mini and smartphone" by Bence ▲ Boros on Unsplash

Some background on audio processing

Let's take a step back and understand what audio actually is. We listen to music on our computers and phones, usually in mp3 format. But the .mp3 file is not the actual audio; it's a way to represent audio on our computers. We do not open .mp3 files and read them directly (the way we read .txt files in a notepad). We use applications to open those .mp3 files, and those applications understand what a .mp3 file is and how to play it. These mp3 files encode (represent) the audio.

Audio is represented as waves. Generally, these waves have 2 axes: time on the x-axis and amplitude on the y-axis. So at every moment t, we have a value for the amplitude.

Sinusoidal wave – A simple audio wave (source)

You can listen to a simple sine wave here. Great! Now we just need to understand how to use these audio files in our code to perform the recognition.

Using audio files

We will use the Waveform Audio File Format, or .wav files. So, how do we read these .wav files? Enter librosa – a Python package that allows us to read .wav files. What do we get after reading a .wav file? We get an array of numbers. This is the output I got after reading a sound file 1 second long.

array([ 0.0007143 ,  0.00551732,  0.01469251, ..., -0.00261393, -0.00326245, -0.00220675], dtype=float32)

What do these numbers mean? Remember that I said the audio is represented as a wave with two axes. These values represent the y-axis of that wave, that is, the amplitude. So, how is the time on the x-axis represented? It's the length of the array! So for 1 second of audio, the length should be 1000 (for 1000 milliseconds). But the length of this array is actually 22050. Where does that come from?
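The relationship between duration, sampling rate, and array length can be sketched in a few lines. This snippet generates the signal with numpy instead of loading a file, so it is self-contained; with librosa the equivalent would be `y, sr = librosa.load("clip.wav")`:

```python
import numpy as np

sr = 22050           # librosa's default sampling rate (samples per second)
duration = 1.0       # length of the clip in seconds
t = np.arange(int(sr * duration)) / sr
y = np.sin(2 * np.pi * 440.0 * t)   # 1 second of a 440 Hz sine wave

# One value per sample: a 1-second clip holds sr samples, not 1000
print(len(y))        # 22050
```

The array length is always `sr * duration`, which is exactly why the 1-second file above came back with 22050 values.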

Sampling rate

Consider a 5-second audio clip. If it is analog, it has an amplitude value at every instant of time, that is, a value for every picosecond. So a 5-second audio clip has 5e+12, or 5,000,000,000,000, values. Consider storing that on a computer. It takes 4 bytes in C to store a float value, so that is 5e+12 * 4 bytes: about 18 terabytes of data, just for a 5-second audio clip!

Analog or digital audio signal (source)

We do not want to use 18 TB just to store a 5-second clip, so we convert it into a discrete form. To do that, we record samples (that is, amplitude values) at fixed time steps. For a 5-second audio clip, we could record a sample every 1 second. That is only 5 values (samples)! The rate at which we record them is called the sampling rate.

Sampling frequency (source)

Formally, the sampling rate is the number of samples collected per second, with the samples spaced at equal intervals in time. For the previous example, the sampling rate is 1, i.e. 1 sample per second. You may have noticed that there is a lot of information loss; this is the trade-off in converting from continuous (analog) to discrete (digital). The sampling rate should be as high as possible to reduce the loss of information.

So why did we get an array of length 22050? librosa uses a default sampling rate of 22050 if nothing is specified. You're wondering, why 22050? It is tied to the range of human hearing: humans can hear frequencies from 20 Hz to 20 kHz, and by the Nyquist theorem a signal must be sampled at at least twice its highest frequency. That is why 44100 Hz, or 44.1 kHz (just above 2 × 20 kHz), is the most common sampling rate, and 22050 is exactly half of it.

Also, note that we obtained a 1D array and not a 2D array. This is because the .wav file I used was mono audio and not stereo. What is the difference? Mono audio has only one channel, while stereo has 2 or more. What is a channel? In simple terms, it is an independent stream of audio samples, typically one per microphone used to record or one per speaker used to play back. Record a conversation with 1 microphone and you get 1 channel; record the same scene with 2 microphones placed apart and you get 2 channels, which is what gives stereo its sense of direction.

Generally, we convert stereo audio to mono before using it in audio processing. Again, librosa helps us do this: we simply pass the parameter mono=True when loading the .wav file, and it converts any stereo audio into mono for us.
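Under the hood, the stereo-to-mono conversion is just an average of the channels. A minimal sketch, using a random array as a stand-in for a real stereo recording:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical stereo clip: 2 channels of 1000 samples each
stereo = rng.standard_normal((2, 1000))

# Averaging across the channel axis collapses stereo to mono,
# which is essentially what librosa does for you
mono = stereo.mean(axis=0)
print(mono.shape)   # (1000,)
```

With librosa this happens in one call: `y, sr = librosa.load("clip.wav", mono=True)` returns the already-averaged 1D array.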

Features for audio recognition

We could use the time-domain signal above directly as features, but it requires a lot of storage because the sampling rate has to be quite high. Another way to represent these audio signals is in the frequency domain, using the Fourier transform. Stated in simple terms, the Fourier transform is a tool that lets us convert our signal from the time domain to the frequency domain. A signal in the frequency domain requires much less storage space. From Wikipedia,

In mathematics, a Fourier series is a way to represent a function as the sum of simple sine waves. More formally, it decomposes any periodic function or periodic signal into the sum of a (possibly infinite) set of simple oscillating functions, i.e. sines and cosines.

In simple terms, any audio signal can be represented as a sum of sine and cosine waves.

A time domain signal represented as the sum of 3 sinusoidal waves. (Source)

In the figure above, the time-domain signal is represented as the sum of 3 sinusoidal waves. How does this reduce storage space? Consider how a sinusoidal wave is represented.

The mathematical representation of the sine wave. (Source)

Since the signal is represented as 3 sine waves, we only need the parameters of those 3 waves (their amplitudes, frequencies, and phases) to represent the signal.
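We can see this collapse with numpy's FFT: a signal built from 3 sine waves shows exactly 3 spikes in the frequency domain (the frequencies and amplitudes here are made up for illustration):

```python
import numpy as np

sr = 1000                       # 1000 samples over 1 second
t = np.arange(sr) / sr
# A signal that is the sum of 3 sine waves: 50, 120, and 300 Hz
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.5 * np.sin(2 * np.pi * 120 * t)
          + 0.25 * np.sin(2 * np.pi * 300 * t))

spectrum = np.abs(np.fft.rfft(signal))       # magnitude of each frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# Only 3 frequency bins carry any real energy
print(freqs[spectrum > 1])      # [ 50. 120. 300.]
```

A thousand time-domain samples reduce to three (frequency, amplitude) pairs, which is where the storage saving comes from.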

Mel-frequency cepstral coefficients (MFCC)

Our voice/sound depends on the shape of our vocal tract, including the tongue, teeth, etc. If we can determine this shape precisely, we can recognize the word/character being said. MFCCs are a representation of the short-term power spectrum of a sound, which in simple terms captures the shape of the vocal tract. You can read more about MFCCs here.


Spectrograms

Spectrograms are another way of representing the audio signal. They convey three-dimensional information in 2 dimensions (2D spectrograms): time on the x-axis and frequency on the y-axis. The amplitude of a particular frequency at a particular time is represented by the intensity of the color at that point.

Waveform and corresponding spectrogram for a word pronounced "yes". (Source)
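To make this concrete, here is a minimal magnitude spectrogram built with a short-time FFT in plain numpy; the frame size and hop length are illustrative, and library routines such as librosa.stft do this (with more refinements) for you:

```python
import numpy as np

def spectrogram(y, n_fft=256, hop=128):
    """Magnitude spectrogram (frequency x time) via a short-time FFT."""
    window = np.hanning(n_fft)
    # Slice the signal into overlapping windowed frames
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft + 1, hop)]
    # Each column of the result is the magnitude spectrum of one frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)   # 1 second of a 440 Hz tone
S = spectrogram(y)
print(S.shape)                    # (129, 61): 129 frequency bins, 61 time frames
```

Plotting `S` with color mapped to magnitude gives exactly the kind of image shown above, and those images are what get fed to the CNN later in this post.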

Overview of the approach

For the .wav files, I used a subset of the training data from the Kaggle TensorFlow Speech Recognition Challenge. Google Colaboratory is used for training; it provides free use of a GPU for 12 hours. It is not very fast, but good enough for this project.

The audio files are sampled at a 16,000 Hz sampling rate. Spectrograms are used to recognize the speech commands: I wrote a small script to convert the .wav files into spectrograms, and the spectrogram images are fed into a convolutional neural network. Transfer learning is performed on a ResNet34 pretrained on ImageNet. PyTorch is used to code this project.

Stochastic gradient descent with restart (SGDR)

SGDR uses cosine annealing as the technique for annealing the learning rate while training the model. The learning rate is reduced at each iteration (not each epoch) of gradient descent, and after a cycle completes, the learning rate is reset to the initial learning rate. This helps achieve better generalization.

The idea is: if the model is at a local minimum where a slight change in the parameters changes the loss a lot, it is not a good local minimum. By resetting the learning rate, we allow the model to find better local minima in the search space.

SGDR for 3 cycles

In the image above, a cycle consists of 100 iterations, and the learning rate is reset after each cycle. Within each cycle, we gradually decrease the learning rate, which allows the model to settle into a local minimum. By resetting the learning rate at the end of a cycle, we check whether that local minimum is good or bad. If it is good, the model will settle into the same local minimum at the end of the next cycle; if it is bad, the model will converge to a different local minimum. We can also change the length of the cycle, which lets the model dive deeper into a local minimum, reducing the loss.
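The schedule in the figure can be written down directly. A minimal sketch of the cosine-annealed restart, with illustrative cycle length and learning rates (PyTorch ships a ready-made version as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts):

```python
import math

def sgdr_lr(iteration, lr_max=0.1, lr_min=0.001, cycle_len=100):
    """Cosine-annealed learning rate that restarts every cycle_len iterations."""
    t = iteration % cycle_len   # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

print(sgdr_lr(0))     # 0.1: start of a cycle, the full learning rate
print(sgdr_lr(99))    # near lr_min: end of the cycle
print(sgdr_lr(100))   # 0.1 again: the restart
```

Each cycle decays the rate smoothly from lr_max toward lr_min, and the modulo produces the sawtooth-of-cosines shape shown for the 3 cycles above.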

Snapshot Ensembling

It is a technique used together with SGDR. The basic idea of ensembling is to train more than one model for a given task and average their predictions. Different models give different predictions for the same input, so if one model gives a wrong prediction, another may give the correct one.

Snapshot Ensembling (Source)

In SGDR, we perform ensembling with the help of the cycles. Each local minimum has a different loss value and gives different predictions for the data. When we do SGDR, we jump from one local minimum to another to find the optimal minimum at the end, but the predictions from the other local minima may also be useful. So we save a snapshot of the model parameters at the end of each cycle, and when making predictions, we feed the input to each snapshot and average their predictions.
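The averaging step itself is simple. A toy sketch with numpy, where random weight matrices stand in for the snapshots saved at the end of each SGDR cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical snapshots: one weight matrix saved per SGDR cycle
snapshots = [rng.standard_normal((4, 3)) for _ in range(3)]

def predict(weights, x):
    """Toy linear model followed by a softmax over 3 classes."""
    logits = x @ weights
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.standard_normal(4)                  # one input example
# The ensemble prediction is the mean of every snapshot's prediction
ensemble = np.mean([predict(w, x) for w in snapshots], axis=0)
print(ensemble.sum())                       # 1.0: still a valid distribution
```

In the real project the snapshots would be saved `state_dict`s of the CNN, but the ensembling logic (predict with each, then average) is the same.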

Settings changed to reduce training time

Training is done on Google Colab, which provides a Tesla K80 GPU that is good enough for this task. One iteration of gradient descent takes about 1.5-2 seconds on this GPU. But with the default setup, it takes about 80 minutes to train for a single epoch! This is because, by default, it is not possible to use more than 1 worker in PyTorch data loaders on Colab. If you try, PyTorch throws an error, abruptly interrupting the training.

But why does it take 80 minutes? Because the task of preparing the next batch is done on the CPU, while only gradient descent and the weight updates are done on the GPU. When the weight updates are complete, the GPU sits idle, waiting for the next batch. So in this case, the CPU is almost always busy and the GPU is idle.

When the num_workers parameter is specified in the data loader, PyTorch uses multiprocessing to generate the batches in parallel. This removes the bottleneck and ensures that the GPU is used properly.
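Enabling this is a one-argument change on the DataLoader. A sketch with a toy dataset standing in for the spectrogram images (the shapes and worker count here are illustrative, not the post's exact configuration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the spectrogram dataset: 256 fake "images" and labels
dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                        torch.randint(0, 10, (256,)))

# num_workers > 0 spawns worker processes that prepare batches on the CPU
# in parallel, so the GPU is not left waiting on a single loader process
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

for images, labels in loader:
    pass   # the training step (forward, loss, backward) would go here
```

How many workers to actually use is a trade-off, which the experiments below explore.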

How do we do this on Google Colab? Google Colab runs on a Linux system, and most Linux systems have a temporary partition called /dev/shm. This partition is used by processes as shared memory. It is virtual memory, meaning it does not reside on the HDD but in RAM. PyTorch uses this partition to stage batches for the GPU.

Google Colab, by default, gives this partition a size of 64 MB, which is far too small to use a sufficient number of workers. If we try to use num_workers anyway, at some point during training this partition will overflow and PyTorch will throw an error. The solution is to increase the size of the partition. After increasing it, we can use many workers to load the data. But how many workers should we use?

It might seem that using as many workers as possible is best. I did some experiments with different sizes of /dev/shm and different values of num_workers. Here are the results.

It turns out that using 64 workers is not the best option. Why do we get these results? When we specify a value for num_workers in our data loader, PyTorch fills that many workers with batches before starting the training. So when we specify num_workers=64, PyTorch fills 64 workers with batches, and this process alone takes 2.5-3 minutes. These batches are then consumed by our model: the model updates its weights on them and waits for the next group of batches, which takes only about 3-5 seconds. Meanwhile, the CPU is preparing the next group of batches, and Google Colab has only one CPU. So after updating the weights, the GPU is again idle, waiting for the CPU: another wait of about 2 minutes. This process continues, which is why training still took about 10 minutes when using a very large number of workers.

Thus, in selecting the number of workers, there is a trade-off between the time the model needs to update its weights and the time the CPU needs to generate the next group of batches, and we have to choose num_workers with both in mind. By choosing 8 workers, we can reduce the training time by 96%. You can check this tweak in this Jupyter Notebook.


After all this hassle, I was finally able to train my model. The model achieved an accuracy of 90.4%. This result can be improved with different techniques. Some of them are:

  • Data augmentation – I have not used any data augmentation on my data. There are many augmentations for audio data, such as time shifting, speed tuning, etc. You can find more information on audio data augmentation here.
  • Combining Mel spectrograms + MFCC – The current model makes predictions based only on the spectrograms. The CNN performs feature extraction, and the classifier (the fully connected layer) finds the optimal hyperplane from the CNN's output features. Along with these features, we can also feed the MFCC coefficients to the classifier. This increases the number of features, but the MFCCs give the classifier more information about the audio file, which should help improve accuracy. Adequate regularization will be necessary to avoid overfitting.
  • Use a different type of network – As we have seen, audio data has a temporal dimension, so we can use RNNs for these cases. In fact, for speech recognition tasks, there are approaches combining CNNs and RNNs that produce better results than using a CNN alone.

