Machine learning requires large quantities of labeled training data. That means that, to reach acceptable performance, training a current speech recognition system demands thousands of hours of transcribed speech. For almost every one of the more than 7,000 languages spoken worldwide, unlabeled data in the form of untranscribed audio files is easier, faster, and less costly to obtain.
Self-supervised learning as a solution
A solution to this challenge could be wav2vec 2.0, a framework for self-supervised learning of representations from raw audio data. With this approach, the amount of labeled data needed was lowered to just one hour, 100 times less than what comparable semi-supervised models use, and yet the model outperformed the previous state-of-the-art self-training methods. According to the researchers, it is also conceptually simpler. Meta released the pre-trained models and code in 2020.
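As an illustration, here is a minimal sketch of using one of those released checkpoints as a frozen feature extractor via the HuggingFace transformers library. The checkpoint name facebook/wav2vec2-base and the dummy audio below are assumptions made for the example, not part of any particular pipeline described in this post.

```python
# A minimal sketch of extracting wav2vec 2.0 representations from raw audio
# with HuggingFace transformers. "facebook/wav2vec2-base" is one of the
# checkpoints Meta released; substitute whichever checkpoint fits your task.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# wav2vec 2.0 expects a 1-D waveform sampled at 16 kHz; one second of
# random samples stands in for a real recording here.
audio = torch.randn(16000)
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, 768)
print(hidden_states.shape)
```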
LeBenchmark and its advantages
In our work on the LeBenchmark initiative, we want to measure how much pre-training such a model on French data matters when the final task is in French. To do that, we pre-train wav2vec 2.0 models in French. Note that this requires a lot of data and a lot of computational power: we collected and prepared ~7,000 hours of French speech audio, and we ran the computation on a national supercomputer, Jean Zay.
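To make the data preparation side of this concrete, here is a hedged sketch of building the kind of tsv manifest fairseq's wav2vec examples use for pre-training (root directory on the first line, then one "relative path, number of samples" entry per file). The corpus path is a placeholder; this is not our exact preprocessing code.

```python
# A sketch of writing a fairseq-style pre-training manifest for a folder of
# wav files. /data/french_speech is a hypothetical corpus location.
import os
import soundfile as sf

root = "/data/french_speech"
with open("train.tsv", "w") as manifest:
    manifest.write(root + "\n")  # first line: the corpus root directory
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            frames = sf.info(path).frames  # number of audio samples
            manifest.write(f"{os.path.relpath(path, root)}\t{frames}\n")
```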
We compare the performance of our French models to the LS960 model, pre-trained by Meta on English data, and to the XLSR-53 model, pre-trained on 53 languages, including French. Both are available here: https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md
For various speech processing tasks in French, such as speech recognition, speech translation, spoken language understanding, and speech emotion recognition, the LeBenchmark models pre-trained on French perform substantially better than the model pre-trained on English (LS960). Given that the LS960 model in turn performs better than the XLSR-53 model, this makes the LeBenchmark models the strongest of the three.
What comes next?
To give the broader public access and the ability to easily develop speech processing models for various tasks in French, including speech recognition, the LeBenchmark models are freely distributed on HuggingFace: https://huggingface.co/LeBenchmark
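For example, a model can be pulled straight from that hub and used to featurize French audio. The model id LeBenchmark/wav2vec2-FR-7K-large below is, to our knowledge, one of the published checkpoints at the time of writing; check the hub page for the current list.

```python
# A minimal sketch of loading a LeBenchmark checkpoint from HuggingFace and
# extracting features from French speech. The model id is an assumption;
# browse https://huggingface.co/LeBenchmark for the available checkpoints.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_id = "LeBenchmark/wav2vec2-FR-7K-large"
# If a repo does not ship a preprocessor config, a generic 16 kHz front end
# can be built directly with Wav2Vec2FeatureExtractor(sampling_rate=16000).
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id)

audio = torch.randn(16000)  # replace with a real 16 kHz French recording
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, frames, 1024) for large
```

These frozen features can then feed a lightweight task-specific head, which is the usual way such pre-trained models are reused for downstream French tasks.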
LIA plans to pre-train models in several other languages using Deutsche Welle data. In other words, we are preparing the SELMA wav2vec 2.0 models.
Image courtesy: Patrick Tomasso on Unsplash