Whisper - Speech-to-Text Transformation
You’ve probably heard about ChatGPT by OpenAI, an AI-powered software that is “capable of generating human-like text based on context and past conversations”.
In September 2022, OpenAI published a new piece of software called Whisper. [1]
“Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web.”
Whisper not only recognizes English, German, or Spanish speech; it is capable of doing so in 96 languages. For most European languages, the recognition is astonishingly good. The quality depends on the amount of data the artificial intelligence model was trained on. Among the non-European languages are Chinese, Japanese, and Korean (all with a lot of training data), as well as Indonesian, Arabic, Tamil, Nepali, and Urdu (with much less training data). You can find a list of all languages on page 27 of their paper. [2]
While the outcome will not be perfect, especially in languages with little training data, it is still worth trying out. A typical use case is the transcription of interviews, which many researchers at the CATS work with.
And the beauty of it all: you can easily try it out on your own, without any programming.
How to test Whisper on your own:
One way is to use Google’s online service Colaboratory (Colab). There are two major advantages: 1) You do not need to install anything on your own machine; instead, you install Whisper in the virtual Colab environment, accessible via Google Drive. 2) You can also get free access to fast GPU units (T4), at least for testing. The main downside of this approach is that you need to upload your data to the platform and cannot be sure that your data will not be used by others. There are many easy-to-follow guides available, e.g. [3] (in German).
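To give you an idea of what such a Colab session involves, here is a minimal sketch of a single notebook cell; the file name interview.mp3 is just a placeholder for a recording you have uploaded to the session, and the model and language are of course up to you:

    # install Whisper in the Colab environment
    !pip install -U openai-whisper
    # transcribe an uploaded recording with the medium model
    !whisper "interview.mp3" --model medium --language German

Before running the cell, you can switch on the free T4 GPU via Colab’s runtime settings, which speeds things up considerably.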
More advanced users can also install the software on their own machine; it runs on Linux, macOS, and Windows.
Whisper runs MUCH better (i.e. faster) if your machine has one (or more) modern GPUs installed. Nevertheless, Whisper also works with "just" a CPU, only about 10 times slower. This may still be acceptable if you do not have much material or just want to run some tests. The German magazine c’t published an article in 2023 about installing Whisper locally [4]. There are many other guides and tutorials around.
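For those who prefer Python over the command line, here is a minimal sketch of what a local run looks like, assuming Whisper has been installed (pip install -U openai-whisper), ffmpeg is available, and interview.mp3 again stands in for your own recording:

    import whisper

    # Load one of the pretrained models; smaller ones ("base", "small") are
    # faster on a CPU, larger ones ("medium", "large") are more accurate.
    model = whisper.load_model("small")

    # Transcribe a recording. The language can be set explicitly (here German)
    # or omitted, in which case Whisper tries to detect it automatically.
    result = model.transcribe("interview.mp3", language="de")

    print(result["text"])

On the first run, the chosen model is downloaded automatically.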
Whisper was published under the MIT license, so other developers can re-use it within their own systems. One example is the AV-Portal of the TIB Hannover [5], the digital preservation platform for audio-visual material in Germany.
A number of researchers have already tested the software with Chinese, Tamil, and Nepali, and the feedback was consistently positive. Of course, if you wish to use the texts for systematic research, or plan to publish your research data later, you need to post-process the transcriptions. Typical tools for this are CAQDAS packages like MAXQDA and ATLAS.ti, or F4t/F4a.
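As a rough illustration of the starting point for such post-processing, the following sketch (continuing the Python example above, with a placeholder file name) writes the transcription into a plain text file with simple timestamps; how you clean it up afterwards depends on the tool you use:

    # Write each recognized segment with a rough [mm:ss] timestamp to a text file,
    # assuming "result" comes from model.transcribe() as in the sketch above.
    with open("interview_transcript.txt", "w", encoding="utf-8") as out:
        for segment in result["segments"]:
            minutes, seconds = divmod(int(segment["start"]), 60)
            out.write(f"[{minutes:02d}:{seconds:02d}] {segment['text'].strip()}\n")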
If speech-to-text transformation is of interest to you and you wish to learn more about it, please do not hesitate to contact me. Also, if you have already done some experiments with Whisper, please do let me know.
[1] https://openai.com/research/whisper
[2] Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv:2212.04356 [cs, eess], December 6, 2022. http://arxiv.org/abs/2212.04356. Cf. https://cdn.openai.com/papers/whisper.pdf
[3] https://www.youtube.com/watch?v=_9NmFamOrws
[4] Junghärtchen, Immo. “Spracherkennung und Transkription mit KI: Sprache in Text umwandeln mit Whisper | heise online.” c’t Magazin. heise.de, May 26, 2023. https://www.heise.de/ratgeber/Spracherkennung-und-Transkription-mit-KI-….