OpenAI’s Artificial Intelligence Speech Recognition Model Whisper Explained and Used

1 whisper Introduction

OpenAI, the company that owns the ChatGPT language model, has open-sourced the Whisper automated speech recognition system, and OpenAI emphasizes that Whisper’s speech recognition ability has reached the human level.

Whisper is a general-purpose speech recognition model trained using a large amount of multilingual and multi-task supervised data, capable of achieving near-human level robustness and accuracy in English speech recognition.Whisper also performs tasks such as multilingual speech recognition, speech translation, and linguistic recognition.The architecture of Whisper is a simple end-to-end approach that employs the Transformer model of an encoder-decoder that converts the input audio into corresponding text sequences and specifies different tasks based on special tokens.

OpenAI trained Whisper, an ASR (Automatic Speech Recognition) system, by collecting 680,000 hours of multilingual (98 languages) and multitask supervised data from the web.OpenAI believes that using such a OpenAI believes that using such a large and varied dataset will improve the ability to recognize accents, background noise, and technical terms. In addition to being used for speech recognition, Whisper is capable of transcribing multiple languages and translating them into English. openAI is opening up the model and inference code in the hope that developers can use Whisper as a basis for building useful applications and further research into speech processing technology.

Code Address:code address

2 whisper Model

OpenAI's Artificial Intelligence Speech Recognition Model Whisper Explained and Used

The process by which Whisper performs an operation:

The input audio is segmented into 30-second segments, converted to log-Mel spectrograms, and passed to an encoder. The decoder is trained to predict the corresponding textual descriptions and mixes them with special tokens that guide a single model in performing tasks such as language recognition, phrase-level timestamping, multilingual speech transcription, and speech translation.

Compared to other existing methods currently on the market, they typically use smaller, more tightly paired ‘audio – text’ training datasets, or extensive but unsupervised audio pre-training sets. Because Whisper is trained on a large and diverse dataset without being fine-tuned for any particular dataset, while it does not beat models that specialize in LibriSpeech performance (the famous speech recognition benchmark test), measuring Whisper’s Zero-shot (no re-training on a new dataset, and (which gives good results without retraining on a new dataset) on many different datasets, the researchers found that it was much more robust than those models, making 50 percent fewer errors.

3 Multiple models of whisper

The whisper is available in five model sizes, providing a balance of speed and accuracy, with the English-only model offering four options. Below are the names of the available models, approximate memory requirements, and relative speeds.

OpenAI's Artificial Intelligence Speech Recognition Model Whisper Explained and Used

The official download address for the model:

"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
"medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
"medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
"large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
"large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
"large": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",

4 Installation of the runtime environment and use of whisper

(1) Conda environment installation

(2) Whisper environment construction

conda create -n whisper python==3.9
conda activate whisper
pip install openai-whisper
conda install ffmpeg
pip install setuptools-rust

(3) Whisper use

whisper /opt/000001.wav --model base

The output is as follows:

[00:00.000 --> 00:02.560] Artificial Intelligence Recognition System.

When the command is executed, the model download is performed automatically, and the path where the automatically downloaded model is stored is as follows:

~/.cache/whisper

It is also possible to formulate a local model run from the command line:

Whisper /opt/000001.wav --model base --model_dir /opt/models --language Chinese

Supported file formats: m4a, mp3, mp4, mpeg, mpga, wav, webm

(4) Use in code

import whisper

model = whisper.load_model("base")
result = model.transcribe("/opt/000001.wav")
print(result["text"])