· 4 min read

Whisper by OpenAI: Finally Good Open-Source Speech Recognition

whisper · speech · tools

OpenAI quietly dropped Whisper in September 2022, and I think it might be the most practically useful thing they've released. More useful than GPT-3 for most developers. I know that's a bold claim. Let me explain.

Before Whisper, open-source speech recognition was rough. You had DeepSpeech, which Mozilla abandoned. You had Wav2Vec 2.0, which was impressive but painful to set up for production use. And then you had the commercial APIs from Google, AWS, and Azure, which worked well but cost money and locked you into a cloud provider.

Whisper changed that overnight. A single model that handles transcription, translation, and language detection across 99 languages. And it's open source. Fully downloadable. Run it on your own hardware.

The Architecture

Whisper uses an encoder-decoder transformer, which is a well-established architecture but applied here with some smart choices. The audio gets converted to a log-mel spectrogram, chunked into 30-second segments, and fed through the encoder. The decoder then generates text tokens autoregressively.
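To make the preprocessing concrete, here's a rough numpy sketch of the pad-or-trim step. The constants match Whisper's (16 kHz audio, 30-second windows), but the function is my simplification of what whisper.pad_or_trim does, not the library code:

```python
import numpy as np

SAMPLE_RATE = 16000          # Whisper resamples all input to 16 kHz
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Zero-pad or cut the waveform to exactly 30 s, mirroring the step
    Whisper applies before computing the log-mel spectrogram."""
    if len(audio) >= N_SAMPLES:
        return audio[:N_SAMPLES]
    return np.pad(audio, (0, N_SAMPLES - len(audio)))

# A fake 12-second clip of silence gets padded up to the full window.
clip = np.zeros(12 * SAMPLE_RATE, dtype=np.float32)
chunk = pad_or_trim(clip)
print(chunk.shape)  # (480000,)
```

Every chunk that reaches the encoder therefore has the same fixed length, which is what lets a single model handle arbitrarily long audio: longer files are just processed window by window.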

The real magic is in the training data. OpenAI trained Whisper on 680,000 hours of multilingual audio from the web. That's an absurd amount of data. And because it includes both transcription and translation pairs, the model can transcribe audio in one language and output text in another.

The model comes in five sizes, from tiny (39M parameters) to large (1.5B parameters). The tiny model runs in near real-time on a laptop CPU. The large model needs a GPU but produces genuinely impressive results.
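For reference, the parameter counts from the Whisper repo are easy to keep in a table. The picker function below is just a toy helper of mine, not part of the whisper package:

```python
# Parameter counts for the five multilingual checkpoints (from the Whisper repo).
WHISPER_SIZES = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large": 1_550_000_000,
}

def largest_model_under(param_budget: int) -> str:
    """Pick the biggest checkpoint within a parameter budget.
    Relies on dict insertion order running smallest to largest."""
    fitting = [name for name, params in WHISPER_SIZES.items()
               if params <= param_budget]
    return fitting[-1]

print(largest_model_under(300_000_000))  # small
```

The name it returns is exactly what you'd pass to whisper.load_model.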

Trying It Out

Getting Whisper running is almost comically simple. One pip install (pip install openai-whisper, plus ffmpeg on your system for audio decoding), and then:

import whisper
 
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

That's it. Four lines of Python and you have transcription. No API keys. No authentication. No usage limits.
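The dict that transcribe returns also carries more than the text: it includes the detected language and per-segment timestamps, which makes subtitle generation nearly free. Here's a sketch using a hand-made result dict that mirrors the real structure (actual segments carry more fields than I show here):

```python
# Hand-made sample mirroring the dict that whisper's transcribe() returns.
result = {
    "text": " Hello world.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " Hello world."},
    ],
}

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

# Print each segment as a subtitle cue.
for seg in result["segments"]:
    print(f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}")
    print(seg["text"].strip())
```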

The first thing I did was test it on Hindi. I recorded myself speaking a couple of sentences in Hindi and ran them through the base model. The transcription was solid. Not perfect, but solid. It handled the Devanagari script output well and even got most of the transliteration right. For context, most open-source speech models barely support English, let alone Hindi. This felt like a breakthrough.

I also ran it on some of the audio from our ReAlive hackathon project, which had synthesized speech with background noise. Whisper handled it better than Google Speech-to-Text on the same clips. That surprised me.

How It Compares

I tested the large model against Google Cloud Speech-to-Text and AWS Transcribe on a small benchmark of English audio clips with varying noise levels.

Clean audio: All three performed similarly. Word error rates under 5%.

Noisy audio: Whisper pulled ahead. It was noticeably more robust to background noise, overlapping speech, and non-native accents. Google was close but AWS struggled more.

Hindi audio: Whisper won by a wide margin. Google's Hindi support exists but is inconsistent. Whisper was more reliable, especially on conversational speech.

The trade-off is speed. Whisper's large model is slower than the cloud APIs because you're running inference locally. But the base model is fast enough for most applications, and you're not paying per minute of audio.
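If you want to reproduce numbers like these, word error rate is just word-level Levenshtein distance divided by the reference length. A minimal self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") out of six reference words, so roughly 0.167.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Normalization matters a lot here: lowercasing and stripping punctuation before scoring can shift WER by several points, so compare systems with the same normalization.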

Why This Matters

Look, I spent a chunk of my hackathon season working with audio and voice features. ReAlive had audio synthesis. Meta-Identity had voice cloning. Every time we needed speech-to-text, we reached for a commercial API because the open-source options weren't good enough.

Whisper changes that calculus. It's free, it's good, and it runs locally. For hackathons, side projects, and startups that can't afford cloud API bills, this is a big deal. I've already started using it as my default transcription tool. Running it on my MacBook between classes at Northeastern, transcribing lecture recordings. My classmates think I'm being extra. They're probably right.