WhisperX Review
Open-source Whisper upgrade with 70x speed, word timestamps, and speaker labels
- Linux
- Windows
We may earn a commission. This doesn't affect our reviews.
Our Verdict
Best for developers and researchers who want Whisper-quality accuracy with word-level timestamps, speaker labels, and 70x speed — all without per-minute API costs. Skip if you need real-time streaming or a non-technical hosted solution.
What We Like
- 70x faster than base Whisper — one-hour files process in under a minute on high-end GPUs
- Word-level timestamps accurate to 20-50ms via wav2vec2 forced alignment
- Speaker diarization through pyannote-audio labels who said what in multi-speaker recordings
- Completely free and open-source with no API keys, usage limits, or cloud dependency
- VAD preprocessing reduces Whisper's hallucination problem on silent audio segments
Watch Out For
- Requires an NVIDIA GPU with at least 8GB VRAM — no practical CPU-only mode
- No real-time streaming support — batch processing only, not suitable for live captioning
- Single-maintainer open-source project, so long-term support isn't guaranteed
- Setup requires Python proficiency, command-line comfort, and manual pyannote token configuration
In-Depth Review
What Is WhisperX?
WhisperX takes OpenAI's Whisper — already one of the most accurate open-source speech recognition models — and fixes the three things that make raw Whisper frustrating for production use: slow processing speed, lack of word-level timestamps, and no speaker identification. It wraps Whisper with batched inference for 70x real-time speed, wav2vec2 forced alignment for syllable-accurate timestamps, and pyannote-audio for speaker diarization.
The result is an open-source pipeline that takes a long audio file and outputs a transcript with precise word timing and speaker labels. No API keys, no usage limits, no monthly bills. You do need a GPU and some Python familiarity to get it running.
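The three stages map to a handful of library calls. The sketch below follows the project's documented usage pattern; the file path, Hugging Face token, and batch size are placeholders, and it assumes a CUDA GPU plus an access token for the gated pyannote models.

```python
import whisperx

device = "cuda"             # WhisperX has no practical CPU-only mode
audio_file = "meeting.wav"  # placeholder path

# Stage 1: batched transcription via the faster-whisper backend
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Stage 2: wav2vec2 forced alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata,
                        audio, device)

# Stage 3: pyannote speaker diarization (needs a Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="hf_...", device=device)  # placeholder token
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```

Each stage loads its own model, so VRAM usage peaks during the transcription pass; the alignment and diarization models are comparatively small.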
Speed: 70x Real-Time Processing
Base Whisper processes audio at roughly 1x real-time on a decent GPU — a one-hour file takes about an hour. WhisperX uses batched inference through the faster-whisper backend to hit 70x real-time with the large-v2 model. That same one-hour file finishes in under a minute on an NVIDIA A100.
The speed improvement makes WhisperX practical for batch processing workflows. Transcribe an entire podcast season, process a day's worth of recorded meetings, or run through a research interview archive without waiting hours per file. On consumer GPUs like an RTX 3090 or 4090, expect roughly 20-30x real-time — still dramatically faster than vanilla Whisper.
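As a back-of-envelope check on those figures (the speedup factors are the ones quoted above, not fresh benchmarks):

```python
def estimated_wall_clock(audio_seconds: float, speedup: float) -> float:
    """Estimate processing time given a real-time-factor speedup."""
    return audio_seconds / speedup

one_hour = 3600
print(estimated_wall_clock(one_hour, 70))  # A100-class GPU: ~51 s
print(estimated_wall_clock(one_hour, 25))  # RTX 3090/4090: ~144 s
print(estimated_wall_clock(one_hour, 1))   # base Whisper: the full hour
```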
Word-Level Timestamps via Forced Alignment
Whisper produces segment-level timestamps — typically chunks of 5-30 seconds. WhisperX runs a second pass using wav2vec2-based forced alignment to pin each word to its exact position in the audio. The alignment is accurate to roughly 20-50 milliseconds, which is precise enough for subtitle generation, karaoke-style highlighting, and frame-accurate video editing.
This two-pass approach (transcribe first, then align) works better than trying to extract timestamps during decoding. You get Whisper's full language model accuracy for the text, plus phonetic alignment accuracy for the timing. The tradeoff is additional processing time for the alignment pass, though it adds only 10-15% to total runtime.
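Word-level timing is what makes subtitle generation straightforward. The helper below is a minimal sketch that groups word entries into SRT cues; the `word`/`start`/`end` dict shape mirrors WhisperX's aligned output, but the grouping logic and `max_words` cutoff are illustrative choices, not part of the library.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group word-level entries into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0]["start"], chunk[-1]["end"]
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n"
                    f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
                    f"{text}\n")
    return "\n".join(cues)

words = [{"word": "Hello", "start": 0.32, "end": 0.61},
         {"word": "world", "start": 0.65, "end": 1.02}]
print(words_to_srt(words))
```

(WhisperX also ships its own SRT/VTT export; this just shows how little work the word-level timing leaves to do.)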
Speaker Diarization with Pyannote
WhisperX integrates pyannote-audio for speaker diarization — identifying who spoke each segment. You get labels like SPEAKER_00, SPEAKER_01, etc., mapped to each word or segment in the transcript. This makes WhisperX viable for multi-speaker content like interviews, panel discussions, and meetings where knowing who said what matters.
Diarization quality depends on audio conditions. Clean recordings with distinct speakers work well. Overlapping speech, similar-sounding voices, or noisy environments degrade accuracy. For critical applications, expect to do some manual correction on the speaker labels.
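A common post-processing step is collapsing the labeled segments into readable speaker turns. This sketch assumes segments shaped like WhisperX's diarized output (a `speaker` key alongside `start`/`end`/`text`); the sample data is hypothetical.

```python
def merge_speaker_turns(segments):
    """Collapse consecutive same-speaker segments into speaker turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.1, "text": "Hi there."},
    {"speaker": "SPEAKER_00", "start": 2.3, "end": 4.0, "text": "Ready to start?"},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.5, "text": "Yes, let's go."},
]
for t in merge_speaker_turns(segments):
    print(f'[{t["speaker"]}] {t["text"]}')
```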
VAD Preprocessing: Fewer Hallucinations
A known Whisper problem is hallucination during silent segments — the model generates phantom text when there's no speech. WhisperX runs Voice Activity Detection (VAD) as a preprocessing step, stripping silent sections before they reach Whisper. This significantly reduces hallucinated output, especially on recordings with long pauses or quiet sections.
The VAD filter also speeds up processing by skipping non-speech audio entirely. A two-hour recording with 30 minutes of actual speech only processes 30 minutes through the model.
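The arithmetic behind that claim is simple: only the detected speech intervals reach the model. The interval list below is hypothetical VAD output, not a real trace.

```python
def speech_seconds(vad_segments):
    """Total speech time from VAD (start, end) intervals, in seconds."""
    return sum(end - start for start, end in vad_segments)

# Hypothetical VAD output for a two-hour recording: 30 min of speech
vad = [(0, 600), (900, 1500), (4000, 4600)]
total_audio = 7200
speech = speech_seconds(vad)
print(speech / 60)            # minutes actually transcribed
print(speech / total_audio)   # fraction of the file reaching the model
```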
Setup and System Requirements
WhisperX runs on Linux and Windows (macOS support is limited). You need Python 3.8+, PyTorch with CUDA support, and an NVIDIA GPU with at least 8GB VRAM for the large-v2 model. The smaller models (base, small, medium) run on 4GB VRAM but with reduced accuracy.
Installation is via pip, typically inside a conda environment, with a few manual dependency steps for pyannote (which requires a Hugging Face token for model access). The setup is more involved than a commercial API but straightforward for developers comfortable with Python environments. Expect 15-30 minutes from clone to first transcription.
When to Use WhisperX vs Commercial APIs
Choose WhisperX when you need word-level timestamps and speaker labels without per-minute costs. It's ideal for batch processing large archives, research transcription, subtitle generation, and any workflow where you're processing more than a few hours of audio per month — the cost savings over Deepgram ($0.0043/min) or AssemblyAI ($0.0065/min) add up fast.
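The break-even math is easy to run yourself. Using the per-minute rates quoted above and a hypothetical 100-hour monthly archive:

```python
DEEPGRAM_PER_MIN = 0.0043    # rates as quoted in this review
ASSEMBLYAI_PER_MIN = 0.0065

def monthly_api_cost(hours_per_month: float, rate_per_min: float) -> float:
    """API spend for a given monthly audio volume."""
    return hours_per_month * 60 * rate_per_min

hours = 100  # e.g., a podcast back-catalog or meeting archive
print(round(monthly_api_cost(hours, DEEPGRAM_PER_MIN), 2))    # 25.8
print(round(monthly_api_cost(hours, ASSEMBLYAI_PER_MIN), 2))  # 39.0
```

Against that, WhisperX's cost is GPU time you may already own, which is why the savings compound with archive size.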
Choose a commercial API when you need real-time streaming transcription, zero setup, or guaranteed uptime. WhisperX is a batch processing tool — there's no WebSocket streaming endpoint. If you're building a live captioning system or voice bot, Deepgram or AssemblyAI are better fits.
Comparison: WhisperX vs Base Whisper
Base Whisper gives you a transcript with segment-level timestamps. WhisperX gives you word-level timestamps, speaker labels, 70x faster processing, and fewer hallucinations. If you're using Whisper for anything beyond quick one-off transcriptions, WhisperX is the strictly better option — it uses the same underlying model with production-grade enhancements layered on top.
Limitations
WhisperX requires a GPU (no CPU-only mode for practical use), Python proficiency, and comfort with command-line tools. There's no GUI, no web interface, and no hosted version. Speaker diarization needs a Hugging Face access token. The project is maintained by a single researcher (Max Bain), so long-term support depends on community and academic backing. And like all Whisper-based tools, accuracy drops on heavily accented speech, low-quality audio, and domain-specific jargon without fine-tuning.
Verdict
WhisperX is what Whisper should have been for production use. The 70x speed improvement, word-level timestamps, and speaker diarization transform Whisper from a research demo into a practical transcription pipeline. Best for developers and researchers processing large audio archives who want Whisper-quality accuracy without per-minute API costs. Skip if you need real-time streaming, a hosted service, or a non-technical setup.
Key Features
- 70x real-time batched inference
- Word-level timestamp alignment
- Speaker diarization
- Voice Activity Detection preprocessing
- Multiple Whisper model sizes
- SRT/VTT subtitle export
- Multilingual transcription
- GPU-accelerated processing
Pricing Plans
Open Source
Free
- 70x real-time transcription speed
- Word-level timestamps
- Speaker diarization
- VAD preprocessing
- No usage limits
WhisperX FAQ
Does WhisperX work on macOS?
macOS support is limited. WhisperX is primarily designed for Linux and Windows systems with NVIDIA GPUs. While it may run on macOS with Apple Silicon via the MPS backend, this is not the officially supported configuration and performance will be significantly lower than CUDA-accelerated processing.