How to transcribe podcast audio (WhisperX with speaker diarization)

Note: sometimes WhisperX is WAY too slow, so I often end up using , which somehow runs much faster.

I do a lot of podcast transcription work and needed it again today. The Hugging Face Spaces (like this one) always error out, so they aren't very useful.

This is the one that worked for me.

Note: if you run into a 'soundfile' backend is not available error, run conda install -c conda-forge libsndfile to fix it.

  1. Make sure you have a .wav file of your podcast audio; you can use QuickTime or Audacity to convert it. This process doesn't work with MP3.
  2. Run pip3 install git+ (this will take a couple of minutes, so meanwhile…)
  3. Set up diarization. To enable VAD filtering and diarization, pass a Hugging Face access token (you can generate one here) via the --hf_token argument, and accept the user agreements for the following models in your Hugging Face account: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization. Make sure to accept them all.
  4. Run whisperx YOUR_AUDIO_FILE.wav --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize --min_speakers 3 --max_speakers 3 --language en for three speakers in English. Remember: it must be a .wav file.
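Once the command above finishes, WhisperX writes its results to files next to your audio, including a JSON with one entry per segment. As a rough sketch (assuming each segment carries "speaker" and "text" keys, which is the shape I've seen from diarized runs; field names may vary by WhisperX version), you can collapse that output into a readable, speaker-labeled transcript:

```python
def format_transcript(segments):
    """Merge consecutive segments from the same speaker into one line.

    Assumes each segment is a dict with 'speaker' and 'text' keys,
    as in WhisperX's diarized JSON output (names may vary by version).
    """
    lines = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if lines and lines[-1][0] == speaker:
            # Same speaker as the previous segment: append to their turn.
            lines[-1] = (speaker, lines[-1][1] + " " + text)
        else:
            lines.append((speaker, text))
    return "\n".join(f"{spk}: {txt}" for spk, txt in lines)

# Example with made-up segments in the assumed shape:
segments = [
    {"speaker": "SPEAKER_00", "text": " Welcome to the show."},
    {"speaker": "SPEAKER_00", "text": " Today we have a guest."},
    {"speaker": "SPEAKER_01", "text": " Thanks for having me!"},
]
print(format_transcript(segments))
```

In practice you'd load the real file with json.load() and pass its "segments" list in.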


Transcription runs at roughly real time (about 30 seconds of processing per 30 seconds of audio), so expect it to take about as long as your podcast.
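Since it runs at roughly real time, you can estimate the wait up front by reading the audio duration from the .wav header with Python's standard library (a quick sketch; assumes a standard PCM .wav file):

```python
import wave

def estimate_transcription_seconds(wav_path):
    """Return the audio duration in seconds, which is roughly how long
    WhisperX will take at ~1x real-time speed."""
    with wave.open(wav_path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
    return frames / rate

# e.g. estimate_transcription_seconds("my_podcast.wav")
```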

