Torchaudio: Audio Processing and Deep Learning with PyTorch
If you're working on audio applications like speech recognition, sound classification, or audio synthesis, and you're a fan of PyTorch, then Torchaudio is your go-to library.
Torchaudio is a powerful and flexible Python library developed by the PyTorch team for loading, transforming, and processing audio data, and for integrating it seamlessly into deep learning pipelines.
Why Torchaudio?
Torchaudio offers:
- Efficient audio I/O: Load and save .wav, .mp3, .flac, etc.
- Signal processing transforms: Spectrograms, MFCCs, Mel filters
- Pre-trained models: For speech recognition and speaker ID
- PyTorch integration: Perfect for training and inference
Installation
Install it via pip:
pip install torchaudio
Make sure a matching version of PyTorch is installed too, since each Torchaudio release is built against a specific PyTorch release:
pip install torch
Loading Audio
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
print(waveform.shape, sample_rate)
This gives you a PyTorch tensor of shape (channels, frames) with the raw waveform, plus the sample rate in Hz. Super easy.
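If you don't have an audio file handy, you can generate a test tone with nothing but the Python standard library. The sketch below is only an illustration: the file name, tone frequency, and helper name `write_test_tone` are made up for this example and are not part of Torchaudio.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000   # samples per second
DURATION_S = 1.0       # length of the tone in seconds
FREQ_HZ = 440.0        # pitch of the tone (concert A)

def write_test_tone(path):
    """Write a mono 16-bit PCM sine wave and return the number of frames written."""
    num_frames = int(SAMPLE_RATE * DURATION_S)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
        for n in range(num_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return num_frames

# Hypothetical location for the example file.
TONE_PATH = os.path.join(tempfile.gettempdir(), "tone.wav")
num_frames = write_test_tone(TONE_PATH)
print(TONE_PATH, num_frames)
```

Passing this path to torchaudio.load should then give a waveform tensor of shape (1, 16000) and a sample rate of 16000, since the file is one second of mono audio at 16 kHz.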
Audio Transforms
Torchaudio includes many transforms out-of-the-box:
import torchaudio.transforms as T
mel_spectrogram = T.MelSpectrogram(sample_rate=sample_rate)(waveform)
db_spec = T.AmplitudeToDB()(mel_spectrogram)
Common transforms:
- Resample
- Spectrogram
- MelSpectrogram
- MFCC
- PitchShift
- TimeStretch
They are implemented as nn.Module objects on top of PyTorch operations, so gradients can flow through them and you can use them inside your training loops.
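To get a feel for what the MelSpectrogram and AmplitudeToDB lines produce, the shape arithmetic and the decibel conversion can be sketched in plain Python. This is a rough sketch under assumptions, not Torchaudio's implementation: it assumes the documented defaults (n_fft of 400, a hop of half the window, 128 mel bins, centered frames) and a power spectrogram converted as 10·log10 with a small floor. Both helper names are hypothetical, so check the Torchaudio docs for the version you use.

```python
import math

def mel_spectrogram_shape(num_channels, num_samples, n_fft=400, hop_length=None, n_mels=128):
    """Expected (channels, n_mels, frames) for a centered, STFT-based mel spectrogram."""
    if hop_length is None:
        hop_length = n_fft // 2                   # assumed default: half the window
    num_frames = 1 + num_samples // hop_length    # centered framing pads both ends
    return (num_channels, n_mels, num_frames)

def power_to_db(power, amin=1e-10):
    """Convert a power value to decibels, flooring tiny values to avoid log(0)."""
    return 10.0 * math.log10(max(power, amin))

# One second of mono audio at 16 kHz.
shape = mel_spectrogram_shape(num_channels=1, num_samples=16_000)
print(shape)               # (1, 128, 81)
print(power_to_db(1.0))    # 0.0
print(power_to_db(100.0))  # 20.0
```

So a one-second mono clip becomes a small "image" of 128 mel bands by 81 time frames, which is the kind of input many audio classifiers are trained on.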
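Pre-trained Models
Torchaudio offers several high-quality pretrained models through torchaudio.pipelines:
- Wav2Vec2, HuBERT, WavLM: for speech recognition and embeddings
- Speaker Verification: ECAPA-TDNN-style models (check the torchaudio.pipelines documentation for which checkpoints are currently bundled)
- Forced Alignment: aligning transcripts to audio from CTC emissions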
Example: Speech Recognition
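import torch
import torchaudio
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # convert to mono
# The model expects audio at bundle.sample_rate, so resample if needed
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
with torch.inference_mode():
    emissions, _ = model(waveform)  # per-frame scores for each label
# Use a decoder (greedy or beam search) to turn the emissions into a transcription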
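The decoding step left as a comment above can be sketched with a greedy CTC decoder: pick the best label per frame, merge repeats, and drop the blank symbol. The version below is a minimal illustration in plain Python on a toy emission matrix. The label set and scores are invented for the example, and it assumes the blank sits at index 0 and "|" marks word boundaries.

```python
def greedy_ctc_decode(emission, labels, blank=0):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks.

    emission: list of frames, each a list with one score per label.
    labels:   sequence mapping label index -> character.
    """
    # 1. Pick the highest-scoring label index for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # 2. Collapse runs of the same index and 3. skip the blank symbol.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)

# Toy label set (hypothetical): index 0 is the CTC blank, "|" is a word boundary.
labels = ["-", "|", "H", "I"]
# Toy emission with 6 frames; each row holds one score per label.
emission = [
    [0.1, 0.0, 0.8, 0.1],  # H
    [0.1, 0.0, 0.7, 0.2],  # H again (merged with the previous frame)
    [0.9, 0.0, 0.0, 0.1],  # blank
    [0.1, 0.0, 0.1, 0.8],  # I
    [0.1, 0.7, 0.1, 0.1],  # |
    [0.8, 0.1, 0.0, 0.1],  # blank
]
transcript = greedy_ctc_decode(emission, labels).replace("|", " ").strip()
print(transcript)  # HI
```

With the real model, emissions[0].tolist() and bundle.get_labels() would take the place of the toy inputs. Greedy decoding is the simplest option; beam search with a language model usually gives better transcripts.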
Use Cases
- Speech Recognition
- Speaker Identification
- Music Genre Classification
- Audio Embedding & Clustering
- Sound Effects & Augmentation
- Training custom audio neural networks
Integration with PyTorch
Torchaudio's datasets and transforms behave like any other PyTorch dataset and transform pipeline. You can easily integrate them with:
- torch.utils.data.Dataset
- DataLoader
- nn.Module and training loops
Perfect for building end-to-end models.
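One practical detail when batching audio with a DataLoader is that clips usually differ in length, so they have to be padded to a common length first. Here is a minimal sketch of that idea using plain Python lists. `pad_batch` is a hypothetical helper; with real tensors the same role is typically played by a custom collate_fn built around torch.nn.utils.rnn.pad_sequence.

```python
def pad_batch(waveforms, pad_value=0.0):
    """Right-pad variable-length mono waveforms to the longest one in the batch.

    Returns the padded batch and the original lengths, which a model can
    use later to ignore the padded samples.
    """
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    return padded, lengths

# Three toy "clips" of different lengths (values are made up).
clips = [[0.1, 0.2, 0.3], [0.5], [0.4, -0.4]]
batch, lengths = pad_batch(clips)
print(batch)    # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0], [0.4, -0.4, 0.0]]
print(lengths)  # [3, 1, 2]
```

Keeping the original lengths alongside the padded batch lets downstream code mask out the padding when computing losses.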
Dataset Utilities
Torchaudio includes popular datasets like:
- LJSpeech
- CommonVoice
- LibriSpeech
- VCTK
- YESNO
- and more...
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)
# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, label, *_ = dataset[0]
⚡ Tip: Combine with Hugging Face Transformers
Use Torchaudio for preprocessing and Hugging Face Transformers (with checkpoints like facebook/wav2vec2-base) for training large-scale speech models.
Final Thoughts
Torchaudio is the PyTorch-native way to work with audio data. It's fast, flexible, and fully integrates with the PyTorch ecosystem — making it the ideal tool for audio deep learning workflows, from preprocessing to model training and inference.