Other existing approaches frequently use smaller, more closely matched audio-to-text training datasets,(^reference-1) (^reference-2)(^reference-3) or use broad but unsupervised audio pre-training.(^reference-4)(^reference-5)(^reference-6) Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
About a third of Whisper's audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised state of the art on CoVoST2 English translation in the zero-shot setting.