As AI continues to shape global communication, language inclusivity remains one of its most significant challenges. To close this gap, MLCommons has unveiled the Unsupervised People’s Speech dataset, an unprecedented multilingual audio collection with over 1 million hours of speech data across 89+ languages. This initiative opens the door to more equitable and accessible voice technologies for billions of speakers worldwide.
Engineering the Largest Open Speech Dataset
The project was co-led by Factored’s Center of Excellence in AI & Machine Learning. In collaboration with MLCommons and Common Crawl, Factored played a central role in developing scalable data pipelines, voice activity detection (VAD), and language identification using Whisper Large v3. The result: over 736,000 hours of validated speech, now freely available under permissive licenses.
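To make the language-identification step concrete, the sketch below shows how a single clip could be tagged with Whisper Large v3 through the open-source openai-whisper package. The file path is a placeholder, and the production pipeline ran this kind of step at far larger scale with its own batching and orchestration.

```python
# Minimal sketch: language identification with Whisper Large v3 (openai-whisper).
# The audio path is a placeholder; the real pipeline processed millions of clips.
import whisper

model = whisper.load_model("large-v3")

# Load up to 30 seconds of audio and compute the log-Mel spectrogram Whisper expects.
audio = whisper.load_audio("sample_clip.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Ask the model which language it hears; probs maps language codes to probabilities.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```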

Building the Infrastructure for Low-Resource Languages
This massive corpus supports the development of speech recognition models for both high- and low-resource languages, from English and Spanish to Quechua and Amharic. Speech segments were isolated with VAD tooling, and over 48 TB of audio was processed and stored using scalable data-engineering pipelines.
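The article doesn't name the specific VAD stack, but a minimal sketch of the idea, using the openly available Silero VAD model purely as an example, might look like this; the file name and sampling rate are illustrative.

```python
# Illustrative sketch: detect speech segments in one clip with Silero VAD
# (loaded from torch.hub). File name and sampling rate are placeholder choices.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("sample_clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Each entry gives start/end sample indices of a detected speech region.
for seg in segments:
    print(f"speech from {seg['start'] / 16000:.2f}s to {seg['end'] / 16000:.2f}s")
```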
Factored also supported ongoing efforts in deduplication and self-supervised training, helping set the stage for more robust and ethical multilingual AI systems.
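The exact deduplication approach isn't described here; as a rough illustration of the simplest variant, exact-duplicate audio files can be dropped by hashing their raw bytes, as in the toy sketch below (paths are placeholders, and real pipelines typically pair this with audio-level fuzzy matching).

```python
# Toy sketch of exact-duplicate removal by content hashing. Paths are placeholders;
# this only catches byte-identical files, not near-duplicates.
import hashlib
from pathlib import Path

seen: dict[str, Path] = {}
for path in Path("audio_corpus").rglob("*.wav"):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen:
        print(f"duplicate: {path} matches {seen[digest]}")
    else:
        seen[digest] = path
```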
A New Standard for Global AI Fairness
This dataset is a pivotal step toward digital inclusion. By enabling natural voice interaction in native languages, it reshapes how technology serves underrepresented communities. It also supports downstream tasks like speech synthesis and multilingual NLP, offering unprecedented resources to researchers and developers globally.
To read the full paper, click here.