Building the Foundations of Fair Multilingual AI: Factored at the 1st Workshop on Multilingual Data Quality Signals (WMDQS)

Factored AI joins MLCommons, Common Crawl, and researchers worldwide to enhance language ID and create fair, inclusive multilingual data.

Key Takeaways:

• Measure what matters: Strengthen multilingual data metrics like LID confidence and diversity.
• Quality over quantity: Diverse, well-identified data drives fairer, more accurate AI models.
• Metadata is power: Capture provenance to enhance transparency, trust, and reproducibility.

In October 2025, Factored AI joined the global research community at the 1st Workshop on Multilingual Data Quality Signals (WMDQS), held at the Conference on Language Modeling (COLM) in Montreal. Today, over 95% of online content is concentrated in fewer than 20 languages, leaving thousands of others underrepresented or entirely absent from the datasets that shape large language models. This imbalance reinforces digital inequities and limits how AI understands the world’s linguistic and cultural diversity. The workshop’s purpose was clear: to ensure the next generation of language and multimodal models is built on high-quality, inclusive data.

Co-organized by MLCommons, the Common Crawl Foundation, EleutherAI, Factored AI, and Johns Hopkins University, the event brought together researchers, engineers, and practitioners who share a common mission: creating data foundations that reflect, respect, and represent the full spectrum of human language.

Why Multilingual Data Quality Matters

Behind every language model lies a vast ocean of data. But the more multilingual these systems become, the more fragile their data foundations grow. Online representation differs enormously across languages: data scarcity and noisy collection pipelines remain significant challenges for languages outside the top 20.

This is particularly true for language identification (LID), the first step in any multilingual data pipeline. Although it’s a well-defined multi-class classification problem, models often struggle when thousands of languages are involved, confusing dialects, regional variants, or closely related tongues. Even small inaccuracies in LID propagate downstream, affecting how models learn, generate text, or classify information.
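To make this concrete, here is a minimal sketch of confidence-aware LID in Python, using fastText’s publicly released lid.176 model. The model choice and the 0.8 cutoff are our illustrative assumptions, not the specific tooling discussed at the workshop; the point is simply to flag low-confidence predictions for review instead of letting silent mislabels flow downstream.

```python
import fasttext

# Pre-trained language ID model covering 176 languages; downloadable from
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def identify_language(text: str, threshold: float = 0.8):
    """Return (language_code, confidence), or (None, confidence) when the
    prediction is too uncertain to trust downstream."""
    # fastText predicts one line at a time and returns labels such as
    # '__label__en' together with a probability for each label.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    confidence = float(probs[0])
    if confidence < threshold:
        return None, confidence  # route to review rather than mislabel
    return lang, confidence

lang, conf = identify_language("Los datos multilingües de calidad importan.")
print(lang, conf)  # e.g. 'es' with high confidence; exact score varies by model
```

A gate like this does not fix the classifier, but it keeps its most uncertain decisions visible rather than baked into the corpus.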

These errors matter. When a model misidentifies a language, it reshapes how the system interprets meaning, style, or even cultural nuance. Poor LID data can distort entire datasets, leading to biased training outcomes and reinforcing global language inequities. In short, every downstream task in a multilingual pipeline depends on the reliability of upstream language identification.

At Factored, this challenge is deeply personal. Our teams have worked on building multilingual datasets that improve fairness and representation across languages, from underrepresented Latin American dialects to low-resource speech corpora. We care because data equity defines the foundation of ethical AI and because inclusion is a technical challenge we are committed to solving.

Inside WMDQS: Insights and Inspiration

Held at Montreal’s Palais des Congrès, WMDQS gathered a mix of experts passionate about multilingual fairness and representation. Talks ranged from the technical to the philosophical, reflecting the complexity of defining “quality” in multilingual data.

Factored’s engineers presented insights from our ongoing work with MLCommons and Common Crawl on evaluating the reliability of language identification in low-resource datasets. Our contribution emphasized practical methodologies for improving LID confidence scores and reducing error propagation in multilingual pipelines, especially for minority and regional languages often overlooked in large-scale training data.

Among the standout sessions were presentations by Sebastian Nagel, Julia Kreutzer, and David Ifeoluwa Adelani, exploring how Common Crawl’s evolving language distribution and Cohere Labs’ LLM experiments reveal the magnitude of this ongoing challenge. The shared message was clear: building high-quality multilingual data is not only a technical challenge but a collective responsibility.

Lessons We’re Bringing Back to Factored

Attending WMDQS reinforced how deeply every aspect of AI performance ties back to data quality signals. The key takeaways include:

  • Measure what matters: Strengthen multilingual data-quality metrics, including LID confidence, noise ratios, and cross-language diversity scores (see the first sketch after this list).

  • Understand compounding errors: Even small LID mistakes can cascade through training pipelines, distorting how models represent minority languages.

  • Quality over quantity: More data doesn’t mean better data. True progress comes from diversity, accurate identification, and thoughtful structure.

  • Metadata is power: Recording provenance and reliability information at ingestion time builds transparency, reproducibility, and long-term trust (see the second sketch after this list).
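To ground the “measure what matters” point, the small Python sketch below computes two illustrative corpus-level signals: the fraction of documents whose LID confidence falls below a cutoff (a crude noise ratio) and the normalized Shannon entropy of the surviving language distribution (a rough cross-language diversity score). The record format, the 0.8 cutoff, and the metric definitions are our own assumptions for illustration, not standards from the workshop.

```python
import math
from collections import Counter

def diversity_and_noise(records, min_confidence=0.8):
    """records: iterable of (language_code, lid_confidence) pairs.

    Returns (diversity, noise_ratio): the normalized Shannon entropy of the
    language distribution (0 = one language, 1 = perfectly even mix) and the
    fraction of records whose LID confidence fell below the cutoff."""
    langs = Counter()
    low_confidence = 0
    total = 0
    for lang, conf in records:
        total += 1
        if conf < min_confidence:
            low_confidence += 1  # exclude uncertain labels from the tally
        else:
            langs[lang] += 1
    noise_ratio = low_confidence / total if total else 0.0
    if len(langs) <= 1:
        return 0.0, noise_ratio  # entropy is trivially zero
    n = sum(langs.values())
    entropy = -sum((c / n) * math.log(c / n) for c in langs.values())
    # Divide by log(k) so an even mix of k languages scores exactly 1.0.
    return entropy / math.log(len(langs)), noise_ratio

corpus = [("en", 0.99), ("en", 0.95), ("es", 0.97), ("qu", 0.55), ("gn", 0.91)]
print(diversity_and_noise(corpus))  # ≈ (0.95, 0.2): even mix, one flagged doc
```

Neither number is meaningful in isolation, but tracked over time they show whether a pipeline is drifting toward majority languages or quietly accumulating noise.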
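And for the “metadata is power” point, here is a minimal sketch of what recording provenance at ingestion time can look like. The field names and schema are hypothetical, not a standard proposed at WMDQS.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceRecord:
    """Illustrative per-document metadata captured at ingestion time."""
    source_url: str        # where the document came from
    retrieved_at: str      # ISO-8601 fetch timestamp
    lid_model: str         # which LID model produced the label
    language: str          # predicted language code
    lid_confidence: float  # classifier confidence for that label
    pipeline_version: str  # lets the filtering be reproduced later

record = ProvenanceRecord(
    source_url="https://example.org/article",
    retrieved_at=datetime.now(timezone.utc).isoformat(),
    lid_model="lid.176.bin",
    language="es",
    lid_confidence=0.97,
    pipeline_version="2025.10.0",
)
print(asdict(record))  # serializable alongside the document itself
```

Stored next to each document, a record like this makes it possible to audit or re-filter a corpus later without re-crawling it.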

At Factored, these lessons continue to guide how we build datasets, refine models, and support global partners who share our vision for fairer AI systems.

Looking Forward: A Shared Vision

The future of AI depends on inclusive data foundations. As language and multimodal models become part of everyday tools, ensuring equal representation across languages is both a technical and a human imperative. Fair multilingual data is more than a research milestone; it’s a shared commitment to a world where every voice matters.

Workshops like WMDQS remind us that building equitable AI requires collaboration across communities, institutions, and cultures. Factored is proud to stand among those driving this effort forward, contributing not only engineering expertise but also a spirit of openness and shared progress.

This experience was made possible by Factored’s ongoing support and belief in enabling engineers to lead at the frontier of AI research. Our participation in WMDQS reflects our broader purpose of empowering brilliant minds to build the future of data and AI responsibly.

We invite the global community to join us in this mission: to make AI fairer, more inclusive, and truly representative of the world it seeks to serve.
