Publications - Multilingual Data Workshop

Engineering the foundations of reliable multilingual AI through data-centric research and precision.

At Factored AI, we drive cutting-edge research that strengthens the foundations of modern AI systems. Our Centers of Excellence operate as research accelerators, combining applied machine learning expertise, data engineering rigor, and real-world execution to advance the science of data-centric AI. Built by brilliant teams from the top 1% of global talent, Factored accelerates the way organizations and researchers design, deploy, and scale reliable AI systems that deliver measurable impact.

As part of our ongoing collaboration with leading organizations such as MLCommons and the Common Crawl Foundation, Factored will participate in the Conference on Language Modeling (COLM 2025) in Montréal, where it will lead the 1st Workshop on Multilingual Data Quality Signals. The workshop is an open-science initiative uniting researchers and organizations committed to improving multilingual data quality and transparency in AI development.

This collaboration reinforces our role as technical leaders and trusted research partners, advancing data integrity and reproducibility across the global AI ecosystem.

Factored AI - Common Crawl - MLCommons - EleutherAI - COLM

This workshop addresses one of the most critical challenges in language modeling today: ensuring data quality in multilingual pre-training. High-quality data, not just scale, is what determines the reliability, inclusiveness, and long-term value of large language models. This research directly supports our mission to build AI systems that serve companies globally, across languages, domains, and cultures, making technology more accessible, equitable, and effective worldwide.

From Research to Implementation

Through our Centers of Excellence, Factored AI bridges research and real-world delivery. We play a key role in designing data pipelines, evaluation frameworks, and scalable infrastructure for multilingual benchmarks that will guide future model development.

Factored’s elite talent drives impact beyond implementation: two of our engineers serve on the workshop’s Organizing Committee, shaping both the technical and collaborative dimensions of this open-science initiative. Their work focuses on building open, high-coverage systems for language identification in web-scale data, a critical component in improving LLM pre-training pipelines and advancing global AI research.

This contribution continues our commitment to open science and reproducibility, accelerating innovation while upholding the highest standards of data integrity and research transparency.

Sara Hincapié (Factored AI): Senior Software Engineer with experience in product-focused full-stack development and machine learning model integration. Her work focuses on building accessible, user-centered web applications and tools. She has contributed to projects such as MLSuperb 2.0, Helpmed, and Dynabench.

Rafael Mosquera (Factored AI): Senior Machine Learning Engineer specializing in NLP and audio ML systems. He has contributed to several projects, including BabyLM, the Prism dataset (NeurIPS 2024 Best Paper Award - Datasets & Benchmarks), the People's Speech dataset, and Dynabench.

Why It Matters

Collaborative infrastructure ensures that progress in AI benefits every research field relying on language data.
Improved data signals lead to more robust and fair multilingual models.
Community-driven benchmarks help define global standards for data quality.

This research reflects Factored’s focus on advancing responsible, human-centered AI, developing technologies that are as ethical and inclusive as they are effective.

The Factored AI Impact

Our participation at COLM 2025 reinforces Factored’s position at the intersection of academic excellence and applied engineering, delivering technical contributions that shape how the world builds, measures, and trusts AI systems.

With Brilliant Teams accelerating AI, we turn research into real-world impact, ensuring innovation moves forward reliably, responsibly, and at scale.

Learn more about the Workshop on Multilingual Data Quality Signals.
