Factored AI leads the 1st Workshop on Multilingual Data Quality Signals at COLM 2025 and authors “The Klingon Effect,” advancing multilingual AI research.

Advancing Multilingual AI Research at COLM 2025

Factored leads COLM 2025 workshop and authors “The Klingon Effect,” advancing multilingual data quality and AI reliability.

Key Takeaways:

1. Factored co-led the 1st Workshop on Multilingual Data Quality Signals at COLM 2025 in collaboration with MLCommons and Common Crawl.
2. The paper “The Klingon Effect” reveals how linguistic regularity in constructed languages influences model reliability in multilingual AI.
3. The work strengthens global research on data-centric AI, reproducibility, and responsible multilingual model development.

Redefining the boundaries of multilingual AI through precision, research, and real-world execution.

At Factored AI, we push the frontiers of multilingual modeling by merging academic rigor with applied engineering. Our Centers of Excellence serve as research accelerators, combining deep expertise in machine learning, data engineering, and infrastructure design to strengthen the foundations of reliable, data-driven AI.

Built by brilliant teams from the top 1% of global talent, Factored enables organizations and researchers to design, deploy, and scale systems that perform consistently across languages, cultures, and domains, ensuring that AI works for everyone, everywhere.

Exploring Linguistic Structure to Improve AI Reliability

Factored engineer Teógenes Moura authored “The Klingon Effect: When Constructed Languages Win at LID”, a study showing how constructed languages, such as Klingon, can outperform natural ones in language identification (LID) tasks.

This research investigates the intriguing connection between linguistic regularity and model reliability, revealing that the structure and consistency of a language can significantly impact the performance of large-scale multilingual systems.

Figure: Comprehensive comparison of constructed vs. natural language identification performance


The findings challenge assumptions about how LLMs learn, suggesting that model robustness depends as much on data design and balance as on volume and scale.

Why It Matters: Designing AI That Understands the World Better

Improving multilingual pre-training isn’t just a research milestone; it’s a necessary step toward fairer, more inclusive AI systems.

  • Better data signals yield stronger, more generalizable models.

  • Collaborative benchmarks define higher standards for multilingual performance.

  • Open-science partnerships accelerate innovation through transparency and reproducibility.

This research contributes to the development of AI systems that are reliable, equitable, and representative of global linguistic diversity, ensuring technology evolves in ways that reflect the richness and complexity of human communication.

Part of a Broader Research Effort on Data Quality

This paper was presented at the 1st Workshop on Multilingual Data Quality Signals, held on October 10, 2025, at the Palais des Congrès in Montréal, Canada, during the Conference on Language Modeling (COLM).


The workshop, co-led by Factored in collaboration with MLCommons and the Common Crawl Foundation, brought together global researchers to explore how multilingual data quality shapes the reliability and inclusiveness of large language models.

This study forms part of that broader initiative, reinforcing Factored’s ongoing contribution to open, data-centric research.

Learn more about our role in advancing data quality research at COLM 2025: https://www.factored.ai/publications/colm-2025-multilingual-data-quality-workshop-factored

From Research to Real-World Impact

Through our Centers of Excellence, Factored bridges academic research and practical implementation, translating complex insights into scalable systems that improve how organizations build, measure, and deploy AI.

By integrating deep research collaboration with real-world engineering, we help shape an ecosystem where data quality, transparency, and inclusiveness are the standard—not the exception.

With Brilliant Teams accelerating AI, we transform discovery into impact—ensuring innovation moves forward reliably, responsibly, and at scale.

🔗 Read the full paper: “The Klingon Effect: When Constructed Languages Win at LID.”
