finrift
New AI Tool Capable of Translating 200 Languages

Machine translation has long been a core task in natural language processing, and the advent of pre-trained models has dramatically improved the performance of machine translation systems. However, current systems focus primarily on translating between English and other languages. Many languages lack sufficient training data, which makes it difficult to build a machine translation model that covers them all.

Following the release of its open-source large-scale pre-trained model OPT, Meta AI has introduced its latest innovation, NLLB (No Language Left Behind). The model supports two-way translation between more than 200 languages and is also open-source.

The model includes support for Simplified Chinese, Traditional Chinese, and Cantonese. In addition to widely spoken languages like Chinese, English, French, and Japanese, it also covers many less common languages. The new model is said to be capable of translating into 55 African languages with "high-quality results".

With NLLB, people worldwide can access and share web content in their native languages and communicate with others regardless of their language preferences.

Meta announced plans to initially implement this technology on Facebook and Instagram to enhance the quality of machine translation for less common languages on these platforms.

How to support languages with limited corpora

How was this AI model, which has become proficient in over 200 languages, trained?

According to Meta AI, its researchers address the problem of limited language corpora through three main approaches.

One approach is to automatically construct high-quality datasets for languages with limited corpora. Alongside this, researchers created a many-to-many multilingual dataset called Flores-200, which is translated by people rather than generated automatically: professional translators and reviewers follow unified standards to ensure both the quality and the size of the dataset.

Translators first translated and checked all sentences in Flores-200. An independent team of reviewers then assessed the translation quality and, based on that assessment, sent some translations back for post-editing.

A language is included in Flores-200 only if its translations score above 90% in this quality assessment.

In total, Flores-200 contains translations of 3,001 sentences drawn from 842 distinct articles.
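The 90% inclusion rule described above can be pictured as a simple filter. The following is only a minimal sketch: the 90% threshold comes from the article, while the function name, the language labels, and the scores are hypothetical and are not real Flores-200 review results.

```python
# Minimal sketch of the inclusion rule. Only the 90% threshold comes from the
# article; every name and score below is a hypothetical placeholder.
QUALITY_THRESHOLD = 90.0  # percent


def languages_passing_review(quality_by_language, threshold=QUALITY_THRESHOLD):
    """Return the sorted language labels whose quality score exceeds the threshold."""
    return sorted(
        lang for lang, score in quality_by_language.items() if score > threshold
    )


# Made-up review scores, purely for illustration.
example_scores = {"lang_a": 95.2, "lang_b": 88.0, "lang_c": 90.0, "lang_d": 91.5}
passing = languages_passing_review(example_scores)
print(passing)
```

The comparison is strict because the article says the quality must exceed 90%, so a language scoring exactly 90.0 is left out in this sketch.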

The second approach involves modeling 200 languages: researchers devised a language identification (LID) system to determine the language of a given text.

LID models trained with supervision may still struggle to spot faulty grammar and incomplete phrases in sentences that otherwise look coherent.
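To make the idea of language identification concrete, here is a toy sketch that compares character trigram counts using cosine similarity. The article does not describe how Meta's LID system works internally, so this is not their method. The training sentences, the labels "eng", "fra", and "spa", and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split lowercased text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples):
    """Build one n-gram count profile per language from a few labelled sentences."""
    profiles = {}
    for lang, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence))
        profiles[lang] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def identify_language(text, profiles):
    """Return the label whose profile is most similar to the text."""
    query = Counter(char_ngrams(text))
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Tiny hypothetical training set; a real system needs far more data per language.
SAMPLES = {
    "eng": ["the cat is on the table", "this is a good day for the team",
            "where is the nearest station"],
    "fra": ["le chat est sur la table", "c'est une bonne journée pour l'équipe",
            "où est la gare la plus proche"],
    "spa": ["el gato está sobre la mesa", "es un buen día para el equipo",
            "dónde está la estación más cercana"],
}

profiles = train_profiles(SAMPLES)
print(identify_language("the dog is in the house", profiles))
```

A classifier like this only measures surface similarity to its training text. It has no way to tell a fluent sentence from a broken one, which is one reason noisy input remains hard for LID.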

The third approach involves doubling the coverage of FLORES, a human-translated evaluation benchmark, so that the translation quality of each language can be assessed. While automated scoring is crucial for advancing this research, manual evaluation remains essential for accurate assessment.

By combining automatic scoring with manual evaluation, translation quality can be measured comprehensively, which in turn helps guide improvements.
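One common way to relate the two kinds of evaluation is to check how well an automatic score tracks human ratings. The sketch below uses a toy character-bigram overlap score and a Pearson correlation. The article does not name the metrics Meta uses, so the metric here is a stand-in, and the sentences and human ratings are invented for illustration.

```python
import math
from collections import Counter


def char_ngram_f1(hypothesis, reference, n=2):
    """Toy automatic score: F1 overlap between character n-grams of two strings."""
    hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    overlap = sum((hyp & ref).values())  # n-grams shared by both strings
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


# Hypothetical system outputs, reference translations, and human ratings (1 to 5).
references = ["the cat sits on the mat", "she reads a book", "we are going home"]
hypotheses = ["the cat sits on the mat", "she is reading books", "home they went fast"]
human_ratings = [5, 4, 2]

auto_scores = [char_ngram_f1(h, r) for h, r in zip(hypotheses, references)]
print(auto_scores)
print(pearson(auto_scores, human_ratings))
```

A high correlation suggests the automatic score can stand in for human judgment during day-to-day development, while a low one signals that manual evaluation is still needed for that language.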
