Researchers at Amazon Web Services' (AWS) AI Lab have discovered that a large amount of online content comes from machine translation (MT) sources.
This content has been translated into many different languages and is often of low quality. The team says this highlights the critical need to consider data quality and provenance when training large language models (LLMs).
The researchers also found that machine-generated content is common in translations of low-resource languages and accounts for a significant portion of all content on the web.
“In fact, we became interested in this topic when several colleagues who work in MT and are native speakers of low-resource languages pointed out that much of the Internet in their native languages appeared to be machine-translated,” Mehak Dhaliwal, a former applied science intern at AWS and current doctoral student at the University of California, Santa Barbara, told Motherboard.
“So the insights actually come from people who speak low-resource languages, and we did the research to better understand the problem and see how widespread it is.”
The team has developed a vast resource known as the Multi-Way ccMatrix (MWccMatrix) to better understand the characteristics of machine-translated content. This resource contains 6.4 billion unique sentences in 90 different languages and includes translation tuples (sets of sentences in different languages that have been translated into each other).
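The notion of a translation tuple can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the authors' code or the actual MWccMatrix format: it groups sentences that are translations of one another under a shared tuple ID and counts how many languages each tuple spans, which is one simple way to measure how "multi-way" a piece of content is.

```python
from collections import defaultdict

# Toy corpus of (tuple_id, language, sentence) rows.
# The tuple_id, language codes, and sentences here are invented examples.
corpus = [
    ("t1", "en", "The quick brown fox."),
    ("t1", "fr", "Le renard brun rapide."),
    ("t1", "de", "Der schnelle braune Fuchs."),
    ("t2", "en", "Hello world."),
    ("t2", "es", "Hola mundo."),
]

# Group sentences into translation tuples: tuple_id -> {language: sentence}.
tuples = defaultdict(dict)
for tuple_id, lang, sentence in corpus:
    tuples[tuple_id][lang] = sentence

# Multi-way count: how many languages each tuple covers. Content translated
# into many languages at once is a signal of possible machine translation.
multiway = {tid: len(langs) for tid, langs in tuples.items()}
print(multiway)  # {'t1': 3, 't2': 2}
```

In the study's framing, tuples covering many languages are the interesting case: human translators rarely render the same sentence into dozens of languages, so highly multi-way content is disproportionately machine-generated.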
The study, submitted to Cornell University's preprint server arXiv, found that vast amounts of web content are often translated into numerous languages, primarily through machine translation. This content is not only prevalent in translations for low-resource languages, but also constitutes a significant portion of all web content in these languages.
The researchers also found a selection bias in the type of content that is translated into multiple languages, perhaps for the purpose of generating advertising revenue.
The paper concludes: “MT technology has improved dramatically over the past decade, but still falls short of human quality. MT content has been added to the web over many years using whatever MT systems were available at the time, so much of the MT on the web is likely very low quality by modern standards. This could produce less fluent LLMs with more hallucinations, and the selection bias indicates the data may be lower quality even before considering MT errors. Data quality is crucial in LLM training, and high-quality corpora, such as books or Wikipedia articles, are typically upsampled several times.”