I find that Meta’s translations are very poor compared to others’, at least for relatively obscure languages, which I figured was relevant given the article.
Google Translate is a good default, but LLMs are really good at translation, as they’re better at understanding context and producing culturally appropriate output.
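For what it's worth, the practical difference is that you can hand an LLM the surrounding context, which a plain translation API call has no slot for. A minimal sketch, assuming the OpenAI Python SDK; the model name and prompt wording are illustrative, not a recommendation:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example: the system prompt carries register/context that a
# plain translation API request has no way to express.
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any strong multilingual model
    messages=[
        {
            "role": "system",
            "content": (
                "Translate the user's message into Khmer. It is an informal "
                "chat message between friends; keep the register casual."
            ),
        },
        {"role": "user", "content": "See you at the market tomorrow!"},
    ],
)
print(resp.choices[0].message.content)
```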
Hello from Siem Reap, Cambodia! Awesome to see a fellow tech enthusiast from Cambodia.
(I live in Cambodia, where they speak Khmer.)
I actually found Facebook’s translations pretty good (better than Google Translate for anything longer than a sentence). From my understanding, Khmer is a bit more verbose and context-dependent, so LLMs would be a big help in picking up those nuances.
In the inverse case (LLMs generating Khmer from English), I’ve heard from locals that the output sounds formal and “robotic”, which I found quite interesting.
So LLMs are noticeably better in Khmer than Google Translate? I wonder why Google Translate doesn't use Gemini under the hood. Perhaps it's more prone to hallucinations.
I'm interested in finding some thorough testing of translations across different LLMs vs. translation APIs.
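A minimal sketch of how such a head-to-head could be scored, assuming the sacrebleu package and a reference set such as FLORES; the sentences and system outputs below are made up for illustration:

```python
# pip install sacrebleu
import sacrebleu

# References come as a list of reference streams, each parallel to the
# hypotheses; here: one reference stream, two test sentences (made up).
refs = [[
    "The market opens at six in the morning.",
    "Please wait for me at the temple gate.",
]]

systems = {
    "llm": [
        "The market opens at six in the morning.",
        "Please wait for me by the temple gate.",
    ],
    "translation_api": [
        "The market opens at 6 am.",
        "Wait me at temple gate.",
    ],
}

# chrF is a common choice for low-resource languages since it works at
# the character level and is less sensitive to tokenization choices.
for name, hyps in systems.items():
    chrf = sacrebleu.corpus_chrf(hyps, refs)
    print(f"{name}: chrF = {chrf.score:.1f}")
```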
There's a dropdown on Google Translate that lets you choose "Advanced" mode or "Classic" mode. Advanced mode uses Gemini but it's only available for select languages.
I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at datasets like Common Crawl, finepdfs, and fineweb, (1) they are missing quality data sources that are out there if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as being in a single language, whereas many language-learning sources have language pairs on the same page).
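To make the finepdfs point concrete: per-page language labels throw away exactly the pages that are most valuable for MT. A rough sketch of line-level language ID instead, assuming fastText's publicly released lid.176.bin model; the confidence threshold and length filter are arbitrary choices of mine, not anything finepdfs actually does:

```python
# pip install fasttext
# Assumes lid.176.bin (fastText's public language-ID model) is downloaded.
import fasttext

model = fasttext.load_model("lid.176.bin")

def page_languages(page_text: str, min_conf: float = 0.8) -> set[str]:
    """Return the set of languages detected line-by-line on a page."""
    langs = set()
    for line in page_text.splitlines():
        line = line.strip()
        if len(line) < 10:  # skip fragments too short to classify
            continue
        labels, probs = model.predict(line)
        if probs[0] >= min_conf:
            langs.add(labels[0].replace("__label__", ""))
    return langs

# A page from a language-learning book might come back as {"en", "km"},
# i.e. a candidate source of aligned pairs rather than one "English" page.
```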
Off topic, but since the AI craze, MS's documentation translation has had ridiculous errors, like translating the try/catch keywords to "versuchen" and "fangen" on German pages.
That's a high count, but still a fair way from "Omni". The usual count is between 4k and 8k languages, depending on the source. But the first 1k might be the hardest, certainly.
(Sorry I had to)
They can't achieve subject-verb agreement in the first sentence of their English abstract:
"Advances made through No Language Left Behind (NLLB) have demonstrated that high-quality machine translation (MT) scale to 200 languages."
Advances have demonstrated... The NLLB part is an adjectival(sic) phrase that modifies the noun "Advances".
Hopefully I'm not wrong...
Is it open weight? If so, why isn't there just a straight link to the models?
It looks like Meta found a way forward.
Reading Meta's abstract, it seems they have found ways to improve the quality of the training data, and also built new evaluation tools? They also say that OMT-LLaMA does a better job at text generation than other baseline models.
https://huggingface.co/spaces/facebook/bouquet
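On the open-weights question above: whether these new checkpoints get the same treatment remains to be seen, but Meta's earlier NLLB-200 release is directly usable from Hugging Face. A minimal sketch with the distilled 600M model; language codes follow the FLORES-200 scheme per the model card:

```python
# pip install transformers sentencepiece
from transformers import pipeline

# facebook/nllb-200-distilled-600M is Meta's earlier open-weight MT model;
# whether the new OMT-LLaMA weights ship the same way is the question above.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # English, Latin script
    tgt_lang="khm_Khmr",   # Khmer, Khmer script
)

print(translator("Hello from Siem Reap!")[0]["translation_text"])
```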