While open source initiatives are still ongoing, both Google and Microsoft have added India's Oriya language to their respective machine translation engines this year: Google Translate in February and Microsoft most recently on August 13.
Oriya is the official language of the Indian state of Odisha and the second official language of the state of Jharkhand. Some 35 million people are native speakers, and about four million have it as a second language. The Indian government has also classified it as one of the classical languages of the country, based on a set of requirements that includes a literary tradition of more than 1500 years.
However, Oriya's digital presence is limited. For example, the Oriya Wikipedia, one of the largest repositories of textual content in the language, currently has only 15,858 articles; it was revived in 2011 following a nine-year hiatus. In contrast, Malayalam, with almost the same number of speakers as Oriya, has about 70,000 articles on Wikipedia. For a long time, Oriya content was available online only as images and PDFs, and some publications, such as Utkal Prasanga magazine, run by the Odisha State Government, continue to publish in a combination of image and PDF. The late adoption of Unicode has made content less searchable.
Machine translation is a powerful tool for increasing the digital presence of a language: it makes content easier to search and more accessible to those who do not speak the language.
Microsoft's products and cloud services, including the Microsoft Translator app, Office, Translator for Bing, and the Azure Cognitive Services Translator, will now support Oriya text translation. Both Microsoft Translator and Google Translate (available on the web and as an app) let users paste copied text directly into the input field for translation.
These platforms also support the translation of text documents, websites, and live chats. The Google Translate mobile app has additional features, including offline translation, handwriting recognition, scanning and translating text from images and reading it aloud, and voice input for conversing with a speaker of another language. A feature called “tap to translate” lets the user translate written text directly within any application. You can also hear how a text is pronounced in a supported language with Google's speech synthesis.
The addition of Oriya was welcomed by the Odisha State Government. The Odisha Chief Minister's Office tweeted:
#Odia translation has now been added by @Microsoft to its @mstranslator, becoming the 12th commonly used Indian language to be added. This will facilitate access of global information in #Odia and promote inter-language interactions. https://t.co/O4dZgZhbrs
– CMO Odisha (@CMO_Odisha) August 17, 2020
Oriya text translation is now available in Microsoft Translator.
Today, we are pleased to announce that we have added Oriya text translation to Microsoft Translator. Oriya is available now, or will be soon, in the Microsoft Translator app, Office, Translator for Bing, and through the Azure Cognitive Services Translator for businesses and developers.
Microsoft has added Oriya translation to its translator, and it becomes the 12th commonly used Indian language to be added. This will facilitate access to global information in Oriya and promote cross-language interactions.
The Department of Electronics and Information Technology of the Government of Odisha also reacted:
Used by millions across the world, @Google Translate has now added #Odia to its list of supported languages. A major step towards promoting digital literacy in our native language & to help millions of non-speakers embrace it. #OdiaOnGoogle @CMO_Odisha https://t.co/lfSskvxSjR
– E&IT Department Odisha (@EIT_Odisha) February 28, 2020
Google Translate adds five languages.
With millions of users around the world, Google Translate has now added Oriya to its list of available languages. A big step to promote digital literacy in our mother tongue and to help millions of non-speakers adopt it.
With the inclusion of Oriya, Google Translate and Microsoft Translator now support 11 Indian languages each. In total, Google supports 109 world languages while Microsoft supports 73.
In the meantime, open source initiatives have yet to produce a successful Oriya machine translation project.
There is at least one community open source project in development: MTEnglish2Odia is training a machine translation engine by collecting translation pairs from existing sources, such as the Oriya Wikipedia, and from crowdsourced contributions by users on Twitter.
There is also some research, along with other resources, that other organizations can use to build machine translation engines.
The politics of machine translation
The technology used by Google Translate or Microsoft Translator is complex from a social, legal, ethical and rights perspective.
A machine translation platform can be very useful to many people, such as journalists who need quick access to news in many languages or students who want to learn from multilingual resources.
Similarly, support for speech synthesis helps people with disabilities, especially those who are blind, to access and disseminate information more easily.
Education, the media and the entertainment industry also benefit from the potential of Google Translate to translate large amounts of content quickly.
On the other hand, machine translation can help spread misinformation, while voice synthesis makes it easier for scammers to target people with messages in their own language.
There are over 6,000 documented languages around the world, and only a minority have established writing systems. It is from this minority that machine translation projects like Google Translate and Microsoft Translator draw the languages they include.
The availability of content online, and the number of Internet users who speak a certain language, are important factors that for-profit companies take into account when deciding which languages to include in their systems. The more languages a corporation supports, the more targeted content it can offer to users and the more revenue it generates from advertising.
Additionally, there are ethical issues around attribution and remuneration in projects like Google Translate, which relies on a community of contributors to review existing translations (which in turn helps engineers improve the tool frequently).
Although Google is a for-profit company with many paid products – including a cloud translation service – neither individual volunteers nor the many public sources that the machine learns from receive attribution or remuneration.
The use of private communications to improve machine learning and artificial intelligence is also controversial from a privacy point of view, although Google has been working to anonymize that data.
In a country like India, where multilingual content creation faces cost bottlenecks, products like Google Translate and Microsoft Translator can revolutionize the Indian content economy. They can make a difference in projects like Wikipedia, which currently exists in 23 Indian languages, or StoryWeaver, a multilingual online children's literature platform that relies heavily on volunteer work.
As many Indian languages are rapidly disappearing, and with the added challenges of illiteracy and limited digital accessibility, communication in these languages will need further innovation in voice and visual technology. Machine translation may be a viable tool to halt the extinction of languages, but in India it still has a long way to go.
Disclaimer: The author has been involved with the Oriya Wikipedia as a volunteer since 2011 and with MTEnglish2Odia since its early stages.