AI to learn local Tenglish to fit in

Hyderabad: In multilingual societies such as India, people tend to mix languages. Urban Indian speakers often communicate in Hinglish or Tenglish (blends of Hindi or Telugu with English), as in “Sui Dhaaga movie lo hero evaru” (“Who is the hero in the movie Sui Dhaaga?”).

This tendency to mix languages carries over into conversations on digital and social media platforms. While humans interpret such mixed speech easily, it is a Herculean task for artificial intelligence (AI), which must handle several languages at once. Researchers from IIIT-Hyderabad, Microsoft and Carnegie Mellon University have developed a system for this code-mix, called WebShodh.

Prof. Manish Shrivastava of IIIT-Hyderabad said, “It is a natural evolution of languages to borrow words and the speaking patterns have phrases from other languages, often noticed among urban Indian language users. We see that in every region of India, specially cities. We switch between our native language and English typically, largely because our education medium is English. Often two languages are mixed; like in Telangana we mix Deccani and Telugu. This is challenging and an important research problem we need to tackle.”

Rather than building an AI system for perfect Hindi, researchers are developing one for Hinglish, since the need exists now and will continue in the future. Code-mixed phrases and sentences lack a clearly defined structure; they are free-flowing and casual, relying on the common heritage of the two speakers.

Since there is no dictionary for Hinglish or Tenglish, the data was collected from social media conversations, where such languages are used more frequently. Recreating this through algorithms is a challenge. Prof. Shrivastava said, “Natural language processing depends largely on availability of annotated data which has been marked by experts. For Indian languages, this annotated data is difficult to find. With code-mix, the problem is even more as it cannot be brought from formal sources. We have to go through social media and other open platforms where comments are public.”

Around 3-4 lakh datasets were created for Hinglish and Tenglish. Dr Manoj Chinnakotla of Microsoft, a visiting faculty member at IIIT-H, said, “Today’s multilingual societies require software which supports interaction in code-mix languages. WebShodh is a testament that despite severe constraint on resources, AI based systems could still be built for code-mix languages. WebShodh currently uses very few resources such as bi-lingual dictionaries. Through more online user interactions, WebShodh has the potential to collect more user data using which the system could be re-trained to further boost accuracy of results.”
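
To give a rough sense of how a small bilingual word list can help a program recognise code-mixed text, here is a minimal Python sketch. It is not WebShodh's actual method, which the article does not describe; the word lists, the function names tag_tokens and is_code_mixed, and the "te"/"en"/"unk" labels are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy lexicons (assumptions for illustration only);
# a real system would need far larger dictionaries and annotated data.
TELUGU_WORDS = {"lo", "evaru", "ela", "enti", "cheppu"}
ENGLISH_WORDS = {"movie", "hero", "who", "is", "the", "in"}


def tag_tokens(sentence):
    """Label each whitespace-separated word as Telugu, English, or unknown."""
    tags = []
    for token in sentence.lower().split():
        if token in TELUGU_WORDS:
            tags.append((token, "te"))
        elif token in ENGLISH_WORDS:
            tags.append((token, "en"))
        else:
            # Names and out-of-dictionary words fall through to "unk".
            tags.append((token, "unk"))
    return tags


def is_code_mixed(tags):
    """Return True if words from more than one known language appear."""
    languages = {lang for _, lang in tags if lang != "unk"}
    return len(languages) > 1


tags = tag_tokens("Sui Dhaaga movie lo hero evaru")
print(tags)
print(is_code_mixed(tags))
```

On the example sentence from this report, the sketch marks “movie” and “hero” as English and “lo” and “evaru” as Telugu, and leaves the film's name unlabelled. A plain lookup like this fails on spelling variants and on words shared by both languages, which is one reason annotated social media data matters.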

