Facebook auto-filter in just 4 Indian languages
Hyderabad: Between July and September 2019, Facebook removed as many as 7 million posts that contained “hate speech” across the world. More than 80 per cent of these posts were detected not by human users but by computer algorithms designed for this purpose. In the past couple of years, social media giants such as Facebook and Twitter have been relying on artificial intelligence to detect hate speech. However, for a country such as India, with all its diversity, the system is far from perfect.
Facebook, in an interview with Time magazine in November 2019, confirmed that it had automatic hate speech detection systems in place for only four Indian languages: Bengali, Urdu, Hindi and Tamil. Across the world, such systems are active in around 40 languages. Other Indian languages, the users of which number in crores, are outside the systems’ purview. For posts written in these languages, Facebook depends primarily on full-time content moderators. These moderators, however, do not scan all posts; they review only those flagged by other users as offensive or hate speech. In theory, a closed Facebook group could propagate hate speech through posts written in Assamese, Telugu or Gujarati unchecked. As long as no one reports these posts, Facebook would take no action.
So why is hate speech moderation in Indian languages so difficult? The answer lies in the shortage of data, or rather the shortage of “annotated data”. Hate speech detection models rely on deep learning algorithms that learn to detect abusive content from historical data sets. An algorithm is trained on a data set of tweets or Facebook posts that have already been labelled as hate speech; this gives the algorithm its working definition of hate speech for reference. In Indian languages, however, far too little data has been labelled this way. Vandan Mujadia, a researcher at the Language Technologies Research Centre (LTRC) at IIIT Hyderabad, whose team has worked on the subject extensively, said, “One of our biggest challenges is to find the right data sets to work on. Lakhs of posts are made in languages like Telugu and Tamil every day, so there is no dearth of raw content. However, this content needs to be sifted through, and there aren’t many people who can do this. Identifying hate speech is complex, and not everyone can do that. Once such posts are identified by people who are well versed in regional languages, they can annotate them with comments on what makes these posts offensive. After that, the posts can be fed into the algorithm to make it work.”
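The annotate-then-train pipeline Mujadia describes can be sketched with a toy text classifier. The snippet below is a minimal, illustrative Naive Bayes model trained on a handful of invented, labelled posts; real systems use far larger annotated corpora and deep learning models, and the words and labels here are placeholders, not actual training data.

```python
import math
from collections import Counter, defaultdict

def train(labelled_posts):
    """Build a toy multinomial Naive Bayes model from (text, label) pairs.

    Labels here are 'hate' or 'ok'; tokens are whitespace-split words.
    """
    word_counts = defaultdict(Counter)  # label -> word frequency table
    label_counts = Counter()            # label -> number of posts
    for text, label in labelled_posts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the most probable label, using Laplace (add-one) smoothing."""
    word_counts, label_counts, vocab = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_posts)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented stand-ins for an annotated data set
data = [
    ("they are vermin and must go", "hate"),
    ("those people are vermin", "hate"),
    ("lovely weather in hyderabad today", "ok"),
    ("match was great fun", "ok"),
]
model = train(data)
print(classify("these vermin must go", model))  # -> "hate" on this toy data
```

The point of the sketch is the dependency Mujadia highlights: without the labelled pairs in `data`, there is nothing for `train` to learn from, no matter how much raw text exists.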
As of now, the research centre has been depending on volunteers who look through flagged hate speech content to annotate it. But the sheer amount of data circulated every day makes the job ever more difficult, and the mobile data revolution after the launch of Reliance’s Jio has only added to the volume. Since the field is still in its nascent stages, there are not many in the country actively researching it. Another problem in India has been the multilingual nature of many internet posts. People often tweet using words from multiple languages, and in such cases the algorithm can struggle to understand the post’s context.
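The code-mixing problem can be illustrated with an even simpler toy: a filter built around one language's vocabulary misses the same idea when part of the sentence is transliterated from another language. The word lists below are invented placeholders, not any platform's actual filter.

```python
# Toy monolingual keyword filter (invented word list, for illustration only)
english_flags = {"vermin", "traitors"}

def flags_post(text, flagged_words):
    """Return True if any whitespace-split token appears in the flag list."""
    return any(word in flagged_words for word in text.lower().split())

# A purely English post is caught by the English word list:
print(flags_post("those traitors must go", english_flags))              # True
# A code-mixed Hindi-English post expressing the same idea, with the key
# word transliterated ("gaddar" = traitor), slips past unrecognised:
print(flags_post("ye log gaddar hain, throw them out", english_flags))  # False
```

Real detectors are far more sophisticated than a keyword list, but they face the same underlying issue: tokens from a language the model was not trained on are effectively out-of-vocabulary, so mixed-language posts lose the context the model needs.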