Making Web Search Safer for Everyone

Author(s): City Air News

(L-R) Manoj Chinnakotla and Harish Yenala receiving the award at the PAKDD 2017 Conference on 25th May in Jeju Island, South Korea.

In this day and age, when we depend on web search for every small task or decision, the autocomplete feature of a search engine plays an all-important role. Autocomplete is a feature built into a website's search box that completes a word as you type and also offers search suggestions based on related search terms. It has several advantages: it helps you avoid spelling errors, drastically reduces the number of keystrokes, and helps you come up with better search queries.

So how are these auto-suggestions generated? They are drawn from queries issued by people in and around your location, such as your city or country, and may also be based on what you have searched for before. In short, auto-suggestions are a reflection of the society we live in. And like everything else in society, that reflection can at times be negative. Suggestions can be inappropriate: they may cause anger and annoyance to a segment of users, show a lack of respect, or be rude and discourteous towards others. Sometimes they may even help people inflict harm on themselves or others. For example, a child searching for “kite flying” who enters the prefix “ki” could be shown the suggestion “killing people”.
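As a rough illustration of this idea, the sketch below ranks logged queries that start with the typed prefix by how often they were issued. The query log, its counts, and the function name `suggest` are invented for this example; a real search engine would also weigh location, personal history and many other signals.

```python
from collections import Counter

# Hypothetical query log: each entry is one query issued by some user.
QUERY_LOG = [
    "kite flying", "kite flying tips", "kite flying",
    "killing people", "killing people", "killing people",
    "kitchen design",
]

def suggest(prefix, log=QUERY_LOG, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    counts = Counter(q for q in log if q.startswith(prefix.lower()))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:k]]

print(suggest("ki"))
```

Because the ranking only mirrors what people typed, the most frequent query in this toy log, “killing people”, comes out on top for the prefix “ki”, which is exactly the kind of suggestion the article is concerned about.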

There have been multiple instances where suggestions have been found to be offensive, objectionable, sexual, racist or violent, or have been deliberately misused to malign people, leading to legal complications with authorities and regulators and tarnishing a brand’s image. The importance of safeguarding vulnerable groups, such as children and marginalised communities, and of keeping the Internet open and safely accessible to everyone cannot be overemphasised.

Until now, researchers have relied on conventional solutions: manually curated lists of patterns covering offensive words, phrases and slang; classical machine learning (ML) techniques that learn an intent classifier from hand-crafted features (typically words); and standard off-the-shelf deep learning architectures such as CNNs, LSTMs or Bi-directional LSTMs (BLSTMs).
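To make the first of these conventional approaches concrete, here is a minimal sketch of a pattern-based filter. The patterns and the function name `is_blocked` are hypothetical and chosen only for illustration; they are not taken from the paper or from any real search engine.

```python
import re

# Hypothetical, hand-curated patterns; real lists would be far longer.
BLOCKED_PATTERNS = [
    re.compile(r"\bkill(ing|s|ed)?\b", re.IGNORECASE),
    re.compile(r"\bhow to hurt\b", re.IGNORECASE),
]

def is_blocked(query):
    """Return True if any curated pattern matches the query."""
    return any(p.search(query) for p in BLOCKED_PATTERNS)

for q in ["killing people", "kite flying", "how to kill weeds", "k1lling people"]:
    print(q, "->", "blocked" if is_blocked(q) else "allowed")
```

In this toy run the harmless gardening query is blocked while the deliberately misspelled one slips through, which hints at why hand-maintained lists can be brittle.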

A paper by Harish Yenala (M.S. by Research student at IIIT Hyderabad), Dr. Manoj Chinnakotla (Adjunct Professor, IIIT Hyderabad and Senior Applied Scientist, Hyderabad), and Jay Goyal (Principal Development Manager, Microsoft, India) proposes an interesting and promising technique for automatically identifying such inappropriate query suggestions. Standing out among more than 450 submissions, their paper, titled “Convolutional Bi-Directional LSTM for Detecting Inappropriate Query Suggestions in Web Search”, received the “Best Paper Award” at the recent Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2017.

The technique proposed in the paper is based on a new field of computer science research known as Deep Learning (DL), which aims to build machines that process data and learn in a way modelled on the human brain. DL essentially involves building artificial neural networks that are trained to mimic the behavior of the brain. These networks can learn to represent and reason over the various inputs given to them, such as words, images and sounds.

The DL architecture that the team proposes is called “Convolutional Bi-Directional LSTM (C-BiLSTM)” and combines the strengths of Convolutional Neural Networks (CNNs) and Bi-directional LSTMs (BLSTMs). Given a query, C-BiLSTM uses a convolutional layer to extract a feature representation for each query word, and these representations are fed as input to the BLSTM layer. The BLSTM layer captures the various sequential patterns in the query and outputs a richer representation that encodes them. This richer query representation then passes through a fully connected network that predicts the target class for the query suggestion.
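The data flow described above (convolution over the query words, then a bidirectional LSTM, then a fully connected layer) can be sketched as a plain-Python forward pass. This is not the authors' implementation: the layer sizes, the window width of three words, the ReLU activation, the use of the final LSTM states, the sigmoid output and the omission of bias terms are all assumptions made to keep the example short, and the weights are random and untrained, so the score it prints carries no meaning.

```python
import math
import random

EMBED_DIM, NUM_FILTERS, HIDDEN, WIDTH = 8, 6, 5, 3  # illustrative sizes only
rng = random.Random(0)  # fixed seed so the untrained weights are reproducible

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

EMBEDDINGS = {}  # word -> random vector, created on first use
CONV_W = rand_matrix(NUM_FILTERS, WIDTH * EMBED_DIM)
LSTM_FWD_W = rand_matrix(4 * HIDDEN, NUM_FILTERS + HIDDEN)
LSTM_BWD_W = rand_matrix(4 * HIDDEN, NUM_FILTERS + HIDDEN)
FC_W = rand_matrix(1, 2 * HIDDEN)

def embed(word):
    if word not in EMBEDDINGS:
        EMBEDDINGS[word] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return EMBEDDINGS[word]

def conv_layer(vectors):
    """One ReLU feature vector per word, computed from a window of WIDTH words."""
    pad = [[0.0] * EMBED_DIM] * (WIDTH // 2)  # zero-pad both ends of the query
    padded = pad + vectors + pad
    features = []
    for pos in range(len(vectors)):
        window = [x for vec in padded[pos:pos + WIDTH] for x in vec]
        features.append([max(0.0, a) for a in matvec(CONV_W, window)])
    return features

def lstm_last_state(sequence, weights):
    """Run a bias-free LSTM over the sequence and return its final hidden state."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for x in sequence:
        z = matvec(weights, x + h)  # all four gates computed from [input, state]
        i, f, o, g = (z[k * HIDDEN:(k + 1) * HIDDEN] for k in range(4))
        c = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
             for ij, fj, gj, cj in zip(i, f, g, c)]
        h = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c)]
    return h

def score(query):
    """Map a query to a number in (0, 1); meaningless until the weights are trained."""
    vectors = [embed(w) for w in query.lower().split()]
    features = conv_layer(vectors)                          # convolutional layer
    forward = lstm_last_state(features, LSTM_FWD_W)         # left-to-right pass
    backward = lstm_last_state(features[::-1], LSTM_BWD_W)  # right-to-left pass
    return sigmoid(matvec(FC_W, forward + backward)[0])     # fully connected layer

print(round(score("kite flying tips"), 3))
```

In a trained model, the weights would be learned jointly from labelled queries, and the final score would be compared against a threshold to decide whether a suggestion is inappropriate.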

Advantages of C-BiLSTM include that it does not rely on hand-crafted features, is trained end-to-end as a single model, and effectively captures both local and global semantics. The team also evaluated C-BiLSTM on real-world search queries from a commercial search engine, and the results showed that it significantly outperformed both pattern-based baselines and baselines built on hand-crafted features. C-BiLSTM also performed better than individual CNN, LSTM and BLSTM models trained for the same task.

Although the paper focuses on detecting offensive terminology in Query Auto Completion (QAC) in search engines, the team is confident that the technique will also be highly effective on other online platforms such as chatbots and autonomous virtual assistants, helping them become more contextually aware, culturally sensitive, and dignified in their responses. Using APIs based on this system to parse text on social media platforms, email services, chat rooms, discussion forums, and search engines would lead to a safer and more secure web.

Date: Thursday, August 17, 2017