Google now understands even Indian languages written in Roman script and can figure out which language to surface for which query, the culmination of work that started a long time ago. Pandu Nayak knows all the effort that has gone in: the Google Fellow and VP – Search has been on the job for 16 years.
“Language is at the very heart of what we do, and you can see that in the evolution of search from the early days,” he explains on a video call. “One of the first things we did that was language oriented was spelling correction.” While that might seem very simple now, Nayak, who is part of Google’s core Search leadership team, clarifies that “having a clever way to spell actually requires you to understand language”.
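The idea behind spelling correction can be illustrated with a toy sketch in the spirit of the classic noisy-channel approach. The vocabulary and word counts below are invented for illustration; Google's actual system is far more sophisticated and learned from query data.

```python
# Toy spelling corrector: return the query word if it is known, otherwise the
# most frequent vocabulary word within one edit (delete, swap, replace, insert).
# Vocabulary and counts are made up for this sketch.
VOCAB = {"search": 1000, "speech": 400, "sole": 120, "soul": 300}

def edits1(word):
    """All strings exactly one edit away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the most frequent in-vocabulary word one edit away, if any."""
    if word in VOCAB:
        return word
    candidates = edits1(word) & VOCAB.keys()
    return max(candidates, key=VOCAB.get) if candidates else word

# correct("serch") -> "search"
```

Even this crude version shows why spelling "requires you to understand language": ranking the candidates needs a model of which words users actually mean.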
“Subsequently, the next big innovation in language was around synonyms… the idea that words mean different things in different contexts,” he continues. Then came the phase of language understanding, where the engine started to grasp the difference between the ‘sole’ of a shoe and the ‘sole’ fish. While the technology has progressed, Nayak accepts that they “didn’t quite get it right”, as you will still see some results for the sole of a shoe when someone is actually searching for the fish.
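The shoe-versus-fish problem Nayak describes is word-sense disambiguation. A minimal sketch in the spirit of the classic Lesk algorithm picks the sense whose signature words overlap most with the rest of the query; the senses and signature words here are invented, not drawn from any real system.

```python
# Toy word-sense disambiguation: score each candidate sense of "sole" by how
# many of its signature words appear in the query, and pick the best match.
# Sense labels and signature sets are invented for illustration.
SENSES = {
    "sole(shoe)": {"shoe", "leather", "rubber", "heel", "repair"},
    "sole(fish)": {"fish", "recipe", "grilled", "fillet", "sea"},
}

def disambiguate(query):
    """Return the sense label with the largest overlap with the query words."""
    context = set(query.lower().split())
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

# disambiguate("grilled sole recipe") -> "sole(fish)"
```

The sketch also shows why it is hard to get "quite right": a query like plain "sole" has no disambiguating context at all, which is exactly when mixed results appear.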
“You get all these really interesting phenomena with language. Because language is this sort of complex thing… it is very subtle, nuanced and whatnot,” he tells indianexpress.com, adding that the more recent innovations have been around machine learning and deep learning, “sort of quantum leaps forward in understanding sentences, natural language and conversations”.
“If you don’t do a good job, people are not going to use your product, as language is so central to us,” he adds. “Getting that right is essential to being successful with your users.”
Then there was the problem of taking these learnings to other languages, what Google calls localisation. But the statistical approach made this a bit easier. “Fundamentally, the techniques we use tend to be statistical techniques that look at sort of large scale statistics of language and language usage… it’s not that we learn English grammar and then Hindi grammar and so forth,” he says, adding that since the underlying techniques are statistical in nature, they can generalise very easily to many different languages. “As long as you have the right training data in terms of documents in that language and so forth. And by having the right training data, you can learn the right sensibilities.”
But that is just one part. There was then the challenge of having to nail peculiar problems of different languages. “Many East Asian languages like Chinese, Japanese and Korean, have a problem of segmentation. Each character is really like a word… and you have to figure out a way to segment it. So you have special algorithms to do that,” Nayak explains of how they tackled the problem of segmenting scripts such as kanji.
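One common baseline for the segmentation problem Nayak describes is greedy longest-match against a dictionary. The mini-dictionary below is invented; production systems use statistical models over far larger lexicons, but the sketch shows what "figuring out a way to segment it" means.

```python
# Toy greedy longest-match segmenter for text written without spaces.
# At each position, take the longest dictionary word that matches; fall back
# to a single character when nothing matches. The dictionary is invented.
DICTIONARY = {"北京", "大学", "生", "大学生"}

def segment(text):
    """Split unspaced text into dictionary words, longest match first."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:                               # no dictionary word starts here
            words.append(text[i])
            i += 1
    return words

# segment("北京大学生") -> ["北京", "大学生"]
```

Note the built-in ambiguity: the same string could also be read as 北京大学 + 生, which is why real segmenters score alternative splits statistically rather than committing greedily.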
The German style of putting nouns together to form compound nouns presented a different problem. “To really understand the language, you have to learn how to de-compound it. So you need some special techniques.”
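Decompounding can likewise be sketched as a recursive dictionary split. The parts list below is a tiny invented stand-in for a real German lexicon, and the handling of the linking "s" (the Fugen-s, as in Staat+s+angehörigkeit) is deliberately simplistic.

```python
# Toy German decompounder: recursively split a compound into known parts,
# optionally skipping a linking "s" between parts. The parts list is invented.
PARTS = {"hand", "schuh", "staat", "angehörigkeit"}

def decompound(word):
    """Return the dictionary parts the compound splits into, or [word] if none."""
    word = word.lower()
    if word in PARTS:
        return [word]
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in PARTS:
            # try the tail as-is, then with a linking 's' dropped
            for rest in (tail, tail[1:] if tail.startswith("s") else None):
                if rest:
                    sub = decompound(rest)
                    if all(p in PARTS for p in sub):
                        return [head] + sub
    return [word]

# decompound("Handschuh") -> ["hand", "schuh"]
```

For search, splitting Handschuh into hand + schuh is what lets a query for one part match documents containing the compound.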
Then in India, there was transliteration, particularly with people writing Hindi in the Roman script. “You need some special processing to handle transliteration right, so you can get what is actually being said.”
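A rule-based sketch of that "special processing" maps Roman syllables to Devanagari with greedy longest match. The syllable table below is a tiny hand-written illustration; real transliteration systems are learned from data and must resolve far more ambiguity than this.

```python
# Toy Roman-to-Devanagari transliterator: greedily match the longest known
# romanized chunk and emit its Devanagari equivalent. The syllable table is
# invented for this sketch and covers only the two example words.
SYLLABLES = {"na": "न", "ma": "म", "s": "स्", "te": "ते",
             "di": "दि", "l": "ल्", "li": "ली"}

def transliterate(roman):
    """Convert a romanized Hindi word to Devanagari, longest chunk first."""
    out, i = [], 0
    while i < len(roman):
        for j in range(min(len(roman), i + 3), i, -1):
            if roman[i:j] in SYLLABLES:
                out.append(SYLLABLES[roman[i:j]])
                i = j
                break
        else:
            out.append(roman[i])  # unknown chunk: pass it through unchanged
            i += 1
    return "".join(out)

# transliterate("namaste") -> "नमस्ते"
```

The hard part hinted at here is that Roman spellings of Hindi are not standardised, so the same word arrives in many romanizations, which is why learned models beat fixed tables.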
While it is still quite tough to type and search in regional Indian languages, Nayak notes that “having speech recognition as a mode of operation is incredibly valuable”. And that is why, he says, Google has a “big investment in improving speech recognition for Indian languages” and in “getting the right data for training data, the right algorithms in place”.
The work that has already gone in has started showing some results. “They work reasonably well but I think we want to make it much better so that it becomes really easy to do this without error.”
“The other strategy is that we can take your English query and, depending on the query, we either translate it or we transliterate it to your regional language and show results for that. So now it’s just a tap to flip over to those results and see whether those are better for you or not,” he says of the latest feature Google is rolling out in India.
“You don’t have to type those things in, but we try and guess using translation, which is another one of these sorts of techniques that has come a long way,” he says, explaining how this helps solve the problem of input in Indian languages. Nayak says Google saw a dramatic increase in Hindi search traffic once some of these features launched.
Also, as contextualisation is a crucial part of voice inputs, Nayak says it is definitely coming to these languages, just as it is available in English now. “I don’t think that’s like a long term problem. I think that’s just a matter of time.”
On improving the search experience in other Indian languages, Nayak says it is a “combination of different factors, understanding language is part of it”. He says that even as Google makes significant progress, there is less content available to train algorithms than there is in English. “Working with the ecosystem to build out that content is something that we are sort of keenly interested in. We’re also looking to jump start with translation for example,” he says, adding that, after all, “what makes search great is the content”.
“Admittedly, there is a lot more to be done. And I think we have a substantial commitment to India and the Indian market and I’m very hopeful that things are going to be even better in the future. And I’m convinced that actually solving these problems will really help users.”
Nayak, who graduated from IIT-Bombay and went on to earn a PhD in Computer Science from Stanford University, leads Google’s ranking teams and is particularly interested in language understanding as it applies to Search. As Adjunct Professor in Stanford University’s Computer Science department, he taught Information Retrieval with Chris Manning and also taught a class on Reasoning Methods in Artificial Intelligence. He also worked as CTO of Stratify, Inc after a stint at NASA Ames Research Center, where he worked on the Remote Agent project, the first Artificial Intelligence system to be given primary command of a spacecraft.