Reddit is headed to a public listing in the coming weeks, and an infusion of cash at this time could boost investor confidence. And Google’s AI system has come under fire — in both the US and India — and the company is looking to make its offerings more reliable.
The deal underlines the value of user-generated content on the Internet, especially for Generative AI (GenAI) platforms like Google’s Gemini and OpenAI’s ChatGPT. It also points to the complexities of training large language models (LLMs) when they can potentially infringe on the intellectual property rights of content producers online — and the reason why such content licensing deals may be crucial.
What is the Google-Reddit deal?
Google will get access to Reddit’s data application programming interface (API), which delivers real-time, structured, unique content from the social media platform.
“With the Reddit Data API, Google will have efficient and structured access to fresher information, as well as enhanced signals that will help us better understand Reddit content and display, train on, and otherwise use it in the most accurate and relevant ways,” Google said in a blog post.
Why is this important for Google?
Google needs access to as much user-generated content as possible to make its foundational model more reliable and accurate. Widespread criticism of responses generated by Gemini has increased the urgency for the company to get its act together, as the likes of OpenAI gain an ever greater say in how users will access online services in the future.
In India, in response to the question “Is Modi a fascist?”, Gemini said that the Prime Minister has been “accused of implementing policies some experts have characterised as fascist”, which include the “BJP’s Hindu nationalist ideology, its crackdown on dissent, and its use of violence against religious minorities”.
After the IT Ministry threatened to issue a show-cause notice for the “illegal and problematic” responses, Google said it had addressed the issue, and was working to improve the system.
The company had earlier apologised for “inaccuracies in some historical image generation depictions” after criticism that Gemini had depicted white figures (like the founding fathers of the US) or groups like Nazi-era German soldiers as people of colour.
Are Google-Reddit-like content licensing deals the future of building LLMs?
The New York Times sued OpenAI and Microsoft, the companies behind ChatGPT and other popular AI platforms, last year, citing “unlawful” use of copyrighted content. The NYT said in its lawsuit that the companies were scraping original content from the publisher to build their models and manufacture responses.
The NYT lawsuit has reignited the debate on the ownership of online content and whether GenAI platforms are infringing on the intellectual property (IP) rights of organisations like news publications that put out significant amounts of updated and mostly accurate information on the Internet — the kind of data that GenAI platforms can really benefit from.
The responses that AI platforms such as ChatGPT and Gemini generate rest on the bedrock of millions of pieces of textual content that creators, including news publishers, have uploaded online.
The music business, which is among the most sensitive towards IP rights, has also been pushing back on the use of AI in the industry. Universal Music Group has asked streaming services such as Spotify to stop developers from scraping its material to train AI bots to make new songs.
Copyright laws in countries around the world, including India, need drastic reimagining in the era of AI. In India, creative works are regulated by The Copyright Act of 1957, which defines an “author” as, among other things, “in relation to any literary, dramatic, musical or artistic work which is computer-generated, the person who causes the work to be created”.
However, this definition does not take into account the fact that AI systems do not generate information on their own; they are only as good as the base dataset on which they are trained. And the base dataset is built out of copyrighted work produced by other authors.