This is an archive article published on December 9, 2023

Google Gemini is no match for GPT-4: Fake demo video, shaky MMLU benchmark draw backlash

The demo video of Gemini has raised a few eyebrows on the internet. While Google has responded, there seems to be more to Gemini.

Google Gemini, which was released in three sizes, boasts exceptional multimodal capabilities. (Image: Google)

As the world continues to marvel at Google’s latest creation, Gemini AI, dubbed a rival to OpenAI’s ChatGPT, trouble seems to be brewing for one of the biggest tech companies in the world. Google introduced Gemini in three sizes: Ultra, Pro, and Nano. Ultra is the most powerful of the three and reportedly outperformed GPT-4 on numerous benchmarks.

Bindu Reddy, CEO of AbacusAI, took to her X profile to share her observations. “Digging deeper into the MMLU Gemini Beat – Gemini doesn’t really Beat GPT-4 On This Key Benchmark,” she wrote in a long post explaining why Ultra was not as good as claimed.

During the launch, Google listed various benchmarks on which Gemini models matched or even outperformed GPT-4. These included Massive Multitask Language Understanding (MMLU), a widely used benchmark that assesses an AI model’s knowledge across a broad range of academic disciplines such as STEM, social science, math, and humanities.

The research paper shared by Google showed that the Ultra version surpassed both GPT-4 and GPT-3.5 on MMLU. A closer inspection, however, reveals a key technical detail. According to Reddy’s tweet, Google reported Gemini’s headline score using CoT@32 rather than the standard 5-shot setup, which raised Gemini’s result.

“The Gemini MMLU beat is specifically at CoT@32. GPT-4 still beats Gemini for the standard 5-shot – 86.4% vs. 83.7 per cent,” Reddy wrote in her tweet. According to the CEO, 5-shot is the standard way to evaluate this benchmark: five worked examples are prepended to the prompt.

CoT stands for chain-of-thought prompting, in which the model is prompted to write out a series of intermediate reasoning steps before giving its final answer. CoT is aimed at improving the model’s multi-step reasoning capabilities, and the “@32” indicates that 32 such responses are sampled for each question. Meanwhile, 5-shot prompting is when an AI model is shown five worked examples in the prompt ahead of the question it must answer. The model is not retrained on these examples; it is expected to recognise the pattern from them.
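To make the 5-shot setup concrete, here is a minimal sketch of how such a prompt could be assembled. The function name, the dictionary fields and the "Answer:" layout are illustrative assumptions, not the format used by Google, OpenAI or the MMLU authors.

```python
def build_few_shot_prompt(examples, question, choices):
    """Prepend solved examples to one unsolved question (k-shot prompting).

    `examples` is a list of dicts with hypothetical keys
    "question", "choices" and "answer" (a letter such as "B").
    """
    letters = "ABCD"
    blocks = []
    # Each solved example shows the question, its options and the answer.
    for ex in examples:
        lines = [ex["question"]]
        lines += [f"{letters[i]}. {c}" for i, c in enumerate(ex["choices"])]
        lines.append(f"Answer: {ex['answer']}")
        blocks.append("\n".join(lines))
    # The final block leaves the answer blank for the model to complete.
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Passing five solved examples gives a 5-shot prompt. The model's weights never change; the examples only occupy space in the prompt.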

Reddy, in her tweet, claimed that Google invented a different methodology around CoT@32 to claim that Gemini is far better than GPT-4. The former AWS and Google employee said that Gemini beats GPT-4 at CoT@32 only once “uncertainty routing” is added. Other users on X have flagged similar concerns with the benchmarking.
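The article does not spell out what “uncertainty routing” involves. One reading, assumed here, is that the model samples 32 chain-of-thought answers, takes the majority answer if enough samples agree, and otherwise falls back to a single greedy answer. The sketch below illustrates that idea only; the function name and the 0.6 threshold are hypothetical, not values taken from Google.

```python
from collections import Counter


def uncertainty_routed_answer(sampled_answers, greedy_answer, threshold=0.6):
    """Pick the majority answer when agreement is high, else the greedy one.

    `sampled_answers` holds the final answers extracted from the sampled
    chain-of-thought responses (32 of them for CoT@32). `threshold` is an
    assumed consensus cut-off, chosen here purely for illustration.
    """
    if not sampled_answers:
        return greedy_answer
    top_answer, top_count = Counter(sampled_answers).most_common(1)[0]
    agreement = top_count / len(sampled_answers)
    # High agreement: trust the vote. Low agreement: defer to greedy decoding.
    return top_answer if agreement >= threshold else greedy_answer
```

Under this reading, a CoT@32 score and a 5-shot score measure different things, which is why Reddy argues that the two numbers should not be compared directly.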

The Gemini video isn’t all that real

While unveiling Gemini, Google showed a video demonstrating its multimodal and reasoning capabilities. Since the launch, there have been numerous reports that the video was not recorded in real time. Clint Ehrlich, who is an attorney and computer scientist, as per his X bio, shared a detailed tweet in which he claimed that the video demo of Gemini was fake.

According to Ehrlich, three things about the video excited viewers: Gemini processed video and not just still images, it inferred context without being spoon-fed prompts, and it seamlessly spoke and understood conversational audio. Ehrlich said that none of these three facets were real.

In his thread, Ehrlich said that Gemini did not process video but still images, that it required detailed prompting, and that it communicates best with written prompts and not audio. “You won’t get any of this from the viral video, but it’s spelled out in Google’s documentation for developers,” Ehrlich said. He went on to debunk the sleight-of-hand trick performed with a coin, the geography quiz, and the ball-and-cup shuffling game shown in the video.

In his post, Ehrlich asked whether Google broke the law by showing a fake video, citing Federal Trade Commission (FTC) standards. “Under FTC standards, if a disclaimer is necessary to prevent an ad from being misleading, it must appear *in the ad.* A separate blog post doesn’t cut it,” he said in his tweet.

Claims of deceptions

According to a report in Bloomberg Opinion, Gemini’s actual output is much slower than what the demo shows. The report said that although the video carried a disclaimer saying the responses had been sped up, that was not the biggest deception. According to the report, Gemini wasn’t even watching the video: all the responses heard in it were replies to still frames from the footage and to text prompts.

Following the incident, a Google spokesperson told Bloomberg Opinion that the video was made “using still image frames from the footage, and prompting via text”. Oriol Vinyals, who co-leads Gemini, took to X to say that the video was intended to “inspire developers”.

“All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers,” he said in his tweet.

From a consumer perspective, the true assessment of Google Gemini will happen when the model becomes accessible on a wider scale. While Gemini has outperformed other AI models on many benchmarks, it is important to acknowledge that its creators have said from the start that it is not perfect and is still evolving.

Bijin Jose, an Assistant Editor at Indian Express Online in New Delhi, is a technology journalist with a portfolio spanning various prestigious publications. Starting as a citizen journalist with The Times of India in 2013, he transitioned through roles at India Today Digital and The Economic Times, before finding his niche at The Indian Express. With a BA in English from Maharaja Sayajirao University, Vadodara, and an MA in English Literature, Bijin's expertise extends from crime reporting to cultural features. With a keen interest in closely covering developments in artificial intelligence, Bijin provides nuanced perspectives on its implications for society and beyond.
