At its launch, Google enumerated how Gemini models matched or even outperformed GPT-4 on various benchmarks. These included Massive Multitask Language Understanding (MMLU), a widely used metric that assesses an AI model’s ability across a broad range of academic disciplines such as STEM, mathematics, the social sciences, and the humanities.
The research paper shared by Google showed the Ultra version surpassing both GPT-4 and GPT-3.5 on this benchmark. A closer inspection, however, reveals a key technical detail: according to Reddy’s tweet, Google used CoT@32 in place of standard 5-shot prompting to report Gemini’s performance.
“The Gemini MMLU beat is specifically at CoT@32. GPT-4 still beats Gemini for the standard 5-shot – 86.4% vs. 83.7%,” Reddy wrote in her tweet. According to the CEO, 5-shot is the standard measure for this benchmark, in which one prepends five examples to the prompt.
CoT stands for Chain of Thought prompting, which involves walking the model through a series of intermediate reasoning steps so that it generates a rationale before answering; it is aimed at improving the model’s multi-step reasoning capabilities. 5-shot prompting, meanwhile, supplies the model with five worked examples directly in the prompt. From this limited set of examples, the model is expected to recognise the pattern and answer the new question in kind.
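For illustration, here is a rough sketch in Python of how the two prompting styles differ in structure. The questions and answers below are invented examples, not items from the MMLU benchmark itself:

```python
# Illustrative sketch only: how a 5-shot prompt and a chain-of-thought
# (CoT) prompt might be assembled. The examples are made up.

# 5-shot: five worked question-answer pairs are prepended, and the
# model is expected to answer the final question in the same format.
five_shot_prompt = "\n".join(
    f"Q: {q}\nA: {a}" for q, a in [
        ("What is 2 + 2?", "4"),
        ("What is the capital of France?", "Paris"),
        ("Water freezes at what Celsius temperature?", "0"),
        ("How many legs does a spider have?", "8"),
        ("What gas do plants absorb?", "Carbon dioxide"),
    ]
) + "\nQ: What is 7 * 6?\nA:"

# CoT: the prompt shows (or requests) step-by-step reasoning before
# the final answer, which tends to help on multi-step problems.
cot_prompt = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step. Speed = distance / time. "
    "60 km / 1.5 h = 40 km/h. The answer is 40 km/h.\n"
    "Q: What is 15% of 240?\n"
    "A: Let's think step by step."
)
```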
Reddy, in her tweet, claimed that Google devised a different methodology around CoT@32 in order to claim that Gemini is way better than GPT-4. The former AWS/Google staffer said that CoT@32 only beats GPT-4 when “uncertainty routing” is added in. Several other users on X have flagged similar concerns with the benchmarking.
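Google’s technical report describes this scheme as “uncertainty-routed chain-of-thought”: the model samples many CoT answers and takes a majority vote only when the samples sufficiently agree, otherwise falling back to a plain greedy answer. A minimal sketch of that routing logic, with hypothetical stand-in functions for the model calls and an illustrative threshold that is not Google’s actual value:

```python
from collections import Counter

def uncertainty_routed_cot(sample_cot_answer, greedy_answer, k=32, threshold=0.7):
    """Sketch of uncertainty-routed CoT@k as described in Gemini's
    technical report. `sample_cot_answer` and `greedy_answer` are
    hypothetical callables standing in for real model calls; the
    0.7 consensus threshold is illustrative only."""
    # Draw k chain-of-thought samples and tally their final answers.
    votes = Counter(sample_cot_answer() for _ in range(k))
    best, count = votes.most_common(1)[0]
    # If enough samples agree, trust the majority vote...
    if count / k >= threshold:
        return best
    # ...otherwise route to the plain greedy (non-CoT) answer.
    return greedy_answer()
```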
The Gemini video isn’t all that real
While unveiling Gemini, Google demonstrated a video showcasing its multimodal and reasoning capabilities. Ever since the launch, numerous reports have said that the video was not captured in real time. Clint Ehrlich, an attorney and computer scientist as per his X bio, shared a detailed thread in which he claimed that the video demo of Gemini was fake.
According to Ehrlich, three things about the video excited viewers: that Gemini processed video and not just still images, that it inferred context without being spoon-fed prompts, and that it seamlessly spoke and understood conversational audio. Ehrlich said that none of these three facets was real.
Ehrlich said in his thread that Gemini did not process video but still images, that it required detailed prompting, and that it communicates best with written prompts rather than audio. “You won’t get any of this from the viral video, but it’s spelled out in Google’s documentation for developers,” Ehrlich said. He went on to break down the sleight-of-hand coin trick, the geography quiz, and the ball-and-cup shuffling game shown in the video.
In his post, Ehrlich asked whether Google broke the law by showing a fake video, pointing to Federal Trade Commission (FTC) rules on misleading advertising. “Under FTC standards, if a disclaimer is necessary to prevent an ad from being misleading, it must appear *in the ad.* A separate blog post doesn’t cut it,” he said in his tweet.
Claims of deception
According to a report in Bloomberg Opinion, Gemini’s output is much slower than the demo suggests. The video did carry a disclaimer saying the responses had been sped up, the report noted, but that was not the biggest deception: Gemini wasn’t actually watching the video at all, and the responses heard in it were the model’s replies to still frames taken from the footage, paired with text prompts.
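In practice, that workflow amounts to sampling still frames from a recording and sending each frame to the model alongside a written prompt. A rough sketch of the idea, using OpenCV to extract frames and a hypothetical `model_fn` standing in for any real multimodal API call:

```python
import cv2  # OpenCV, used here only to pull still frames from a video file

def ask_about_frames(video_path, prompt, model_fn, every_n=30):
    """Sketch of the workflow Google described: still frames plus text
    prompts, rather than live video. `model_fn(frame, prompt)` is a
    hypothetical stand-in for a real multimodal model call."""
    capture = cv2.VideoCapture(video_path)
    replies, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video
            break
        if index % every_n == 0:  # roughly one frame per second at 30 fps
            replies.append(model_fn(frame, prompt))
        index += 1
    capture.release()
    return replies
```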
Following the incident, a Google spokesperson told Bloomberg Opinion that the video was made “using still image frames from the footage, and prompting via text”. Oriol Vinyals, who co-leads Gemini, took to X to claim that the video was intended to “inspire developers”.
“All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers,” he said in his tweet.
From a consumer perspective, the true assessment of Google Gemini will come when the model becomes accessible on a wider scale. While Gemini has outperformed other AI models on many benchmarks, it is important to acknowledge that its creators have said from the start that it is not perfect and is still evolving.