Artificial intelligence is making strides across industries, and San Francisco-based OpenAI has been leading the pack ever since it announced ChatGPT in November 2022. The runaway success of ChatGPT triggered a race among tech giants to bring the best of generative AI into their products and services, but subsequent upgrades and modifications from OpenAI have helped it stay ahead of its peers.
On September 25, the Sam Altman-led company announced voice and image capabilities for its sensational chatbot. The new features make for a more intuitive interface, letting users hold voice conversations with the chatbot or share images with it. This is the first time OpenAI has offered anything like this in ChatGPT.
“Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while travelling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow-up questions for a step-by-step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you,” said the company in a blog post announcing the rollout of the feature to ChatGPT Plus and Enterprise users. Meanwhile, voice features are now available on iOS and Android.
In July, Google, in a bid to stay ahead of Microsoft-backed OpenAI, Anthropic and many others, introduced multi-modality in its chatbot, Google Bard. The updates to Bard included image analysis, different response styles and support for more languages. However, with ChatGPT Vision, OpenAI has once again shown that it is a pioneer in AI innovation and a formidable force. The excitement around ChatGPT’s new features is similar to what was witnessed in November 2022, when the chatbot first entered the public consciousness.
Why is it a big deal?
ChatGPT with vision isn’t available to everyone yet, but the people who have access are demonstrating some mind-blowing capabilities. These make it one of the most significant AI product announcements in recent times. Users are already discovering a wide range of ways to use the new tool, and there are plenty of use cases to try once ChatGPT with vision becomes available to you.
Visual research with ChatGPT vision
AI enthusiast Rowan Cheung shared an image of a cave with ChatGPT and asked where it was located. ChatGPT responded: “…The image appears to be taken from inside a cave overlooking a coastline with a distinctively curving road. Based on the scenery and the characteristics of the landscape, it strongly resembles the view from Makapu’u Point on the island of Oahu in Hawaii…”
The accurate recognition prompted Cheung to tweet, “ChatGPT image recognition can find hidden gems.” Other users have shared similar demonstrations on Twitter, from asking for locations to identifying animals in a shot. So far, ChatGPT Vision seems to be performing well.
With the feature integrated into the mobile apps, this is a use case that millions of users are likely to adopt. It has every chance of becoming a standard part of travel in future. Imagine seeing something, pointing ChatGPT at it and asking “What is that?” or saying “Tell me about that.”
ChatGPT Vision for interior design
Another AI expert, Pietro Schirano, has run numerous experiments with ChatGPT Vision. In one of his tweets, Schirano posted a picture of his room and asked how he could improve it. ChatGPT offered a number of suggestions, covering colour, plants, lighting, art and more.
The response also appears to draw on custom instructions, the feature that lets users give ChatGPT more information about themselves so that it has context when it responds to future queries. This is evident in the bot’s art suggestion, where it says, “Given your background in classical studies and art, perhaps adding some artwork on the walls could be a great personal touch. They could be prints of classical artworks or something contemporary to create a blend of old and new.”
ChatGPT Vision as an expert developer
In his demonstrations, Schirano also showed how ChatGPT Vision can build websites and write code. Using GPT-4 Vision, he went from an image to a live website in less than a minute. In a video he shared, he gives ChatGPT a photo of an example UI and asks it to replicate the design without skipping anything; the chatbot writes code that he is then able to export and open in an IDE almost instantly.
Another user, McKay Wrigley, did something similar when he gave the bot a screenshot of a SaaS dashboard and asked it to write the code for it. ChatGPT Vision went from a screenshot to a working prototype within minutes. Wrigley later demonstrated how a picture of a team’s whiteboard session can serve as a prompt for ChatGPT to write code. The video got close to 10 million views.
Reducing the gap between ideas and execution
Reading and explaining diagrams is a revolutionary aspect of ChatGPT Vision. A user identified as Sean Spriggens tested it with an unbelievably dense diagram, apparently from the Pentagon, titled ‘Integrated Defence Acquisition, Technology, and Logistics Life Cycle Management System’. The diagram shared by Spriggens has over 3,000 words and hundreds of boxes spread across the page, yet ChatGPT was able to make sense of it. Other diagrams carry entirely different types of information.
For instance, another user, Marco Moscorro, posted a schematic of the electronics in an Arduino design. ChatGPT with Vision instantly recognised it as an electronic circuit and effortlessly explained how the different components were interconnected and how they worked.
This is also an incredible use case for education. With Vision, it is not just about the initial results; users can also interact with the chatbot and ask for further clarification on the diagrams they are exploring. Essentially, OpenAI is enabling a dialogue between man and machine. This has a flip side too, as demonstrated by another AI expert, Peter Yang. He shared an image of a math test and asked ChatGPT for the answers. The chatbot responded with accurate answers.
“Kids will never do homework again,” tweeted Yang, with an image showing ChatGPT’s response. In light of this, experts feel that if teachers can design exercises that are genuinely valuable for children and that ChatGPT cannot perform, those tests are likely to be more valuable in education.
More on ChatGPT’s new features
The new voice and image functionality offers a more intuitive interface, as users can now converse with the chatbot by voice or simply by showing it images. These dynamic interactions are seen as a landmark in AI because they can come in handy in everyday situations, such as discussing places to see or getting dinner suggestions based on the ingredients in the kitchen. In addition, the new text-to-speech model generates human-like audio.
When it comes to efficacy, the web browsing feature is not consistently accurate; ChatGPT with vision, however, has impressed with its real-world applications. A recent research paper also demonstrated capabilities such as identifying manufacturing defects, producing medical scan reports and assessing vehicle damage. Despite occasional errors, GPT-4 with vision marks a significant shift towards a visual AI assistant. Users can try the vision features through Bing Chat and GPT-4 to enhance their tasks.
While these features are impressive, OpenAI is moving ahead with caution, emphasising safety and risk mitigation as it deploys them. The vision-based models have been tested extensively. On the accessibility front, collaborations such as ‘Be My Eyes’ for the visually impaired take the feature’s utility to new heights. The AI powerhouse has stressed transparency while acknowledging possible inaccuracies with images of people. The company has said that it has taken measures to ensure user privacy.