
What is OpenAI’s GPT-4 Vision and how can it help you interpret images, charts?

From handwritten notes to old, decrepit manuscripts, GPT-4 Vision from OpenAI makes it easy to analyse images and documents. Here is how it works.

GPT-4 Vision has been considered OpenAI’s step forward towards making its chatbot multimodal, an AI model with a combination of image, text, and audio as inputs. (Photo: OpenAI)

Since its launch, OpenAI’s ChatGPT has evolved by leaps and bounds. Churning out text is no longer its only function; it can also create images from natural language prompts, thanks to the integration of DALL-E.

While image generation is one thing, there could be times when one wants to decipher an image of an old pamphlet, or a page from a book. Manually analysing the image may be difficult and time-consuming, and this is where GPT-4 Vision comes in handy.

In September 2023, OpenAI launched two new features for interacting with GPT-4: the ability to ask questions about images and the ability to use speech as an input to a query. In November, OpenAI announced API access to GPT-4 with Vision.


Here is a look at the underlying technology and limitations of GPT-4 Vision.

What is GPT-4 Vision?

GPT-4 with Vision, also referred to as GPT-4V, allows users to instruct GPT-4 to analyse image inputs. “Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development,” according to the research paper from OpenAI.

GPT-4 Vision has been considered OpenAI’s step forward towards making its chatbot multimodal — an AI model with a combination of image, text, and audio as inputs. It allows users to upload an image as input and ask a question about it. This task is known as visual question answering (VQA). GPT-4 Vision is a Large Multimodal Model or LMM, which is essentially a model that is capable of taking information in multiple modalities like text and images or text and audio and generating responses based on it. It is not the first and only LMM. There are many others such as CogVLM, LLaVA, Kosmos-2, etc. LMMs are also known as Multimodal Large Language Models (MLLMs).

What are its key capabilities?


GPT-4 Vision has some groundbreaking capabilities such as processing visual content including photographs, screenshots, and documents. The latest iteration allows it to perform a slew of tasks such as identifying objects within images, and interpreting and analysing data displayed in graphs, charts, and other visualisations.

GPT-4 Vision can also interpret handwritten and printed text contained within images. This is a significant leap in AI as it, in a way, bridges the gap between visual understanding and textual analysis.

How can GPT-4 Vision help users?

The Indian Express has found that GPT-4 Vision can be a handy tool for researchers, web developers, data analysts, and content creators. With its integration of advanced language modelling with visual capabilities, GPT-4 Vision can help in academic research, especially in interpreting historical documents and manuscripts.

This task is usually time-consuming and carried out by teams of experts. With GPT-4 Vision, documents can be deciphered in a matter of seconds, and users can refine the results over multiple turns to improve accuracy.

A diary used to make notes during meetings. The model immediately organised the entries; however, it misinterpreted three words and a sentence, likely due to illegible handwriting. Overall, it did a fair job of extracting and cleaning the random entries in the diary.
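As a rough illustration of how such a transcription request could be made over the API, here is a hedged sketch: the file name is hypothetical, the model identifier is assumed, and a local image is passed as a base64-encoded data URL.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# "diary_page.jpg" is a hypothetical local scan of handwritten meeting notes.
with open("diary_page.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe the handwritten notes in this image and organise them as bullet points.",
                },
                # Local images can be supplied as a base64 data URL instead of a web URL.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```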

Similarly, with GPT-4 Vision, developers can write code for a website simply from a visual image of the design, which could even be a hand-drawn sketch: the model can take a design on paper and turn it into code for a website. Data interpretation is another key area where the model can work wonders, letting users unlock insights from visuals and graphics.
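A sketch-to-code request follows the same pattern. The example below is only illustrative: the file name is hypothetical, the model identifier is assumed, and the generated HTML would still need a developer's review before use.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# "homepage_sketch.png" is a hypothetical photo of a hand-drawn page layout.
with open("homepage_sketch.png", "rb") as f:
    sketch_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Turn this sketch into a single HTML file with inline CSS that matches the layout.",
                },
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{sketch_b64}"}},
            ],
        }
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)  # the generated HTML, to be reviewed before deployment
```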

A graph of shipment data for the top three gaming console players, alongside ChatGPT's response. The model was also asked to analyse this chart.

Lastly, the combination of GPT-4 Vision and DALL-E 3 can be beneficial for content creators to put together creative posts for social media.

What are the limitations of GPT-4 Vision?

OpenAI has acknowledged that while GPT-4 is a significant leap in accuracy and reliability, it is not perfect. The model can make mistakes, so it is advisable to always verify its output. GPT-4 Vision can also reinforce social biases and worldviews, according to its maker.

The model has been trained to avoid identifying specific individuals in images, a ‘refusal’ behaviour that OpenAI says is by design. The company has advised against using it for tasks that require precise scientific, medical, or sensitive content analysis, owing to its limitations and inconsistencies.

Bijin Jose, an Assistant Editor at Indian Express Online in New Delhi, is a technology journalist with a portfolio spanning various prestigious publications. Starting as a citizen journalist with The Times of India in 2013, he transitioned through roles at India Today Digital and The Economic Times, before finding his niche at The Indian Express. With a BA in English from Maharaja Sayajirao University, Vadodara, and an MA in English Literature, Bijin's expertise extends from crime reporting to cultural features. With a keen interest in closely covering developments in artificial intelligence, Bijin provides nuanced perspectives on its implications for society and beyond.
