
What is OpenAI’s GPT-4 Vision and how can it help you interpret images, charts?

From handwritten notes to old, decrepit manuscripts, GPT-4 Vision from OpenAI makes it easy to analyse images and documents. Here is how it works.

GPT-4 Vision has been considered OpenAI’s step forward towards making its chatbot multimodal, an AI model with a combination of image, text, and audio as inputs. (Photo: OpenAI)

Since its launch, OpenAI’s ChatGPT has evolved by leaps and bounds. Churning out text is no longer its only function; it can also create images from natural language prompts, thanks to the integration of DALL-E.

While image generation is one thing, there could be times when one wants to decipher an image of an old pamphlet, or a page from a book. Manually analysing the image may be difficult and time-consuming, and this is where GPT-4 Vision comes in handy.

In September 2023, OpenAI launched two new features for interacting with GPT-4: the ability to ask questions about images and the ability to use speech as an input to a query. In November, OpenAI announced API access to GPT-4 with Vision.


Here is a look at the underlying technology and limitations of GPT-4 Vision.

What is GPT-4 Vision?

GPT-4 with Vision, also referred to as GPT-4V, allows users to instruct GPT-4 to analyse image inputs. “Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development,” according to the research paper from OpenAI.

GPT-4 Vision has been considered OpenAI’s step forward towards making its chatbot multimodal — an AI model with a combination of image, text, and audio as inputs. It allows users to upload an image as input and ask a question about it. This task is known as visual question answering (VQA). GPT-4 Vision is a Large Multimodal Model or LMM, which is essentially a model that is capable of taking information in multiple modalities like text and images or text and audio and generating responses based on it. It is not the first and only LMM. There are many others such as CogVLM, LLaVA, Kosmos-2, etc. LMMs are also known as Multimodal Large Language Models (MLLMs).

What are its key capabilities?


GPT-4 Vision has some groundbreaking capabilities such as processing visual content including photographs, screenshots, and documents. The latest iteration allows it to perform a slew of tasks such as identifying objects within images, and interpreting and analysing data displayed in graphs, charts, and other visualisations.

GPT-4 Vision can also interpret handwritten and printed text contained within images. This is a significant leap in AI as it, in a way, bridges the gap between visual understanding and textual analysis.

How can GPT-4 Vision help users?

The Indian Express has found that GPT-4 Vision can be a handy tool for researchers, web developers, data analysts, and content creators. With its integration of advanced language modelling with visual capabilities, GPT-4 Vision can help in academic research, especially in interpreting historical documents and manuscripts.

This task is usually time-consuming and carried out by teams of experts. With GPT-4 Vision, documents can be deciphered in a matter of seconds, and users can refine the results over multiple turns to improve accuracy.

A diary used to make notes during meetings. The model immediately organised the entries; however, it misinterpreted three words and a sentence, likely due to illegible handwriting. Overall, it did a fair job of extracting and cleaning the random entries in the diary.
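As a rough illustration of how such a transcription request could be made over the API, here is a hedged sketch: the file name is hypothetical, the model identifier is assumed, and a local image is passed as a base64-encoded data URL.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# "diary_page.jpg" is a hypothetical local scan of handwritten meeting notes.
with open("diary_page.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe the handwritten notes in this image and organise them as bullet points.",
                },
                # Local images can be supplied as a base64 data URL instead of a web URL.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```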

Similarly, with GPT-4 Vision, developers can write code for a website simply from a visual image of the design, which could even be a hand-drawn sketch: the model can take a design on paper and turn it into code for a website. Data interpretation is another key area where the model can work wonders, letting users unlock insights from visuals and graphics.
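A sketch-to-code request follows the same pattern. The example below is only illustrative: the file name is hypothetical, the model identifier is assumed, and the generated HTML would still need a developer's review before use.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# "homepage_sketch.png" is a hypothetical photo of a hand-drawn page layout.
with open("homepage_sketch.png", "rb") as f:
    sketch_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Turn this sketch into a single HTML file with inline CSS that matches the layout.",
                },
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{sketch_b64}"}},
            ],
        }
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)  # the generated HTML, to be reviewed before deployment
```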

A graph of shipment data for the top three gaming console players, alongside ChatGPT's response. The model was also asked to analyse this chart.

Lastly, the combination of GPT-4 Vision and DALL-E 3 can be beneficial for content creators to put together creative posts for social media.

What are the limitations of GPT-4 Vision?

OpenAI has acknowledged that while GPT-4 is a significant leap in accuracy and reliability, it is not perfect. The model can make mistakes, so it is advisable to always verify its output. GPT-4 Vision can also reinforce social biases and worldviews, according to its maker.

The model has been trained to avoid identifying specific individuals in images, a ‘refusal’ behaviour that OpenAI says is by design. The company has advised against using it for tasks that require precise scientific, medical, or sensitive content analysis, owing to its limitations and inconsistencies.

Bijin Jose, an Assistant Editor at Indian Express Online in New Delhi, is a technology journalist with a portfolio spanning various prestigious publications. Starting as a citizen journalist with The Times of India in 2013, he transitioned through roles at India Today Digital and The Economic Times, before finding his niche at The Indian Express. With a BA in English from Maharaja Sayajirao University, Vadodara, and an MA in English Literature, Bijin's expertise extends from crime reporting to cultural features. With a keen interest in closely covering developments in artificial intelligence, Bijin provides nuanced perspectives on its implications for society and beyond.
