Image generation is one thing, but there are also times when one wants to decipher an image of an old pamphlet or a page from a book. Manually analysing such an image can be difficult and time-consuming, and this is where GPT-4 Vision comes in handy.
In September 2023, OpenAI launched two new features that assist users in interacting with GPT-4, especially the ability to ask questions about images and to use speech as an input to a query. In November, OpenAI announced API access to GPT-4 with Vision.
Here is a look at the underlying technology and limitations of GPT-4 Vision.
What is GPT-4 Vision?
GPT-4 with Vision, also referred to as GPT-4V, allows users to instruct GPT-4 to analyse image inputs. “Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development,” according to the research paper from OpenAI.
GPT-4 Vision has been considered OpenAI’s step towards making its chatbot multimodal, meaning an AI model that accepts a combination of image, text, and audio as inputs. It allows users to upload an image as input and ask a question about it, a task known as visual question answering (VQA). GPT-4 Vision is a Large Multimodal Model, or LMM, which is essentially a model capable of taking in information in multiple modalities, such as text and images or text and audio, and generating responses based on them. It is not the first or only LMM; others include CogVLM, LLaVA, and Kosmos-2. LMMs are also known as Multimodal Large Language Models (MLLMs).
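In practice, a VQA request through the API pairs a text question with an image inside a single chat message. The sketch below builds such a request payload in Python, assuming OpenAI's chat completions message format for image inputs; the question and image URL are illustrative, not from the article.

```python
def build_vqa_messages(question: str, image_url: str) -> list:
    """Build a chat message list pairing a text question with an image,
    following the content-parts format used for image inputs."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vqa_messages(
    "What objects appear in this photograph?",
    "https://example.com/pamphlet.jpg",  # hypothetical image URL
)

# The actual API call would look roughly like this (requires an API key,
# so it is shown here only as a hedged illustration):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4-vision-preview", messages=messages
# )
# print(response.choices[0].message.content)
```

The key point is that the image travels alongside the text in the same message, which is what lets the model answer questions grounded in the visual content.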
What are its key capabilities?
GPT-4 Vision has some groundbreaking capabilities such as processing visual content including photographs, screenshots, and documents. The latest iteration allows it to perform a slew of tasks such as identifying objects within images, and interpreting and analysing data displayed in graphs, charts, and other visualisations.
GPT-4 Vision can also interpret handwritten and printed text contained within images. This is a significant leap in AI as it, in a way, bridges the gap between visual understanding and textual analysis.
How can GPT-4 Vision help users?
The Indian Express has found that GPT-4 Vision can be a handy tool for researchers, web developers, data analysts, and content creators. With its integration of advanced language modelling with visual capabilities, GPT-4 Vision can help in academic research, especially in interpreting historical documents and manuscripts.
This task is usually time-consuming and carried out by teams of experts. But with GPT-4 Vision, documents can be deciphered in a matter of seconds. Users can refine the results multiple times to improve accuracy.
The Indian Express tested the model on a diary used to take notes during meetings. The model immediately organised the entries; however, it misinterpreted three words and a sentence in the notes, likely due to illegible handwriting. Overall, the model did a fair job of extracting and cleaning up the random entries in the diary.
Similarly, with GPT-4 Vision, developers can now write code for a website simply from a visual image of the design, which could even be a sketch. The model is capable of working from a design on paper and creating code for a website. Data interpretation is another key area where the model can work wonders, as it lets one unlock insights from visuals and graphics.
The model was also asked to analyse a graph of the top three gaming console players’ shipment data.
Lastly, the combination of GPT-4 Vision and DALL-E 3 can be beneficial for content creators to put together creative posts for social media.
What are the limitations of GPT-4 Vision?
OpenAI has acknowledged that while GPT-4 is a significant leap in accuracy and reliability, it is not always perfect. The model can make mistakes, and hence it is advisable to always verify its output. GPT-4 Vision also continues to reinforce social biases and worldviews, according to its maker.
The model has been trained to avoid identifying specific individuals in images, a deliberate behaviour OpenAI calls ‘refusal’. The company has advised against its use for tasks that require precise scientific, medical, or sensitive content analysis due to its limitations and inconsistencies.