
Apple’s ‘Ferret-UI’ AI model could understand what’s on your screen in a new way

The research paper claims Ferret-UI outperforms both GPT-4V and other UI-focused MLLMs.

There are several potential use cases for such tech, including supercharged voice assistants. (Image: Apple research paper)

Apple has been working on an AI model that could change how we interact with apps and potentially let Siri automate a lot of tasks.

The system, dubbed Ferret-UI, is what’s known as a Multimodal Large Language Model (MLLM). This means it’s an AI that can understand more than just text – it perceives images, videos, and other forms of media. Ferret-UI’s multimodal capabilities, however, have been designed specifically so the model can make sense of mobile app screens.

This is a big deal considering other MLLMs find it hard to understand content on smartphone displays due to their elongated aspect ratios and the fact that each app screen is packed with tiny icons, buttons, and menus. Ferret-UI, however, has been meticulously schooled to comprehend these user interfaces. The research paper says the model is trained on a vast collection of UI tasks like icon recognition, text spotting, and widget listing.

And it seems to be working. Apple claims Ferret-UI outperforms both GPT-4V and other UI-focused MLLMs at understanding smartphone apps.

(Image: Apple research paper)

So how could this AI benefit us? The paper is slim on those details, but such a model has several potential uses. For starters, app developers could use it to test how intuitive their creations really are before unleashing them on the public.

Accessibility could also get a major boost. The AI could act as an advanced screen reader for those with visual impairments, describing app interfaces and available actions far more intelligently than basic text-to-speech tools. It could also summarise what’s on a screen, list the available actions, and even anticipate what a user might want to do next.

But perhaps the biggest draw is a supercharged Siri that could potentially navigate through apps, performing actions for you. The Rabbit R1 attempts to do something similar by automating the process of, say, hailing a cab through the Uber app. Siri could perform similar tasks with nothing but voice commands.


So far, the scope of voice assistants has been rather limited, since developers have to manually plug them into their apps’ APIs. But with tech like this, that scope could be broadened significantly, because assistants would be capable of understanding apps the way humans do and performing actions directly on the screen.

Of course, this is all theory for now, and it’s up to Apple to show us how it can leverage this research.
