What are dolphins saying to each other? What are they trying to tell us? Recent advancements in machine learning and large language models (LLMs) could be moving us closer to achieving the long-elusive goal of interspecies communication.
Google on Monday, April 14, announced that its foundational AI model called DolphinGemma will be made accessible to other researchers in the coming months. The tech giant claimed that this ‘open’ AI model has been trained to generate “novel dolphin-like sound sequences”, and will one day facilitate interactive communication between humans and dolphins.
“By identifying recurring sound patterns, clusters and reliable sequences, the model can help researchers uncover hidden structures and potential meanings within the dolphins’ natural communication — a task previously requiring immense human effort,” Google said in a blog post.
“Eventually, these patterns, augmented with synthetic sounds created by the researchers to refer to objects with which the dolphins like to play, may establish a shared vocabulary with the dolphins for interactive communication,” it added.
The AI model has been developed by Google in collaboration with AI researchers at Georgia Institute of Technology in Atlanta, US. It has been trained on datasets collected from field researchers working with the Wild Dolphin Project (WDP), a non-profit research organisation.
DolphinGemma is a lightweight model with 400 million parameters, small enough to run on the Pixel phones that WDP researchers use underwater, as per Google.
Under the hood, the model relies on Google’s SoundStream tokenizer to convert dolphin sounds into a string of discrete, manageable units called tokens. Its architecture borrows from Google’s Gemma series of lightweight, ‘open’ AI models.
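SoundStream is a real Google neural audio codec, but the exact tokenisation pipeline behind DolphinGemma has not been published. The sketch below only illustrates the general idea of turning continuous audio into a sequence of discrete tokens, using a simple vector-quantisation step over fixed-length frames as a stand-in for SoundStream.

```python
import numpy as np

# Toy stand-in for a neural audio tokenizer such as SoundStream:
# slice a waveform into short frames and map each frame to the nearest
# entry in a codebook, producing a sequence of integer token IDs.

def frame_audio(waveform, frame_len=320):
    """Split a 1-D waveform into non-overlapping frames."""
    n_frames = len(waveform) // frame_len
    return waveform[: n_frames * frame_len].reshape(n_frames, frame_len)

def tokenize(frames, codebook):
    """Assign each frame the ID of its nearest codebook vector."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)        # 1 second of fake audio at 16 kHz
codebook = rng.standard_normal((256, 320))   # 256 codewords (random here, learned in practice)

tokens = tokenize(frame_audio(waveform), codebook)
print(tokens[:10])  # a short sequence of integer token IDs
```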
DolphinGemma is an audio-in and audio-out model, meaning it processes sound rather than text and hence cannot respond to written prompts.
Similar to how LLMs predict the next word or token in a sentence in human language, DolphinGemma analyses “sequences of natural dolphin sounds to identify patterns, structure and ultimately predict the likely subsequent sounds in a sequence,” Google said.
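DolphinGemma’s weights and code are not yet public, so the fragment below is only a toy illustration of the same autoregressive idea: it learns which token tends to follow which from a made-up token sequence using bigram counts, whereas the real model uses a Gemma-style transformer.

```python
from collections import Counter, defaultdict

# Toy next-token predictor over an already tokenized sound sequence:
# count which token follows which, then predict the likeliest continuation.

def train_bigrams(token_sequence):
    follows = defaultdict(Counter)
    for current, nxt in zip(token_sequence, token_sequence[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, current_token):
    if current_token not in follows:
        return None
    return follows[current_token].most_common(1)[0][0]

# Hypothetical tokenized whistle sequence (integer token IDs).
sequence = [12, 7, 7, 41, 12, 7, 41, 12, 7, 7, 41]
model = train_bigrams(sequence)
print(predict_next(model, 12))  # -> 7, the token that most often follows 12
```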
The company revealed that it plans on releasing DolphinGemma as an open model so that other researchers can fine-tune the model based on the sounds of various cetacean species such as bottlenose and spinner dolphins.
According to Google, the AI model was trained on WDP’s dataset of sounds made by the wild Atlantic spotted dolphin. This specific community of dolphins is said to be found in the Bahamas, an island country in the Caribbean.
The dataset used to train the AI model originated from underwater video footage and audio recordings collected over decades. This data was labelled by WDP researchers to specify individual dolphin identities as well as their life histories and observed behaviours.
Instead of making surface observations, WDP researchers went underwater to gather the data as they found that it helped them directly link the sounds made by the dolphins to their specific behaviours.
The DolphinGemma training dataset comprises unique dolphin sounds such as signature whistles (used by mothers to call their calves), burst-pulse squawks (usually heard when two dolphins are fighting), and click buzzes (often heard during courtships or chasing sharks).
In order to establish a shared vocabulary of dolphin sounds, Google said it teamed up with Georgia Tech researchers to develop the CHAT system.
CHAT is short for Cetacean Hearing Augmentation Telemetry. It is an underwater device designed to link AI-generated dolphin sounds with specific objects that dolphins enjoy, such as seagrass.
Google said that the CHAT tool enables two-way interaction between humans and dolphins by accurately picking up a dolphin’s whistle underwater, matching it against the whistle sequences in its training dataset, and informing the human researcher (via underwater headphones) which object the dolphin has asked for.
This would enable the researcher to respond quickly by offering the correct object to the dolphin, reinforcing the connection between them, Google said.
“By demonstrating the system between humans, researchers hope the naturally curious dolphins will learn to mimic the whistles to request these items. Eventually, as more of the dolphins’ natural sounds are understood, they can also be added to the system,” the company added.
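Google has not released the CHAT implementation, so the following is only a rough sketch of the matching-and-notification loop described above: it compares an incoming whistle against a bank of hypothetical whistle templates, each tied to an object, using normalised cross-correlation, and reports the best match. All names and signals here are purely illustrative.

```python
import numpy as np

# Toy CHAT-style matching loop: compare an incoming whistle recording
# against synthetic whistle templates tied to objects, then report the
# best match so the researcher can hand over that object.

def normalized_xcorr(a, b):
    """Peak normalised cross-correlation between two 1-D signals."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return np.max(np.correlate(a, b, mode="full")) / len(a)

def match_whistle(recording, templates, threshold=0.5):
    scores = {name: normalized_xcorr(recording, tpl) for name, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

rng = np.random.default_rng(1)
templates = {
    "seagrass": np.sin(np.linspace(0, 40, 800)),  # hypothetical whistle template
    "scarf": np.sin(np.linspace(0, 90, 800)),     # hypothetical whistle template
}
incoming = templates["seagrass"] + 0.3 * rng.standard_normal(800)

obj = match_whistle(incoming, templates)
if obj:
    print(f"Whistle matched: hand the dolphin the {obj}")  # relayed via headphones
```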
Google said its Pixel 6 phones had already shown they were capable of processing dolphin sounds in real time. It said the next generation of the CHAT system would integrate specific speaker and microphone functions with Pixel 9 smartphones and be upgraded with advanced processing “to run both deep learning models and template matching algorithms simultaneously.”
“Using Pixel smartphones dramatically reduces the need for custom hardware, improves system maintainability, lowers power consumption and shrinks the device’s cost and size — crucial advantages for field research in the open ocean,” the tech giant said.
For several years now, researchers have been exploring ways to use AI and machine learning algorithms to make sense of animal sounds.
They have had success using automatic detection algorithms based on convolutional neural networks to pick out animal sounds and categorise them by their acoustic characteristics.
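The article does not name the specific detectors involved; as a generic sketch of this approach, the snippet below defines a small convolutional network in PyTorch that takes a spectrogram patch and scores it against a handful of hypothetical call categories.

```python
import torch
import torch.nn as nn

# Generic sketch of a CNN-based call classifier: input is a spectrogram
# patch (1 channel, 64 frequency bins x 128 time frames), output is a
# score for each call category.

class CallClassifier(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 16 * 32, n_classes)

    def forward(self, spectrogram):
        x = self.features(spectrogram)
        return self.head(x.flatten(start_dim=1))

# Fake batch of spectrogram patches; categories might be
# "signature whistle", "burst-pulse squawk", "click buzz".
model = CallClassifier(n_classes=3)
batch = torch.randn(4, 1, 64, 128)
print(model(batch).shape)  # torch.Size([4, 3])
```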
Deep neural networks, on the other hand, have made it possible to find hidden structures in sequences of animal vocalisations. As a result, AI models trained on examples of animal sounds can generate a unique, synthetic version of those sounds. Models of this type are known as supervised learning models.
Supervised learning models can generate made-up animal sounds because they are trained on large numbers of examples labelled and annotated by humans, with details such as the receiver of the call, the context in which it was made, and the behaviour observed alongside it. But what about animal sounds that are fed to the AI model without having been labelled?
This is where self-supervised or unsupervised learning models come in. They are trained on vast amounts of unlabelled data, which in the case of LLMs is pulled from books, websites, social media feeds, and anything else on the internet. Such models learn to sort the data and predict patterns in it entirely on their own.
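As a minimal illustration of a model finding structure in unlabelled data on its own, the sketch below clusters made-up acoustic feature vectors with k-means. Real self-supervised systems use far richer features and objectives, but the principle of learning without labels is the same.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised sketch: group unlabelled call recordings by similarity,
# without being told what any of them mean. Each row stands in for an
# acoustic feature vector (e.g. summary statistics of a spectrogram).

rng = np.random.default_rng(42)
calls = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 8)),  # one hidden call type
    rng.normal(loc=3.0, scale=0.5, size=(50, 8)),  # another hidden call type
])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(calls)
print(np.bincount(clusters))  # roughly [50, 50]: two groups found without labels
```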
They are considered to be more advantageous for translating animal sounds into human language, since they might not be limited by what is already known about animal communication. Researchers also expect that the vast training datasets may contain animal sounds that have been previously inaccessible.
Google’s DolphinGemma appears to fall into the category of semi-supervised learning models, which are trained on both labelled and unlabelled examples.
Yet, there are several major challenges in developing an AI chatbot that lets humans talk to animals. For instance, researchers have pointed out that animals likely communicate using more than just sound and incorporate other senses such as touch and smell.
Validating AI-generated animal sounds could also be challenging since there is still a lot that humans don’t understand about animal communication. Developing AI models for this purpose will also require a lot more data.